
[V6,13/13] xfs: Stress test with bmap_alloc_minlen_extent error tag enabled

Message ID 20210309050124.23797-14-chandanrlinux@gmail.com (mailing list archive)
State Accepted, archived
Series xfs: Tests to verify inode fork extent count overflow detection

Commit Message

Chandan Babu R March 9, 2021, 5:01 a.m. UTC
This commit adds a stress test that executes fsstress with
bmap_alloc_minlen_extent error tag enabled.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 tests/xfs/537     | 84 +++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/537.out |  7 ++++
 tests/xfs/group   |  1 +
 3 files changed, 92 insertions(+)
 create mode 100755 tests/xfs/537
 create mode 100644 tests/xfs/537.out

Comments

Allison Henderson March 10, 2021, 11:49 p.m. UTC | #1
On 3/8/21 10:01 PM, Chandan Babu R wrote:
> This commit adds a stress test that executes fsstress with
> bmap_alloc_minlen_extent error tag enabled.
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Ok makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   tests/xfs/537     | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>   tests/xfs/537.out |  7 ++++
>   tests/xfs/group   |  1 +
>   3 files changed, 92 insertions(+)
>   create mode 100755 tests/xfs/537
>   create mode 100644 tests/xfs/537.out
> 
> diff --git a/tests/xfs/537 b/tests/xfs/537
> new file mode 100755
> index 00000000..77fa60d9
> --- /dev/null
> +++ b/tests/xfs/537
> @@ -0,0 +1,84 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2021 Chandan Babu R.  All Rights Reserved.
> +#
> +# FS QA Test 537
> +#
> +# Execute fsstress with bmap_alloc_minlen_extent error tag enabled.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/inject
> +. ./common/populate
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_debug
> +_require_test_program "punch-alternating"
> +_require_xfs_io_error_injection "bmap_alloc_minlen_extent"
> +
> +echo "Format and mount fs"
> +_scratch_mkfs_sized $((1024 * 1024 * 1024)) >> $seqres.full
> +_scratch_mount >> $seqres.full
> +
> +bsize=$(_get_file_block_size $SCRATCH_MNT)
> +
> +echo "Consume free space"
> +fillerdir=$SCRATCH_MNT/fillerdir
> +nr_free_blks=$(stat -f -c '%f' $SCRATCH_MNT)
> +nr_free_blks=$((nr_free_blks * 90 / 100))
> +
> +_fill_fs $((bsize * nr_free_blks)) $fillerdir $bsize 0 >> $seqres.full 2>&1
> +
> +echo "Create fragmented filesystem"
> +for dentry in $(ls -1 $fillerdir/); do
> +	$here/src/punch-alternating $fillerdir/$dentry >> $seqres.full
> +done
> +
> +echo "Inject bmap_alloc_minlen_extent error tag"
> +_scratch_inject_error bmap_alloc_minlen_extent 1
> +
> +echo "Scale fsstress args"
> +args=$(_scale_fsstress_args -p $((LOAD_FACTOR * 75)) -n $((TIME_FACTOR * 1000)))
> +
> +echo "Execute fsstress in background"
> +$FSSTRESS_PROG -d $SCRATCH_MNT $args \
> +		 -f bulkstat=0 \
> +		 -f bulkstat1=0 \
> +		 -f fiemap=0 \
> +		 -f getattr=0 \
> +		 -f getdents=0 \
> +		 -f getfattr=0 \
> +		 -f listfattr=0 \
> +		 -f mread=0 \
> +		 -f read=0 \
> +		 -f readlink=0 \
> +		 -f readv=0 \
> +		 -f stat=0 \
> +		 -f aread=0 \
> +		 -f dread=0 > /dev/null 2>&1
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/537.out b/tests/xfs/537.out
> new file mode 100644
> index 00000000..19633a07
> --- /dev/null
> +++ b/tests/xfs/537.out
> @@ -0,0 +1,7 @@
> +QA output created by 537
> +Format and mount fs
> +Consume free space
> +Create fragmented filesystem
> +Inject bmap_alloc_minlen_extent error tag
> +Scale fsstress args
> +Execute fsstress in background
> diff --git a/tests/xfs/group b/tests/xfs/group
> index ba674a58..5c827727 100644
> --- a/tests/xfs/group
> +++ b/tests/xfs/group
> @@ -534,3 +534,4 @@
>   534 auto quick reflink
>   535 auto quick reflink
>   536 auto quick
> +537 auto stress
>
Darrick J. Wong March 22, 2021, 6:54 p.m. UTC | #2
On Tue, Mar 09, 2021 at 10:31:24AM +0530, Chandan Babu R wrote:
> This commit adds a stress test that executes fsstress with
> bmap_alloc_minlen_extent error tag enabled.

Continuing along the theme of watching the magic smoke come out when dir
block size > fs block size, I also observed the following assertion when
running this test:

 XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 687
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 3892 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
 Modules linked in: xfs(O) libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
 CPU: 0 PID: 3892 Comm: fsstress Tainted: G           O      5.12.0-rc4-xfsx #rc4
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
 RIP: 0010:assfail+0x3c/0x40 [xfs]
 Code: d0 d5 41 a0 e8 81 f9 ff ff 8a 1d 5b 44 0e 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 b0 d5 4d a0 e8 93 dc fc e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
 RSP: 0018:ffffc900035bba38 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff88804f204100 RCX: 0000000000000000
 RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa040c157
 RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a
 R10: 000000000000000a R11: f000000000000000 R12: ffff88805920b880
 R13: ffff888003778bb0 R14: 0000000000000000 R15: ffff88800f0f63c0
 FS:  00007fe7b5e2f740(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fe7b6055000 CR3: 0000000053094005 CR4: 00000000001706b0
 Call Trace:
  xfs_dir2_shrink_inode+0x22f/0x270 [xfs]
  xfs_dir2_block_to_sf+0x29a/0x420 [xfs]
  xfs_dir2_block_removename+0x221/0x290 [xfs]
  xfs_dir_removename+0x1a0/0x220 [xfs]
  xfs_dir_rename+0x343/0x3b0 [xfs]
  xfs_rename+0x79e/0xae0 [xfs]
  xfs_vn_rename+0xdb/0x150 [xfs]
  vfs_rename+0x4e2/0x8e0
  do_renameat2+0x393/0x550
  __x64_sys_rename+0x40/0x50
  do_syscall_64+0x2d/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7fe7b5e9800b
 Code: e8 aa ce 0a 00 85 c0 0f 95 c0 0f b6 c0 f7 d8 5d c3 66 0f 1f 44 00 00 b8 ff ff ff ff 5d c3 90 f3 0f 1e fa b8 52 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 51 4e 18 00 f7 d8
 RSP: 002b:00007ffeb526c698 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
 RAX: ffffffffffffffda RBX: 00007ffeb526c970 RCX: 00007fe7b5e9800b
 RDX: 0000000000000000 RSI: 000055d6ccdb9250 RDI: 000055d6ccdb9270
 RBP: 00007ffeb526c980 R08: 0000000000000001 R09: 0000000000000003
 R10: 000055d6cc3b20dc R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000040 R14: 00007ffeb526c970 R15: 00007ffeb526c980
 ---[ end trace 98f99784621d65fe ]---

It looks to me as though we return from xfs_bunmapi having not completed
all the unmapping work, though I can't tell if that's because bunmapi
returned early because it thought it would overflow the extent count; or
some other reason.

OH CRAP, I just realized that xfs_dir2_shrink_inode only calls
xfs_bunmapi once, which means that if the directory block it's removing
is a multi-fsb block, it will remove the last extent map.  It then trips
the assertion, having left the rest of the directory block still mapped.

This is also what's going on when xfs_inactive_symlink_rmt trips the
same ASSERT(done), because the symlink remote block can span multiple
(two?) fs blocks but we only ever call xfs_bunmapi once.

So, no, there's nothing wrong with this test, but it _did_ shake loose
a couple of XFS bugs.  Congratulations!

So... who wants to tackle this one?  This isn't trivial to clean up
because you'll have to clean up all callers of xfs_dir2_shrink_inode to
handle rolling of the transaction, and I bet the only way to fix this is
to use deferred bunmap items to make sure the unmap always completes.

--D

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---
>  tests/xfs/537     | 84 +++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/537.out |  7 ++++
>  tests/xfs/group   |  1 +
>  3 files changed, 92 insertions(+)
>  create mode 100755 tests/xfs/537
>  create mode 100644 tests/xfs/537.out
> 
> diff --git a/tests/xfs/537 b/tests/xfs/537
> new file mode 100755
> index 00000000..77fa60d9
> --- /dev/null
> +++ b/tests/xfs/537
> @@ -0,0 +1,84 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2021 Chandan Babu R.  All Rights Reserved.
> +#
> +# FS QA Test 537
> +#
> +# Execute fsstress with bmap_alloc_minlen_extent error tag enabled.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/inject
> +. ./common/populate
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +_supported_fs xfs
> +_require_scratch
> +_require_xfs_debug
> +_require_test_program "punch-alternating"
> +_require_xfs_io_error_injection "bmap_alloc_minlen_extent"
> +
> +echo "Format and mount fs"
> +_scratch_mkfs_sized $((1024 * 1024 * 1024)) >> $seqres.full
> +_scratch_mount >> $seqres.full
> +
> +bsize=$(_get_file_block_size $SCRATCH_MNT)
> +
> +echo "Consume free space"
> +fillerdir=$SCRATCH_MNT/fillerdir
> +nr_free_blks=$(stat -f -c '%f' $SCRATCH_MNT)
> +nr_free_blks=$((nr_free_blks * 90 / 100))
> +
> +_fill_fs $((bsize * nr_free_blks)) $fillerdir $bsize 0 >> $seqres.full 2>&1
> +
> +echo "Create fragmented filesystem"
> +for dentry in $(ls -1 $fillerdir/); do
> +	$here/src/punch-alternating $fillerdir/$dentry >> $seqres.full
> +done
> +
> +echo "Inject bmap_alloc_minlen_extent error tag"
> +_scratch_inject_error bmap_alloc_minlen_extent 1
> +
> +echo "Scale fsstress args"
> +args=$(_scale_fsstress_args -p $((LOAD_FACTOR * 75)) -n $((TIME_FACTOR * 1000)))
> +
> +echo "Execute fsstress in background"
> +$FSSTRESS_PROG -d $SCRATCH_MNT $args \
> +		 -f bulkstat=0 \
> +		 -f bulkstat1=0 \
> +		 -f fiemap=0 \
> +		 -f getattr=0 \
> +		 -f getdents=0 \
> +		 -f getfattr=0 \
> +		 -f listfattr=0 \
> +		 -f mread=0 \
> +		 -f read=0 \
> +		 -f readlink=0 \
> +		 -f readv=0 \
> +		 -f stat=0 \
> +		 -f aread=0 \
> +		 -f dread=0 > /dev/null 2>&1
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/537.out b/tests/xfs/537.out
> new file mode 100644
> index 00000000..19633a07
> --- /dev/null
> +++ b/tests/xfs/537.out
> @@ -0,0 +1,7 @@
> +QA output created by 537
> +Format and mount fs
> +Consume free space
> +Create fragmented filesystem
> +Inject bmap_alloc_minlen_extent error tag
> +Scale fsstress args
> +Execute fsstress in background
> diff --git a/tests/xfs/group b/tests/xfs/group
> index ba674a58..5c827727 100644
> --- a/tests/xfs/group
> +++ b/tests/xfs/group
> @@ -534,3 +534,4 @@
>  534 auto quick reflink
>  535 auto quick reflink
>  536 auto quick
> +537 auto stress
> -- 
> 2.29.2
>
Chandan Babu R March 23, 2021, 5:28 a.m. UTC | #3
On 23 Mar 2021 at 00:24, Darrick J. Wong wrote:
> On Tue, Mar 09, 2021 at 10:31:24AM +0530, Chandan Babu R wrote:
>> This commit adds a stress test that executes fsstress with
>> bmap_alloc_minlen_extent error tag enabled.
>
> Continuing along the theme of watching the magic smoke come out when dir
> block size > fs block size, I also observed the following assertion when
> running this test:
>
>  XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 687
>  ------------[ cut here ]------------
>  WARNING: CPU: 0 PID: 3892 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
>  Modules linked in: xfs(O) libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
>  CPU: 0 PID: 3892 Comm: fsstress Tainted: G           O      5.12.0-rc4-xfsx #rc4
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
>  RIP: 0010:assfail+0x3c/0x40 [xfs]
>  Code: d0 d5 41 a0 e8 81 f9 ff ff 8a 1d 5b 44 0e 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 b0 d5 4d a0 e8 93 dc fc e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
>  RSP: 0018:ffffc900035bba38 EFLAGS: 00010246
>  RAX: 0000000000000000 RBX: ffff88804f204100 RCX: 0000000000000000
>  RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa040c157
>  RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a
>  R10: 000000000000000a R11: f000000000000000 R12: ffff88805920b880
>  R13: ffff888003778bb0 R14: 0000000000000000 R15: ffff88800f0f63c0
>  FS:  00007fe7b5e2f740(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007fe7b6055000 CR3: 0000000053094005 CR4: 00000000001706b0
>  Call Trace:
>   xfs_dir2_shrink_inode+0x22f/0x270 [xfs]
>   xfs_dir2_block_to_sf+0x29a/0x420 [xfs]
>   xfs_dir2_block_removename+0x221/0x290 [xfs]
>   xfs_dir_removename+0x1a0/0x220 [xfs]
>   xfs_dir_rename+0x343/0x3b0 [xfs]
>   xfs_rename+0x79e/0xae0 [xfs]
>   xfs_vn_rename+0xdb/0x150 [xfs]
>   vfs_rename+0x4e2/0x8e0
>   do_renameat2+0x393/0x550
>   __x64_sys_rename+0x40/0x50
>   do_syscall_64+0x2d/0x40
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>  RIP: 0033:0x7fe7b5e9800b
>  Code: e8 aa ce 0a 00 85 c0 0f 95 c0 0f b6 c0 f7 d8 5d c3 66 0f 1f 44 00 00 b8 ff ff ff ff 5d c3 90 f3 0f 1e fa b8 52 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 51 4e 18 00 f7 d8
>  RSP: 002b:00007ffeb526c698 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
>  RAX: ffffffffffffffda RBX: 00007ffeb526c970 RCX: 00007fe7b5e9800b
>  RDX: 0000000000000000 RSI: 000055d6ccdb9250 RDI: 000055d6ccdb9270
>  RBP: 00007ffeb526c980 R08: 0000000000000001 R09: 0000000000000003
>  R10: 000055d6cc3b20dc R11: 0000000000000246 R12: 0000000000000000
>  R13: 0000000000000040 R14: 00007ffeb526c970 R15: 00007ffeb526c980
>  ---[ end trace 98f99784621d65fe ]---
>
> It looks to me as though we return from xfs_bunmapi having not completed
> all the unmapping work, though I can't tell if that's because bunmapi
> returned early because it thought it would overflow the extent count; or
> some other reason.

It could also be because the following conditions may have evaluated to true,

if (!wasdel && !isrt) {
    agno = XFS_FSB_TO_AGNO(mp, del.br_startblock);
    if (prev_agno != NULLAGNUMBER && prev_agno > agno)
          break;
    prev_agno = agno;
}

i.e. the fs block being unmapped belongs to an AG whose agno is less than that
of the previous fs block that was successfully unmapped.

I can't seem to recreate this bug. I tried it with 64k directory block size
with both 4k and 1k fs block sizes. I will keep trying.
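
For reference, the two configurations above map to fstests invocations along
the lines of the following sketch; I am assuming here that ./check picks up
MKFS_OPTIONS from the environment (which is how I usually pass mkfs options)
and that the test keeps the 537 number from this series:

# sketch only: 64k directory blocks over 4k and then 1k fs blocks
cd xfstests-dev
MKFS_OPTIONS="-b size=4k -n size=64k" ./check xfs/537
MKFS_OPTIONS="-b size=1k -n size=64k" ./check xfs/537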

However I hit another call trace with directory block size > fs block size,

------------[ cut here ]------------
WARNING: CPU: 2 PID: 6136 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0
Modules linked in:
CPU: 2 PID: 6136 Comm: fsstress Tainted: G        W         5.12.0-rc2-chandan #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:xfs_bmap_extents_to_btree+0x520/0x5d0
Code: 5f fb ff ff 89 0c 24 48 c7 c2 1d 14 ac b2 b9 9d 02 00 00 31 ff 48 c7 c6 bf 14 ac b2 e8 1e 70 ac 00 44 8b 14 24 e9 55 ff ff ff <0f> 0b c7 44 24 0c e4 ff ff ff e9 0f ff ff ff b9 0e 03 00 00 48 c7
RSP: 0018:ffffb43d4939f5c0 EFLAGS: 00010246
RAX: ffffffffffffffff RBX: ffffa011c8887048 RCX: 0000000000000de5
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffa010e29c9000
RBP: ffffa010e2e00000 R08: ffffffffb180e81a R09: 0000000000000116
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa011c8887000
R13: 0000000000000000 R14: ffffa010e11ed7e0 R15: ffffa011c8468100
FS:  00007efd9458fb80(0000) GS:ffffa011f7c80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efd9458d000 CR3: 00000001fff6c000 CR4: 00000000000006e0
Call Trace:
 ? __cond_resched+0x16/0x40
 ? __kmalloc_track_caller+0x6d/0x260
 xfs_bmap_add_extent_hole_real+0x747/0x960
 xfs_bmapi_allocate+0x380/0x410
 xfs_bmapi_write+0x507/0x9e0
 xfs_da_grow_inode_int+0x1cd/0x330
 ? _xfs_trans_bjoin+0x72/0x110
 xfs_dir2_grow_inode+0x62/0x110
 ? xfs_trans_log_inode+0xce/0x2d0
 xfs_dir2_sf_to_block+0x103/0x940
 ? xfs_dir2_sf_check+0x8c/0x210
 ? xfs_da_compname+0x19/0x30
 ? xfs_dir2_sf_lookup+0xd0/0x3d0
 xfs_dir2_sf_addname+0x10d/0x910
 xfs_dir_createname+0x1ad/0x210
 ? xfs_trans_log_inode+0xce/0x2d0
 xfs_rename+0x803/0xbb0
 ? avc_has_perm_noaudit+0x83/0x100
 xfs_vn_rename+0xdb/0x150
 vfs_rename+0x691/0xa90
 do_renameat2+0x393/0x540
 __x64_sys_renameat2+0x4b/0x60
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7efd945e8e9f
Code: 44 00 00 48 8b 15 f1 5f 16 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 49 89 ca 45 85 c0 74 40 b8 3c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 41 89 c0 83 f8 ff 74 09 44 89 c0 c3 0f 1f
RSP: 002b:00007fff55bbfee8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 0000563e5f50654f RCX: 00007efd945e8e9f
RDX: 00000000ffffff9c RSI: 0000563e5ff284e0 RDI: 00000000ffffff9c
RBP: 00007fff55bc0140 R08: 0000000000000004 R09: feff7efdff2f3562
R10: 0000563e5ff286c0 R11: 0000000000000202 R12: 0000563e5f4fb630
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
---[ end trace 6d859f8beaa17680 ]---

>
> OH CRAP, I just realized that xfs_dir2_shrink_inode only calls
> xfs_bunmapi once, which means that if the directory block it's removing
> is a multi-fsb block, it will remove the last extent map.  It then trips
> the assertion, having left the rest of the directory block still mapped.
>
> This is also what's going on when xfs_inactive_symlink_rmt trips the
> same ASSERT(done), because the symlink remote block can span multiple
> (two?) fs blocks but we only ever call xfs_bunmapi once.
>
> So, no, there's nothing wrong with this test, but it _did_ shake loose
> a couple of XFS bugs.  Congratulations!
>
> So... who wants to tackle this one?  This isn't trivial to clean up
> because you'll have to clean up all callers of xfs_dir2_shrink_inode to
> handle rolling of the transaction, and I bet the only way to fix this is
> to use deferred bunmap items to make sure the unmap always completes.

I will work on both of the above call traces. Thanks for reporting the bug
and for the follow-up analysis.

--
chandan
Chandan Babu R March 23, 2021, 11:27 a.m. UTC | #4
On 23 Mar 2021 at 10:58, Chandan Babu R wrote:
> On 23 Mar 2021 at 00:24, Darrick J. Wong wrote:
>> On Tue, Mar 09, 2021 at 10:31:24AM +0530, Chandan Babu R wrote:
>>> This commit adds a stress test that executes fsstress with
>>> bmap_alloc_minlen_extent error tag enabled.
>>
>> Continuing along the theme of watching the magic smoke come out when dir
>> block size > fs block size, I also observed the following assertion when
>> running this test:
>>
>>  XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 687
>>  ------------[ cut here ]------------
>>  WARNING: CPU: 0 PID: 3892 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
>>  Modules linked in: xfs(O) libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
>>  CPU: 0 PID: 3892 Comm: fsstress Tainted: G           O      5.12.0-rc4-xfsx #rc4
>>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
>>  RIP: 0010:assfail+0x3c/0x40 [xfs]
>>  Code: d0 d5 41 a0 e8 81 f9 ff ff 8a 1d 5b 44 0e 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 b0 d5 4d a0 e8 93 dc fc e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
>>  RSP: 0018:ffffc900035bba38 EFLAGS: 00010246
>>  RAX: 0000000000000000 RBX: ffff88804f204100 RCX: 0000000000000000
>>  RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa040c157
>>  RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a
>>  R10: 000000000000000a R11: f000000000000000 R12: ffff88805920b880
>>  R13: ffff888003778bb0 R14: 0000000000000000 R15: ffff88800f0f63c0
>>  FS:  00007fe7b5e2f740(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 00007fe7b6055000 CR3: 0000000053094005 CR4: 00000000001706b0
>>  Call Trace:
>>   xfs_dir2_shrink_inode+0x22f/0x270 [xfs]
>>   xfs_dir2_block_to_sf+0x29a/0x420 [xfs]
>>   xfs_dir2_block_removename+0x221/0x290 [xfs]
>>   xfs_dir_removename+0x1a0/0x220 [xfs]
>>   xfs_dir_rename+0x343/0x3b0 [xfs]
>>   xfs_rename+0x79e/0xae0 [xfs]
>>   xfs_vn_rename+0xdb/0x150 [xfs]
>>   vfs_rename+0x4e2/0x8e0
>>   do_renameat2+0x393/0x550
>>   __x64_sys_rename+0x40/0x50
>>   do_syscall_64+0x2d/0x40
>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>  RIP: 0033:0x7fe7b5e9800b
>>  Code: e8 aa ce 0a 00 85 c0 0f 95 c0 0f b6 c0 f7 d8 5d c3 66 0f 1f 44 00 00 b8 ff ff ff ff 5d c3 90 f3 0f 1e fa b8 52 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 51 4e 18 00 f7 d8
>>  RSP: 002b:00007ffeb526c698 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
>>  RAX: ffffffffffffffda RBX: 00007ffeb526c970 RCX: 00007fe7b5e9800b
>>  RDX: 0000000000000000 RSI: 000055d6ccdb9250 RDI: 000055d6ccdb9270
>>  RBP: 00007ffeb526c980 R08: 0000000000000001 R09: 0000000000000003
>>  R10: 000055d6cc3b20dc R11: 0000000000000246 R12: 0000000000000000
>>  R13: 0000000000000040 R14: 00007ffeb526c970 R15: 00007ffeb526c980
>>  ---[ end trace 98f99784621d65fe ]---
>>
>> It looks to me as though we return from xfs_bunmapi having not completed
>> all the unmapping work, though I can't tell if that's because bunmapi
>> returned early because it thought it would overflow the extent count; or
>> some other reason.
>
> It could also be because the following conditions may have evaluated to true,
>
> if (!wasdel && !isrt) {
>     agno = XFS_FSB_TO_AGNO(mp, del.br_startblock);
>     if (prev_agno != NULLAGNUMBER && prev_agno > agno)
>           break;
>     prev_agno = agno;
> }
>
> i.e. the fs block being unmapped belongs to an AG whose agno is less than that
> of the previous fs block that was successfully unmapped.
>
> I can't seem to recreate this bug. I tried it with 64k directory block size
> with both 4k and 1k fs block sizes. I will keep trying.
>
> However I hit another call trace with directory block size > fs block size,
>
> ------------[ cut here ]------------
> WARNING: CPU: 2 PID: 6136 at fs/xfs/libxfs/xfs_bmap.c:717 xfs_bmap_extents_to_btree+0x520/0x5d0
> Modules linked in:
> CPU: 2 PID: 6136 Comm: fsstress Tainted: G        W         5.12.0-rc2-chandan #7
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> RIP: 0010:xfs_bmap_extents_to_btree+0x520/0x5d0
> Code: 5f fb ff ff 89 0c 24 48 c7 c2 1d 14 ac b2 b9 9d 02 00 00 31 ff 48 c7 c6 bf 14 ac b2 e8 1e 70 ac 00 44 8b 14 24 e9 55 ff ff ff <0f> 0b c7 44 24 0c e4 ff ff ff e9 0f ff ff ff b9 0e 03 00 00 48 c7
> RSP: 0018:ffffb43d4939f5c0 EFLAGS: 00010246
> RAX: ffffffffffffffff RBX: ffffa011c8887048 RCX: 0000000000000de5
> RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffa010e29c9000
> RBP: ffffa010e2e00000 R08: ffffffffb180e81a R09: 0000000000000116
> R10: 0000000000000000 R11: 0000000000000000 R12: ffffa011c8887000
> R13: 0000000000000000 R14: ffffa010e11ed7e0 R15: ffffa011c8468100
> FS:  00007efd9458fb80(0000) GS:ffffa011f7c80000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007efd9458d000 CR3: 00000001fff6c000 CR4: 00000000000006e0
> Call Trace:
>  ? __cond_resched+0x16/0x40
>  ? __kmalloc_track_caller+0x6d/0x260
>  xfs_bmap_add_extent_hole_real+0x747/0x960
>  xfs_bmapi_allocate+0x380/0x410
>  xfs_bmapi_write+0x507/0x9e0
>  xfs_da_grow_inode_int+0x1cd/0x330
>  ? _xfs_trans_bjoin+0x72/0x110
>  xfs_dir2_grow_inode+0x62/0x110
>  ? xfs_trans_log_inode+0xce/0x2d0
>  xfs_dir2_sf_to_block+0x103/0x940
>  ? xfs_dir2_sf_check+0x8c/0x210
>  ? xfs_da_compname+0x19/0x30
>  ? xfs_dir2_sf_lookup+0xd0/0x3d0
>  xfs_dir2_sf_addname+0x10d/0x910
>  xfs_dir_createname+0x1ad/0x210
>  ? xfs_trans_log_inode+0xce/0x2d0
>  xfs_rename+0x803/0xbb0
>  ? avc_has_perm_noaudit+0x83/0x100
>  xfs_vn_rename+0xdb/0x150
>  vfs_rename+0x691/0xa90
>  do_renameat2+0x393/0x540
>  __x64_sys_renameat2+0x4b/0x60
>  do_syscall_64+0x33/0x40
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7efd945e8e9f
> Code: 44 00 00 48 8b 15 f1 5f 16 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 49 89 ca 45 85 c0 74 40 b8 3c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 39 41 89 c0 83 f8 ff 74 09 44 89 c0 c3 0f 1f
> RSP: 002b:00007fff55bbfee8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
> RAX: ffffffffffffffda RBX: 0000563e5f50654f RCX: 00007efd945e8e9f
> RDX: 00000000ffffff9c RSI: 0000563e5ff284e0 RDI: 00000000ffffff9c
> RBP: 00007fff55bc0140 R08: 0000000000000004 R09: feff7efdff2f3562
> R10: 0000563e5ff286c0 R11: 0000000000000202 R12: 0000563e5f4fb630
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> ---[ end trace 6d859f8beaa17680 ]---
>

The following patch fixed the bug,

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index e0905ad171f0..585f7e795023 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3586,7 +3586,8 @@ xfs_bmap_exact_minlen_extent_alloc(
 	args.fsbno = ap->blkno;
 	args.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
 	args.type = XFS_ALLOCTYPE_FIRST_AG;
-	args.total = args.minlen = args.maxlen = ap->minlen;
+	args.minlen = args.maxlen = ap->minlen;
+	args.total = ap->total;
 
 	args.alignment = 1;
 	args.minalignslop = 0;

Without the above change, the space allocator could allocate a block from an
AG which has less than ap->total blocks available. Hence, future allocation
requests from the same transaction could fail.
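
As a quick way to sanity check that theory, the per-AG free block counts can
be compared against the transaction's ap->total, e.g. (sketch only; the device
path is illustrative and the fs should be unmounted or quiesced first):

# dump the AGF free-space counters for AG 0
xfs_db -r -c "agf 0" -c "print freeblks" -c "print longest" /dev/vdc2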

I will have to go through some more code to confirm the correctness of the
patch. I will post the patch to the mailing list if I find it to be indeed
correct.
Chandan Babu R March 26, 2021, 4:05 a.m. UTC | #5
On 23 Mar 2021 at 10:58, Chandan Babu R wrote:
> On 23 Mar 2021 at 00:24, Darrick J. Wong wrote:
>> On Tue, Mar 09, 2021 at 10:31:24AM +0530, Chandan Babu R wrote:
>>> This commit adds a stress test that executes fsstress with
>>> bmap_alloc_minlen_extent error tag enabled.
>>
>> Continuing along the theme of watching the magic smoke come out when dir
>> block size > fs block size, I also observed the following assertion when
>> running this test:

Apart from the "xfs_dir2_shrink_inode only calls xfs_bunmapi once" bug, I
noticed that scrub detected a metadata inconsistency:

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 debian-guest 5.12.0-rc4-chandan #30 SMP Thu Mar 25 11:00:08 IST 2021
MKFS_OPTIONS  -- -f -b size=1k -m rmapbt=0,reflink=0 -n size=64k /dev/vdc2
MOUNT_OPTIONS -- /dev/vdc2 /mnt/scratch

xfs/538 43s ... _check_xfs_filesystem: filesystem on /dev/vdc2 failed scrub
(see /root/repos/xfstests-dev/results//xfs/538.full for details)

Ran: xfs/538
Failures: xfs/538
Failed 1 of 1 tests

I will work on fixing this one as well.
Darrick J. Wong March 26, 2021, 4:21 a.m. UTC | #6
On Fri, Mar 26, 2021 at 09:35:33AM +0530, Chandan Babu R wrote:
> On 23 Mar 2021 at 10:58, Chandan Babu R wrote:
> > On 23 Mar 2021 at 00:24, Darrick J. Wong wrote:
> >> On Tue, Mar 09, 2021 at 10:31:24AM +0530, Chandan Babu R wrote:
> >>> This commit adds a stress test that executes fsstress with
> >>> bmap_alloc_minlen_extent error tag enabled.
> >>
> >> Continuing along the theme of watching the magic smoke come out when dir
> >> block size > fs block size, I also observed the following assertion when
> >> running this test:
> 
> Apart from the "xfs_dir2_shrink_inode only calls xfs_bunmapi once" bug, I
> noticed that scrub detected a metadata inconsistency:
> 
> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 debian-guest 5.12.0-rc4-chandan #30 SMP Thu Mar 25 11:00:08 IST 2021
> MKFS_OPTIONS  -- -f -b size=1k -m rmapbt=0,reflink=0 -n size=64k /dev/vdc2
> MOUNT_OPTIONS -- /dev/vdc2 /mnt/scratch
> 
> xfs/538 43s ... _check_xfs_filesystem: filesystem on /dev/vdc2 failed scrub
> (see /root/repos/xfstests-dev/results//xfs/538.full for details)
> 
> Ran: xfs/538
> Failures: xfs/538
> Failed 1 of 1 tests
> 
> I will work on fixing this one as well.

Yeah, I noticed that too.  There are two bugs -- the first one is that
the "Block directories only have a single block at offset 0" check in
xchk_directory_blocks is wrong -- they can have a single *dir* block at
offset 0.  The correct check is (I think):

	if (is_block && got.br_startoff >= args.geo->fsbcount) {
		xchk_fblock_set_corrupt(...);
		break;
	}

The second problem is ... somewhere in the bmbt scrubber.  It looks like
there's a file with a data fork in btree format; the data fork has a
single child block; and there are few enough key/ptrs in that child
block that it /could/ fit in the root.  For some reason the btree code
didn't promote it, however.  Seeing as nobody's touched that code
lately, that probably means that scrub is wrong, but OTOH that /does/
seem like a waste of a block.
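
If anyone wants to poke at one of those inodes by hand, something like this
xfs_db incantation dumps the fork format and the data fork mappings (sketch
only; INO and the device path are placeholders):

# inspect the suspect inode's data fork format and extent mappings
xfs_db -r -c "inode INO" -c "print core.format" -c "print core.nextents" -c "bmap" /dev/vdc2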

But, that's as far as I've gotten on that second one and it's late.
Thanks for taking a look! :)

--D

> -- 
> chandan
Chandan Babu R April 15, 2021, 9:33 a.m. UTC | #7
On 23 Mar 2021 at 00:24, Darrick J. Wong wrote:
> On Tue, Mar 09, 2021 at 10:31:24AM +0530, Chandan Babu R wrote:
>> This commit adds a stress test that executes fsstress with
>> bmap_alloc_minlen_extent error tag enabled.
>
> Continuing along the theme of watching the magic smoke come out when dir
> block size > fs block size, I also observed the following assertion when
> running this test:
>
>  XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 687
>  ------------[ cut here ]------------
>  WARNING: CPU: 0 PID: 3892 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
>  Modules linked in: xfs(O) libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
>  CPU: 0 PID: 3892 Comm: fsstress Tainted: G           O      5.12.0-rc4-xfsx #rc4
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
>  RIP: 0010:assfail+0x3c/0x40 [xfs]
>  Code: d0 d5 41 a0 e8 81 f9 ff ff 8a 1d 5b 44 0e 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 b0 d5 4d a0 e8 93 dc fc e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
>  RSP: 0018:ffffc900035bba38 EFLAGS: 00010246
>  RAX: 0000000000000000 RBX: ffff88804f204100 RCX: 0000000000000000
>  RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa040c157
>  RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a
>  R10: 000000000000000a R11: f000000000000000 R12: ffff88805920b880
>  R13: ffff888003778bb0 R14: 0000000000000000 R15: ffff88800f0f63c0
>  FS:  00007fe7b5e2f740(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007fe7b6055000 CR3: 0000000053094005 CR4: 00000000001706b0
>  Call Trace:
>   xfs_dir2_shrink_inode+0x22f/0x270 [xfs]
>   xfs_dir2_block_to_sf+0x29a/0x420 [xfs]
>   xfs_dir2_block_removename+0x221/0x290 [xfs]
>   xfs_dir_removename+0x1a0/0x220 [xfs]
>   xfs_dir_rename+0x343/0x3b0 [xfs]
>   xfs_rename+0x79e/0xae0 [xfs]
>   xfs_vn_rename+0xdb/0x150 [xfs]
>   vfs_rename+0x4e2/0x8e0
>   do_renameat2+0x393/0x550
>   __x64_sys_rename+0x40/0x50
>   do_syscall_64+0x2d/0x40
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>  RIP: 0033:0x7fe7b5e9800b
>  Code: e8 aa ce 0a 00 85 c0 0f 95 c0 0f b6 c0 f7 d8 5d c3 66 0f 1f 44 00 00 b8 ff ff ff ff 5d c3 90 f3 0f 1e fa b8 52 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 51 4e 18 00 f7 d8
>  RSP: 002b:00007ffeb526c698 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
>  RAX: ffffffffffffffda RBX: 00007ffeb526c970 RCX: 00007fe7b5e9800b
>  RDX: 0000000000000000 RSI: 000055d6ccdb9250 RDI: 000055d6ccdb9270
>  RBP: 00007ffeb526c980 R08: 0000000000000001 R09: 0000000000000003
>  R10: 000055d6cc3b20dc R11: 0000000000000246 R12: 0000000000000000
>  R13: 0000000000000040 R14: 00007ffeb526c970 R15: 00007ffeb526c980
>  ---[ end trace 98f99784621d65fe ]---
>
> It looks to me as though we return from xfs_bunmapi having not completed
> all the unmapping work, though I can't tell if that's because bunmapi
> returned early because it thought it would overflow the extent count; or
> some other reason.
>
> OH CRAP, I just realized that xfs_dir2_shrink_inode only calls
> xfs_bunmapi once, which means that if the directory block it's removing
> is a multi-fsb block, it will remove the last extent map.  It then trips
> the assertion, having left the rest of the directory block still mapped.
>
> This is also what's going on when xfs_inactive_symlink_rmt trips the
> same ASSERT(done), because the symlink remote block can span multiple
> (two?) fs blocks but we only ever call xfs_bunmapi once.
>
> So, no, there's nothing wrong with this test, but it _did_ shake loose
> a couple of XFS bugs.  Congratulations!
>
> So... who wants to tackle this one?  This isn't trivial to clean up
> because you'll have to clean up all callers of xfs_dir2_shrink_inode to
> handle rolling of the transaction, and I bet the only way to fix this is
> to use deferred bunmap items to make sure the unmap always completes.
>

I was wondering why the above-described bug does not occur when
allocating blocks via xfs_bmap_btalloc(). This led me to the following:

1. When using xfs_bmap_btalloc() to allocate a directory block,
   xfs_bmalloca->total is set to total number of fs blocks required for the
   transaction to complete successfully. This includes blocks required to
   allocate
   - Data block
   - Free index block
   - Dabtree blocks and
   - Bmbt blocks.

2. Most of the time (please refer to step #5 for a description of the
   exceptional case), xfs_bmap_btalloc() chooses an AG for space allocation
   only when the AG has at least xfs_bmalloca->total number of free blocks. On
   finding such an AG, the corresponding AGF buffer is locked by the
   transaction and this guarantees that the fs blocks that make up a directory
   block are allocated from within the same AG. This is probably the reason
   for xfs_dir2_shrink_inode() to assume that __xfs_bunmapi() will be able to
   unmap all the constituent fs blocks.

3. The call trace posted above occurs when __xfs_bunmapi() starts unmapping
   the fs blocks of a directory block and one of the fs blocks happens to be
   from an AG whose AG number is less than that of fs block that was unmapped
   earlier.

4. The xfs_bmap_exact_minlen_extent_alloc() allocator can cause allocation of
   a directory block whose constituent fs blocks are from different AGs. This
   occurs because,
   - xfs_bmap_exact_minlen_extent_alloc() gets an AG which has at least
     xfs_bmalloca->total free fs blocks.
   - However some of those free fs blocks do not correspond to one-block sized
     extents (NOTE: xfs/538 test fragments 90% of the fs free space).
   - Once the current AG runs out of one-block sized extents, we move onto the
     next AG. This happens because xfs_bmap_exact_minlen_extent_alloc() uses
     XFS_ALLOCTYPE_FIRST_AG as the allocation type and this in turn causes the
     allocator code to iterate across AGs to get free blocks.

5. From code reading, I noticed that the scenario described in step #4 could
   also occur when using xfs_bmap_btalloc(). This happens when the filesystem
   is highly fragmented and is also running low on free space. In such a
   scenario, XFS_TRANS_LOWMODE is enabled, causing xfs_bmap_btalloc() to
   execute the same sequence of steps described in step #4. This scenario
   (i.e. fragmented fs and running low on free space) probably occurs quite
   rarely in practice, which may be why this problem was not observed earlier.
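
As an aside, the free space fragmentation that step #4 relies on is easy to
confirm. After the test's fill and punch-alternating phase, a free space
summary like the following is dominated by single-block extents (sketch only;
device path is illustrative and the fs should be unmounted or quiesced first):

# summarize free space extent sizes on the scratch device
xfs_db -r -c "freesp -s" /dev/vdc2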

--
chandan
Darrick J. Wong April 17, 2021, 12:26 a.m. UTC | #8
On Thu, Apr 15, 2021 at 03:03:47PM +0530, Chandan Babu R wrote:
> On 23 Mar 2021 at 00:24, Darrick J. Wong wrote:
> > On Tue, Mar 09, 2021 at 10:31:24AM +0530, Chandan Babu R wrote:
> >> This commit adds a stress test that executes fsstress with
> >> bmap_alloc_minlen_extent error tag enabled.
> >
> > Continuing along the theme of watching the magic smoke come out when dir
> > block size > fs block size, I also observed the following assertion when
> > running this test:
> >
> >  XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 687
> >  ------------[ cut here ]------------
> >  WARNING: CPU: 0 PID: 3892 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
> >  Modules linked in: xfs(O) libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables iptable_filter bfq sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
> >  CPU: 0 PID: 3892 Comm: fsstress Tainted: G           O      5.12.0-rc4-xfsx #rc4
> >  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> >  RIP: 0010:assfail+0x3c/0x40 [xfs]
> >  Code: d0 d5 41 a0 e8 81 f9 ff ff 8a 1d 5b 44 0e 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 b0 d5 4d a0 e8 93 dc fc e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
> >  RSP: 0018:ffffc900035bba38 EFLAGS: 00010246
> >  RAX: 0000000000000000 RBX: ffff88804f204100 RCX: 0000000000000000
> >  RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa040c157
> >  RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000000a
> >  R10: 000000000000000a R11: f000000000000000 R12: ffff88805920b880
> >  R13: ffff888003778bb0 R14: 0000000000000000 R15: ffff88800f0f63c0
> >  FS:  00007fe7b5e2f740(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 00007fe7b6055000 CR3: 0000000053094005 CR4: 00000000001706b0
> >  Call Trace:
> >   xfs_dir2_shrink_inode+0x22f/0x270 [xfs]
> >   xfs_dir2_block_to_sf+0x29a/0x420 [xfs]
> >   xfs_dir2_block_removename+0x221/0x290 [xfs]
> >   xfs_dir_removename+0x1a0/0x220 [xfs]
> >   xfs_dir_rename+0x343/0x3b0 [xfs]
> >   xfs_rename+0x79e/0xae0 [xfs]
> >   xfs_vn_rename+0xdb/0x150 [xfs]
> >   vfs_rename+0x4e2/0x8e0
> >   do_renameat2+0x393/0x550
> >   __x64_sys_rename+0x40/0x50
> >   do_syscall_64+0x2d/0x40
> >   entry_SYSCALL_64_after_hwframe+0x44/0xae
> >  RIP: 0033:0x7fe7b5e9800b
> >  Code: e8 aa ce 0a 00 85 c0 0f 95 c0 0f b6 c0 f7 d8 5d c3 66 0f 1f 44 00 00 b8 ff ff ff ff 5d c3 90 f3 0f 1e fa b8 52 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 51 4e 18 00 f7 d8
> >  RSP: 002b:00007ffeb526c698 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
> >  RAX: ffffffffffffffda RBX: 00007ffeb526c970 RCX: 00007fe7b5e9800b
> >  RDX: 0000000000000000 RSI: 000055d6ccdb9250 RDI: 000055d6ccdb9270
> >  RBP: 00007ffeb526c980 R08: 0000000000000001 R09: 0000000000000003
> >  R10: 000055d6cc3b20dc R11: 0000000000000246 R12: 0000000000000000
> >  R13: 0000000000000040 R14: 00007ffeb526c970 R15: 00007ffeb526c980
> >  ---[ end trace 98f99784621d65fe ]---
> >
> > It looks to me as though we return from xfs_bunmapi having not completed
> > all the unmapping work, though I can't tell if that's because bunmapi
> > returned early because it thought it would overflow the extent count; or
> > some other reason.
> >
> > OH CRAP, I just realized that xfs_dir2_shrink_inode only calls
> > xfs_bunmapi once, which means that if the directory block it's removing
> > is a multi-fsb block, it will remove the last extent map.  It then trips
> > the assertion, having left the rest of the directory block still mapped.
> >
> > This is also what's going on when xfs_inactive_symlink_rmt trips the
> > same ASSERT(done), because the symlink remote block can span multiple
> > (two?) fs blocks but we only ever call xfs_bunmapi once.
> >
> > So, no, there's nothing wrong with this test, but it _did_ shake loose
> > a couple of XFS bugs.  Congratulations!
> >
> > So... who wants to tackle this one?  This isn't trivial to clean up
> > because you'll have to clean up all callers of xfs_dir2_shrink_inode to
> > handle rolling of the transaction, and I bet the only way to fix this is
> > to use deferred bunmap items to make sure the unmap always completes.
> >
> 
> I was wondering why the above-described bug does not occur when
> allocating blocks via xfs_bmap_btalloc(). This led me to the following:
> 
> 1. When using xfs_bmap_btalloc() to allocate a directory block,
>    xfs_bmalloca->total is set to total number of fs blocks required for the
>    transaction to complete successfully. This includes blocks required to
>    allocate
>    - Data block
>    - Free index block
>    - Dabtree blocks and
>    - Bmbt blocks.
> 
> 2. Most of the time (please refer to step #5 for a description of the
>    exceptional case), xfs_bmap_btalloc() chooses an AG for space allocation
>    only when the AG has at least xfs_bmalloca->total number of free blocks. On
>    finding such an AG, the corresponding AGF buffer is locked by the
>    transaction and this guarantees that the fs blocks that make up a directory
>    block are allocated from within the same AG. This is probably the reason
>    for xfs_dir2_shrink_inode() to assume that __xfs_bunmapi() will be able to
>    unmap all the constituent fs blocks.

That sounds right to me...

> 
> 3. The call trace posted above occurs when __xfs_bunmapi() starts unmapping
>    the fs blocks of a directory block and one of the fs blocks happens to be
>    from an AG whose AG number is less than that of fs block that was unmapped
>    earlier.
> 
> 4. The xfs_bmap_exact_minlen_extent_alloc() allocator can cause allocation of
>    a directory block whose constituent fs blocks are from different AGs. This
>    occurs because,
>    - xfs_bmap_exact_minlen_extent_alloc() gets an AG which has at least
>      xfs_bmalloca->total free fs blocks.
>    - However some of those free fs blocks do not correspond to one-block sized
>      extents (NOTE: xfs/538 test fragments 90% of the fs free space).
>    - Once the current AG runs out of one-block sized extents, we move onto the
>      next AG. This happens because xfs_bmap_exact_minlen_extent_alloc() uses
>      XFS_ALLOCTYPE_FIRST_AG as the allocation type and this in turn causes the
>      allocator code to iterate across AGs to get free blocks.

...and the effect of your minlen debug knob is that we will pound on the
lowmode allocator much more frequently.  Multi-fsb directories are
already kind of rare, which means there probably aren't a lot of people
dealing with this combination.

IIRC the /only/ way you can have a multiblock symlink is on a v5
filesystem with 1k blocks, since the max symlink target length is 1024
bytes.
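
A quick illustration (paths are made up): on a 1k-block v5 filesystem a target
close to the 1024-byte limit cannot live inline in the inode, and the remote
symlink data it lands in spans two fs blocks because each remote block carries
a header:

# sketch: create a symlink whose remote data spans two 1k fs blocks
mkfs.xfs -f -m crc=1 -b size=1024 /dev/vdc2
mount /dev/vdc2 /mnt/scratch
ln -s "$(printf 'x%.0s' $(seq 1 1000))" /mnt/scratch/longlink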

> 5. From code reading, I noticed that the scenario described in step #4 could
>    also occur when using xfs_bmap_btalloc(). This happens when the filesystem
>    is highly fragmented and is also running low on free space. In such a
>    scenario, XFS_TRANS_LOWMODE is enabled, causing xfs_bmap_btalloc() to
>    execute the same sequence of steps described in step #4. This scenario
>    (i.e. fragmented fs and running low on free space) probably occurs quite
>    rarely in practice, which may be why this problem was not observed earlier.

Yes, I think this is accurate.

--D

> 
> --
> chandan

Patch

diff --git a/tests/xfs/537 b/tests/xfs/537
new file mode 100755
index 00000000..77fa60d9
--- /dev/null
+++ b/tests/xfs/537
@@ -0,0 +1,84 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2021 Chandan Babu R.  All Rights Reserved.
+#
+# FS QA Test 537
+#
+# Execute fsstress with bmap_alloc_minlen_extent error tag enabled.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/inject
+. ./common/populate
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+_supported_fs xfs
+_require_scratch
+_require_xfs_debug
+_require_test_program "punch-alternating"
+_require_xfs_io_error_injection "bmap_alloc_minlen_extent"
+
+echo "Format and mount fs"
+_scratch_mkfs_sized $((1024 * 1024 * 1024)) >> $seqres.full
+_scratch_mount >> $seqres.full
+
+bsize=$(_get_file_block_size $SCRATCH_MNT)
+
+echo "Consume free space"
+fillerdir=$SCRATCH_MNT/fillerdir
+nr_free_blks=$(stat -f -c '%f' $SCRATCH_MNT)
+nr_free_blks=$((nr_free_blks * 90 / 100))
+
+_fill_fs $((bsize * nr_free_blks)) $fillerdir $bsize 0 >> $seqres.full 2>&1
+
+echo "Create fragmented filesystem"
+for dentry in $(ls -1 $fillerdir/); do
+	$here/src/punch-alternating $fillerdir/$dentry >> $seqres.full
+done
+
+echo "Inject bmap_alloc_minlen_extent error tag"
+_scratch_inject_error bmap_alloc_minlen_extent 1
+
+echo "Scale fsstress args"
+args=$(_scale_fsstress_args -p $((LOAD_FACTOR * 75)) -n $((TIME_FACTOR * 1000)))
+
+echo "Execute fsstress in background"
+$FSSTRESS_PROG -d $SCRATCH_MNT $args \
+		 -f bulkstat=0 \
+		 -f bulkstat1=0 \
+		 -f fiemap=0 \
+		 -f getattr=0 \
+		 -f getdents=0 \
+		 -f getfattr=0 \
+		 -f listfattr=0 \
+		 -f mread=0 \
+		 -f read=0 \
+		 -f readlink=0 \
+		 -f readv=0 \
+		 -f stat=0 \
+		 -f aread=0 \
+		 -f dread=0 > /dev/null 2>&1
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/537.out b/tests/xfs/537.out
new file mode 100644
index 00000000..19633a07
--- /dev/null
+++ b/tests/xfs/537.out
@@ -0,0 +1,7 @@ 
+QA output created by 537
+Format and mount fs
+Consume free space
+Create fragmented filesystem
+Inject bmap_alloc_minlen_extent error tag
+Scale fsstress args
+Execute fsstress in background
diff --git a/tests/xfs/group b/tests/xfs/group
index ba674a58..5c827727 100644
--- a/tests/xfs/group
+++ b/tests/xfs/group
@@ -534,3 +534,4 @@ 
 534 auto quick reflink
 535 auto quick reflink
 536 auto quick
+537 auto stress