diff mbox series

mkfs.btrfs failure on zoned block devices with DUP metadata profile

Message ID 20220802060323.khfjqwhwwbpjbegb@shindev (mailing list archive)
State New, archived
Headers show
Series mkfs.btrfs failure on zoned block devices with DUP metadata profile | expand

Commit Message

Shinichiro Kawasaki Aug. 2, 2022, 6:03 a.m. UTC
Hello Qu,

I found that the latest version of mkfs.btrfs fails on zoned block device when
DUP profile is specified for metadata. The failure depends on btrfs-progs
version and the trigger commit is 2a93728391a1 ("btrfs-progs: use
write_data_to_disk() to replace write_extent_to_disk()"). After some
investigation, it looks for me that this is a common problem for zoned and non-
zoned block devices. Here I describe findings below. Your comments will be
appreciated.

On the failure, mkfs.btrfs reports "Error writing" to the device [1]. At this
time, kernel reports an I/O error, which indicates "Unaligned write command
error" to sequential write required zones. With strace, I observed that the mkfs
writes data to same sector twice. This repeated writes were observed regardless
whether the device is zoned or non-zoned. It does not cause an error for non-
zoned devices, but it causes the I/O error for zoned devices.

Using strace, I compared the write to a non-zoned disk image file before and
after the commit 2a93728391a1. It shows that the commit introduced the repeated
writes to same sectors [2]. Is the repeated write expected after the commit?
If not, common fix is needed for zoned and non-zoned devices.


[1]

(/dev/nullb0 is a zoned block device with no conventional zone)
$ sudo ./mkfs.btrfs /dev/nullb0
btrfs-progs v5.18.1
See http://btrfs.wiki.kernel.org for more information.

Zoned: /dev/nullb0: host-managed device detected, setting zoned feature
Resetting device zones /dev/nullb0 (64 zones) ...
NOTE: several default settings have changed in version 5.15, please make sure
      this does not affect your deployments:
      - DUP for metadata (-m dup)
      - enabled no-holes (-O no-holes)
      - enabled free-space-tree (-R free-space-tree)

Error writing to device 5
kernel-shared/transaction.c:156: __commit_transaction: BUG_ON `ret` triggered, value 5
./mkfs.btrfs(__commit_transaction+0x197)[0x4382a6]
./mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43840a]
./mkfs.btrfs(main+0x171c)[0x40d64e]
/lib64/libc.so.6(+0x29550)[0x7f5108629550]
/lib64/libc.so.6(__libc_start_main+0x89)[0x7f5108629609]
./mkfs.btrfs(_start+0x25)[0x40b965]
Aborted


[2]

$ truncate -s 1G /tmp/btrfs.img
$ git reset --hard 2a937283~; make -j
$ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup /tmp/btrfs.img |& grep btrfs.img > /tmp/pre.log
$ git reset --hard 2a937283; make -j
$ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup /tmp/btrfs.img |& grep btrfs.img > /tmp/post.log
$ diff -u /tmp/pre.log /tmp/post.log

Comments

Qu Wenruo Aug. 2, 2022, 6:17 a.m. UTC | #1
On 2022/8/2 14:03, Shinichiro Kawasaki wrote:
> Hello Qu,
>
> I found that the latest version of mkfs.btrfs fails on zoned block device when
> DUP profile is specified for metadata. The failure depends on btrfs-progs
> version and the trigger commit is 2a93728391a1 ("btrfs-progs: use
> write_data_to_disk() to replace write_extent_to_disk()"). After some
> investigation, it looks for me that this is a common problem for zoned and non-
> zoned block devices. Here I describe findings below. Your comments will be
> appreciated.
>
> On the failure, mkfs.btrfs reports "Error writing" to the device [1]. At this
> time, kernel reports an I/O error, which indicates "Unaligned write command
> error" to sequential write required zones. With strace, I observed that the mkfs
> writes data to same sector twice. This repeated writes were observed regardless
> whether the device is zoned or non-zoned. It does not cause an error for non-
> zoned devices, but it causes the I/O error for zoned devices.
>
> Using strace, I compared the write to a non-zoned disk image file before and
> after the commit 2a93728391a1. It shows that the commit introduced the repeated
> writes to same sectors [2]. Is the repeated write expected after the commit?
> If not, common fix is needed for zoned and non-zoned devices.

Thanks for the detailed report.

To me, the repeated write is definitely not an expected behavior.

Although a quick glance doesn't show me much hint, I'll definitely look
into the case.

My initial guess is write_data_to_disk() has something wrong related to
the mirror iteration, but no concrete conclusion yet.

I'll definitely keep you updated since this incorrect behavior is now
affecting zoned device now, and it will be treated as an emergency.

Thanks,
Qu
>
>
> [1]
>
> (/dev/nullb0 is a zoned block device with no conventional zone)
> $ sudo ./mkfs.btrfs /dev/nullb0
> btrfs-progs v5.18.1
> See http://btrfs.wiki.kernel.org for more information.
>
> Zoned: /dev/nullb0: host-managed device detected, setting zoned feature
> Resetting device zones /dev/nullb0 (64 zones) ...
> NOTE: several default settings have changed in version 5.15, please make sure
>        this does not affect your deployments:
>        - DUP for metadata (-m dup)
>        - enabled no-holes (-O no-holes)
>        - enabled free-space-tree (-R free-space-tree)
>
> Error writing to device 5
> kernel-shared/transaction.c:156: __commit_transaction: BUG_ON `ret` triggered, value 5
> ./mkfs.btrfs(__commit_transaction+0x197)[0x4382a6]
> ./mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43840a]
> ./mkfs.btrfs(main+0x171c)[0x40d64e]
> /lib64/libc.so.6(+0x29550)[0x7f5108629550]
> /lib64/libc.so.6(__libc_start_main+0x89)[0x7f5108629609]
> ./mkfs.btrfs(_start+0x25)[0x40b965]
> Aborted
>
>
> [2]
>
> $ truncate -s 1G /tmp/btrfs.img
> $ git reset --hard 2a937283~; make -j
> $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup /tmp/btrfs.img |& grep btrfs.img > /tmp/pre.log
> $ git reset --hard 2a937283; make -j
> $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup /tmp/btrfs.img |& grep btrfs.img > /tmp/post.log
> $ diff -u /tmp/pre.log /tmp/post.log
> --- /tmp/pre.log        2022-08-02 14:12:19.670688371 +0900
> +++ /tmp/post.log       2022-08-02 14:12:47.019382686 +0900
> @@ -32,36 +32,66 @@
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 5357568) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 4096, 65536) = 4096
>   pwrite64(5</tmp/btrfs.img>, ""..., 4096, 67108864) = 4096
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384
> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384
>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384
>   ....
>
Qu Wenruo Aug. 2, 2022, 7:10 a.m. UTC | #2
On 2022/8/2 14:17, Qu Wenruo wrote:
>
>
> On 2022/8/2 14:03, Shinichiro Kawasaki wrote:
>> Hello Qu,
>>
>> I found that the latest version of mkfs.btrfs fails on zoned block
>> device when
>> DUP profile is specified for metadata. The failure depends on btrfs-progs
>> version and the trigger commit is 2a93728391a1 ("btrfs-progs: use
>> write_data_to_disk() to replace write_extent_to_disk()"). After some
>> investigation, it looks for me that this is a common problem for zoned
>> and non-
>> zoned block devices. Here I describe findings below. Your comments
>> will be
>> appreciated.
>>
>> On the failure, mkfs.btrfs reports "Error writing" to the device [1].
>> At this
>> time, kernel reports an I/O error, which indicates "Unaligned write
>> command
>> error" to sequential write required zones. With strace, I observed
>> that the mkfs
>> writes data to same sector twice. This repeated writes were observed
>> regardless
>> whether the device is zoned or non-zoned. It does not cause an error
>> for non-
>> zoned devices, but it causes the I/O error for zoned devices.
>>
>> Using strace, I compared the write to a non-zoned disk image file
>> before and
>> after the commit 2a93728391a1. It shows that the commit introduced the
>> repeated
>> writes to same sectors [2]. Is the repeated write expected after the
>> commit?
>> If not, common fix is needed for zoned and non-zoned devices.
>
> Thanks for the detailed report.
>
> To me, the repeated write is definitely not an expected behavior.
>
> Although a quick glance doesn't show me much hint, I'll definitely look
> into the case.
>
> My initial guess is write_data_to_disk() has something wrong related to
> the mirror iteration, but no concrete conclusion yet.
>
> I'll definitely keep you updated since this incorrect behavior is now
> affecting zoned device now, and it will be treated as an emergency.

Pinned down the root cause.

It's the write_data_to_disk behavior not correct for multi-copy profiles.

I just added some extra debug output:

For SINGLE profile everything is fine:

   write_data_to_disk: logical=5341184 len=16384 mirror=1
   btrfs_pwrite: fd=5 offset=5341184 count=16384
   write_data_to_disk: logical=5357568 len=16384 mirror=1
   btrfs_pwrite: fd=5 offset=5357568 count=16384

But the problem is related to profiles with multiple copies (DUP included).

 From higher layer, we will call write_data_to_disk() for EACH mirror:

write_data_to_disk: logical=30408704 len=16384 mirror=1
btrfs_pwrite: fd=5 offset=38797312 count=16384
btrfs_pwrite: fd=5 offset=307232768 count=16384
write_data_to_disk: logical=30408704 len=16384 mirror=2
btrfs_pwrite: fd=5 offset=38797312 count=16384
btrfs_pwrite: fd=5 offset=307232768 count=16384

But inside write_data_to_disk() we call btrfs_map_block() which will
always return the full mirrors for write.

The fix can go either way:

- Don't try to iterate mirrors when calling write_data_to_disk().
   Just pass mirror = 0 into write_data_to_disk.

- Make write_data_to_disk() to really respect the mirror number
   Thus not really passing WRITE for btrfs_map_block().

I'll evaluate above solutions then send out a proper fix.

Thanks,
Qu
>
> Thanks,
> Qu
>>
>>
>> [1]
>>
>> (/dev/nullb0 is a zoned block device with no conventional zone)
>> $ sudo ./mkfs.btrfs /dev/nullb0
>> btrfs-progs v5.18.1
>> See http://btrfs.wiki.kernel.org for more information.
>>
>> Zoned: /dev/nullb0: host-managed device detected, setting zoned feature
>> Resetting device zones /dev/nullb0 (64 zones) ...
>> NOTE: several default settings have changed in version 5.15, please
>> make sure
>>        this does not affect your deployments:
>>        - DUP for metadata (-m dup)
>>        - enabled no-holes (-O no-holes)
>>        - enabled free-space-tree (-R free-space-tree)
>>
>> Error writing to device 5
>> kernel-shared/transaction.c:156: __commit_transaction: BUG_ON `ret`
>> triggered, value 5
>> ./mkfs.btrfs(__commit_transaction+0x197)[0x4382a6]
>> ./mkfs.btrfs(btrfs_commit_transaction+0x13b)[0x43840a]
>> ./mkfs.btrfs(main+0x171c)[0x40d64e]
>> /lib64/libc.so.6(+0x29550)[0x7f5108629550]
>> /lib64/libc.so.6(__libc_start_main+0x89)[0x7f5108629609]
>> ./mkfs.btrfs(_start+0x25)[0x40b965]
>> Aborted
>>
>>
>> [2]
>>
>> $ truncate -s 1G /tmp/btrfs.img
>> $ git reset --hard 2a937283~; make -j
>> $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup
>> /tmp/btrfs.img |& grep btrfs.img > /tmp/pre.log
>> $ git reset --hard 2a937283; make -j
>> $ sudo strace -y -s 0 -e pwrite64 ./mkfs.btrfs -f -d single -m dup
>> /tmp/btrfs.img |& grep btrfs.img > /tmp/post.log
>> $ diff -u /tmp/pre.log /tmp/post.log
>> --- /tmp/pre.log        2022-08-02 14:12:19.670688371 +0900
>> +++ /tmp/post.log       2022-08-02 14:12:47.019382686 +0900
>> @@ -32,36 +32,66 @@
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 5357568) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 4096, 65536) = 4096
>>   pwrite64(5</tmp/btrfs.img>, ""..., 4096, 67108864) = 4096
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384
>> +pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384
>>   pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384
>>   ....
>>
diff mbox series

Patch

--- /tmp/pre.log        2022-08-02 14:12:19.670688371 +0900
+++ /tmp/post.log       2022-08-02 14:12:47.019382686 +0900
@@ -32,36 +32,66 @@ 
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 5357568) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38797312) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92471296) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 4096, 65536) = 4096
 pwrite64(5</tmp/btrfs.img>, ""..., 4096, 67108864) = 4096
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 22020096) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 30408704) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38813696) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92487680) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384
+pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 38830080) = 16384
 pwrite64(5</tmp/btrfs.img>, ""..., 16384, 92504064) = 16384
 ....

-- 
Shin'ichiro Kawasaki