Message ID | 1661357103-22735-1-git-send-email-zhanglikernel@gmail.com (mailing list archive)
---|---
State | New, archived
Series | Make btrfs_prepare_device parallel during mkfs.btrfs
On 2022/8/25 00:05, Li Zhang wrote: > [enhancement] > When a disk is formatted as btrfs, it calls > btrfs_prepare_device for each device, which takes too much time. The idea is awesome. > > [implementation] > Put each btrfs_prepare_device into a thread, > wait for the first thread to complete to mkfs.btrfs, > and wait for other threads to complete before adding > other devices to the file system. > > [test] > Using the btrfs-progs test case mkfs-tests, mkfs.btrfs works fine. > > But I don't have an actual zoed device, > so I don't know how much time it saves, If you guys > have a way to test it, please let me know. > > Signed-off-by: Li Zhang <zhanglikernel@gmail.com> > --- > Issue: 496 > > mkfs/main.c | 113 +++++++++++++++++++++++++++++++++++++++++++++--------------- > 1 file changed, 86 insertions(+), 27 deletions(-) > > diff --git a/mkfs/main.c b/mkfs/main.c > index ce096d3..35fefe2 100644 > --- a/mkfs/main.c > +++ b/mkfs/main.c > @@ -31,6 +31,7 @@ > #include <uuid/uuid.h> > #include <ctype.h> > #include <blkid/blkid.h> > +#include <pthread.h> > #include "kernel-shared/ctree.h" > #include "kernel-shared/disk-io.h" > #include "kernel-shared/free-space-tree.h" > @@ -60,6 +61,18 @@ struct mkfs_allocation { > u64 system; > }; > > + > +struct prepare_device_progress { > + char *file; > + u64 dev_block_count; > + u64 block_count; > + bool zero_end; > + bool discard; > + bool zoned; > + int oflags; A small nitpick. Aren't those 4 values the same shared by all devices? Thus I'm not sure if they need to be put into prepare_device_progress at all. IIRC, we may want some shared memory between all the threads: - A pthread_mutex Will be explained later - All the other shared infos like above flags/oflags It can be global or passed by some pointers. > + int ret; > +}; > + > static int create_metadata_block_groups(struct btrfs_root *root, bool mixed, > struct mkfs_allocation *allocation) > { > @@ -969,6 +982,28 @@ fail: > return ret; > } > > +static void *prepare_one_dev(void *ctx) > +{ > + struct prepare_device_progress *prepare_ctx = ctx; > + int fd; > + > + fd = open(prepare_ctx->file, prepare_ctx->oflags); > + if (fd < 0) { > + error("unable to open %s: %m", prepare_ctx->file); If we have no permission for all devices (pretty common in fact, e.g. forgot to use sudo), we will have multiple threads printing out the same time. Without a lock, the output will be a mess. Thus we may want a mutex, even it's just for synchronizing the output. > + prepare_ctx->ret = fd; > + return NULL; > + } > + prepare_ctx->ret = btrfs_prepare_device(fd, > + prepare_ctx->file, &prepare_ctx->dev_block_count, > + prepare_ctx->block_count, > + (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | > + (prepare_ctx->zero_end ? PREP_DEVICE_ZERO_END : 0) | > + (prepare_ctx->discard ? PREP_DEVICE_DISCARD : 0) | > + (prepare_ctx->zoned ? 
PREP_DEVICE_ZONED : 0)); > + close(fd); > + return NULL; > +} > + > int BOX_MAIN(mkfs)(int argc, char **argv) > { > char *file; > @@ -997,7 +1032,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > bool ssd = false; > bool zoned = false; > bool force_overwrite = false; > - int oflags; > char *source_dir = NULL; > bool source_dir_set = false; > bool shrink_rootdir = false; > @@ -1006,6 +1040,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > u64 shrink_size; > int dev_cnt = 0; > int saved_optind; > + pthread_t *t_prepare = NULL; > + struct prepare_device_progress *prepare_ctx = NULL; > char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 }; > u64 features = BTRFS_MKFS_DEFAULT_FEATURES; > u64 runtime_features = BTRFS_MKFS_DEFAULT_RUNTIME_FEATURES; > @@ -1428,29 +1464,45 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > goto error; > } > > - dev_cnt--; > - > - oflags = O_RDWR; > - if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > - oflags |= O_DIRECT; > + t_prepare = malloc(dev_cnt * sizeof(*t_prepare)); > + prepare_ctx = malloc(dev_cnt * sizeof(*prepare_ctx)); > > - /* > - * Open without O_EXCL so that the problem should not occur by the > - * following operation in kernel: > - * (btrfs_register_one_device() fails if O_EXCL is on) > - */ > - fd = open(file, oflags); > - if (fd < 0) { > - error("unable to open %s: %m", file); > + if (!t_prepare || !prepare_ctx) { > + error("unable to prepare dev"); Isn't this ENOMEM? The message doesn't seem to match the situation. > goto error; > } > - ret = btrfs_prepare_device(fd, file, &dev_block_count, block_count, > - (zero_end ? PREP_DEVICE_ZERO_END : 0) | > - (discard ? PREP_DEVICE_DISCARD : 0) | > - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | > - (zoned ? PREP_DEVICE_ZONED : 0)); > + > + for (i = 0; i < dev_cnt; i++) { > + prepare_ctx[i].file = argv[optind + i - 1]; > + prepare_ctx[i].block_count = block_count; > + prepare_ctx[i].dev_block_count = block_count; > + prepare_ctx[i].zero_end = zero_end; > + prepare_ctx[i].discard = discard; > + prepare_ctx[i].zoned = zoned; > + if (i == 0) { > + prepare_ctx[i].oflags = O_RDWR; > + /* > + * Open without O_EXCL so that the problem should > + * not occur by the following operation in kernel: > + * (btrfs_register_one_device() fails if O_EXCL is on) > + */ The comment seems out-dated, no O_EXCL involved anywhere. > + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; Do we need to treat the initial and other devices differently? Can't we use the same flags for all devices? > + } else { > + prepare_ctx[i].oflags = O_RDWR; > + } > + ret = pthread_create(&t_prepare[i], NULL, > + prepare_one_dev, &prepare_ctx[i]); > + } > + pthread_join(t_prepare[0], NULL); > + ret = prepare_ctx[0].ret; > + Can't we just wait for all devices? I don't think treating them different could have much benefit. Yes, we can have multiple-devices with different performance characteristics, thus if the first device is the fastest one, it may finish before all the others. But this also means, the first one can be the slowest. To me, parallel initialization is already a big enough improvement, and for the most common case, all the devices should have the same or similar performance characteristics, thus waiting for them all shouldn't cause much difference. 
> if (ret) > goto error; > + > + dev_cnt--; > + fd = open(file, prepare_ctx[0].oflags); > + dev_block_count = prepare_ctx[0].dev_block_count; > if (block_count && block_count > dev_block_count) { > error("%s is smaller than requested size, expected %llu, found %llu", > file, (unsigned long long)block_count, > @@ -1459,7 +1511,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > } > > /* To create the first block group and chunk 0 in make_btrfs */ > - system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; > + system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; > if (dev_block_count < system_group_size) { > error("device is too small to make filesystem, must be at least %llu", > (unsigned long long)system_group_size); > @@ -1557,6 +1609,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > if (dev_cnt == 0) > goto raid_groups; > > + for (i = 0 ; i < dev_cnt; i++) { > + pthread_join(t_prepare[i+1], NULL); > + if (prepare_ctx[i+1].ret) { > + goto error; > + } > + } > while (dev_cnt-- > 0) { > file = argv[optind++]; > > @@ -1578,12 +1636,9 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > close(fd); > continue; > } > - ret = btrfs_prepare_device(fd, file, &dev_block_count, > - block_count, > - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | > - (zero_end ? PREP_DEVICE_ZERO_END : 0) | > - (discard ? PREP_DEVICE_DISCARD : 0) | > - (zoned ? PREP_DEVICE_ZONED : 0)); > + dev_block_count = prepare_ctx[argc - saved_optind - dev_cnt - 1] > + .dev_block_count; > + > if (ret) { > goto error; > } This goto error is a dead code now. Thanks for the great idea on reducing the preparation time! Qu > @@ -1763,12 +1818,16 @@ out: > > btrfs_close_all_devices(); > free(label); > - > + free(t_prepare); > + free(prepare_ctx); > return !!ret; > + > error: > if (fd > 0) > close(fd); > > + free(t_prepare); > + free(prepare_ctx); > free(label); > exit(1); > success:
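For illustration, a minimal sketch of the output mutex Qu suggests above, assuming a file-scope lock (the name output_mutex is made up here, it is not in the submitted patch); error() and struct prepare_device_progress are the ones from the patch, and only the error path of prepare_one_dev() is shown:

    static pthread_mutex_t output_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *prepare_one_dev(void *ctx)
    {
            struct prepare_device_progress *prepare_ctx = ctx;
            int fd;

            fd = open(prepare_ctx->file, prepare_ctx->oflags);
            if (fd < 0) {
                    /* serialize the message so parallel failures do not interleave */
                    pthread_mutex_lock(&output_mutex);
                    error("unable to open %s: %m", prepare_ctx->file);
                    pthread_mutex_unlock(&output_mutex);
                    prepare_ctx->ret = fd;
                    return NULL;
            }
            /* btrfs_prepare_device() call and close(fd) unchanged from the patch */
            close(fd);
            return NULL;
    }

The same lock could also serialize the verbose progress output from btrfs_prepare_device() itself if that turns out to interleave.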
On 25.08.22 07:20, Qu Wenruo wrote: >> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > Do we need to treat the initial and other devices differently? > > Can't we use the same flags for all devices? > > Yep, we need to have the same flags for all devices. Otherwise, in the case of a host-managed device, only device 0 will be opened with O_DIRECT and the subsequent devices will be opened without O_DIRECT, causing mkfs to fail.
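A rough sketch of that direction, reusing the patch's loop variables and the existing zoned_model() helper (the single-pass flag selection below is illustrative, not code from the patch): decide the open flags once, then hand the same value to every prepare_ctx entry.

    /* Pick one set of open flags up front and reuse it for every device:
     * if any device is host-managed zoned, open all of them with O_DIRECT. */
    int oflags = O_RDWR;

    for (i = 0; i < dev_cnt; i++) {
            if (zoned && zoned_model(argv[optind + i - 1]) == ZONED_HOST_MANAGED) {
                    oflags |= O_DIRECT;
                    break;
            }
    }
    for (i = 0; i < dev_cnt; i++)
            prepare_ctx[i].oflags = oflags;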
On 24.08.22 18:06, Li Zhang wrote: > [enhancement] > When a disk is formatted as btrfs, it calls > btrfs_prepare_device for each device, which takes too much time. That really is awesome. I'll throw it onto my 60 zoned HDD test box, once all devices have the same open flags. [...] > + t_prepare = malloc(dev_cnt * sizeof(*t_prepare)); > + prepare_ctx = malloc(dev_cnt * sizeof(*prepare_ctx)); That really should be calloc().
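For reference, the calloc() version could look like the following (a sketch of the suggested change; the error message wording is made up). Zero-filled memory gives every .ret and .dev_block_count field a defined initial value even if a later pthread_create() fails.

    t_prepare = calloc(dev_cnt, sizeof(*t_prepare));
    prepare_ctx = calloc(dev_cnt, sizeof(*prepare_ctx));
    if (!t_prepare || !prepare_ctx) {
            error("not enough memory for device preparation contexts");
            goto error;
    }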
On 2022/8/25 16:31, Johannes Thumshirn wrote: > On 25.08.22 07:20, Qu Wenruo wrote: >>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >> Do we need to treat the initial and other devices differently? >> >> Can't we use the same flags for all devices? >> >> > > Yep we need to have the same flags for all devices. Otherwise only > device 0 will be opened with O_DIRECT, in case of a host-managed one and > the subsequent will be opened without O_DIRECT causing mkfs to fail. Just a little curious, currently btrfs doesn't support mixed traditional/zoned devices, right? So that O_DIRECT for all devices are for future mixed zoned mode? Anyway I'm completely fine if we can use the same oflags for all devices. Thanks, Qu
On 25.08.22 10:36, Qu Wenruo wrote: > > > On 2022/8/25 16:31, Johannes Thumshirn wrote: >> On 25.08.22 07:20, Qu Wenruo wrote: >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >>> Do we need to treat the initial and other devices differently? >>> >>> Can't we use the same flags for all devices? >>> >>> >> >> Yep we need to have the same flags for all devices. Otherwise only >> device 0 will be opened with O_DIRECT, in case of a host-managed one and >> the subsequent will be opened without O_DIRECT causing mkfs to fail. > > Just a little curious, currently btrfs doesn't support mixed > traditional/zoned devices, right? > > So that O_DIRECT for all devices are for future mixed zoned mode? We need it in case of multiple zoned devices as well. The mixed mode you describe above could actually work thanks to the zoned emulation we have in place. But I've never actually tried to be honest.
Hi, I'm a bit confused, do you mean if you open a zoned device without O_DIRECT it will fail? I tested and found that if I open a device with the O_DIRECT flag on a virtual device like a loop device, the device cannot be written to, but with or without O_DIRECT, it works fine on a real device (for me, I only test A normal block device since I don't have any zoned devices) If we use the same flags for all devices, does that mean we can't use mkfs.btrfs on both real and virtual devices at the same time. Below is my test program and test results. code(main idea): printf("filename:%s.\n", argv[1]); int fd = open(argv[1], O_RDWR | O_DIRECT); if (fd < 0) { printf("fd:error.\n"); return -1; } int num = write(fd, "123", 3); printf("num:%d.\n", num); close(fd); result: $ sudo losetup /dev/loop1 loopDev/loop1 $ sudo ./a.out /dev/loop1 filename:/dev/loop1. num:-1. # cannot write to loop1 Thanks, Li Zhang Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: > > On 25.08.22 07:20, Qu Wenruo wrote: > >> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > >> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > > Do we need to treat the initial and other devices differently? > > > > Can't we use the same flags for all devices? > > > > > > Yep we need to have the same flags for all devices. Otherwise only > device 0 will be opened with O_DIRECT, in case of a host-managed one and > the subsequent will be opened without O_DIRECT causing mkfs to fail.
On 2022/8/28 16:53, li zhang wrote: > Hi, I'm a bit confused, do you mean if you open a zoned device > without O_DIRECT it will fail? Not a zoned device expert, but to my understanding, if we write into zoned device, without O_DIRECT, there is no guarantee that the data you submitted will end at the same bytenr you specified. E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. Without O_DIRECT, the zoned code can re-locate the bytenr to any range after the write pointer inside the same zone. AKA, for zoned device, without O_DIRECT (queue length 1), you can only known the real physical bytenr after the write has fully finished. (The final physical bytenr is determined by the zoned device, no longer the write initiator). > > I tested and found that if I open a device with the O_DIRECT flag > on a virtual device like a loop device, the device cannot be written > to, but with or without O_DIRECT, it works fine on a real > device (for me, I only test A normal block device since I don't have > any zoned devices) IIRC currently there is no zoned emulation for loop device. If you want to test zoned device, you can use null block kernel module, with fully memory backed storage: https://zonedstorage.io/docs/getting-started/nullblk Or go a little further, using tcmu-runner to create file backed zoned device: https://zonedstorage.io/docs/tools/tcmu-runner > > If we use the same flags for all devices, > does that mean we can't use mkfs.btrfs > on both real and virtual devices at the same time. > > > Below is my test program and test results. > > code(main idea): > printf("filename:%s.\n", argv[1]); > int fd = open(argv[1], O_RDWR | O_DIRECT); > if (fd < 0) { > printf("fd:error.\n"); > return -1; > } > int num = write(fd, "123", 3); > printf("num:%d.\n", num); O_DIRECT requires strict memory alignment, obviously the length 3 is not properly aligned. Please check open(2p) for the full requirement. For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT can work correctly. Back to btrfs-progs work, I'd say before we do anything, let's check all the devices passed in to determine if we want zoned mode (any zoned device should make it zoned). Then we can determine the open flags for all devices, and for regular devices, O_DIRECT mostly makes no difference (maybe a little slower, but may not even be observable). Thanks, Qu > close(fd); > > result: > $ sudo losetup /dev/loop1 loopDev/loop1 > $ sudo ./a.out /dev/loop1 > filename:/dev/loop1. > num:-1. > # cannot write to loop1 > > > Thanks, > Li Zhang > > Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: >> >> On 25.08.22 07:20, Qu Wenruo wrote: >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >>> Do we need to treat the initial and other devices differently? >>> >>> Can't we use the same flags for all devices? >>> >>> >> >> Yep we need to have the same flags for all devices. Otherwise only >> device 0 will be opened with O_DIRECT, in case of a host-managed one and >> the subsequent will be opened without O_DIRECT causing mkfs to fail.
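To illustrate the alignment requirement against Li's loop-device test, a version of that test program that should work with O_DIRECT might look like this (a standalone sketch, not btrfs-progs code; buffer address, length and file offset are all 4K-aligned, and on a zoned device the offset would additionally have to sit at a zone's write pointer):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            void *buf;
            ssize_t num;
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <device>\n", argv[0]);
                    return 1;
            }
            printf("filename:%s.\n", argv[1]);
            fd = open(argv[1], O_RDWR | O_DIRECT);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* O_DIRECT needs an aligned buffer address, length and file offset */
            if (posix_memalign(&buf, 4096, 4096)) {
                    close(fd);
                    return 1;
            }
            memset(buf, 0, 4096);
            memcpy(buf, "123", 3);
            num = pwrite(fd, buf, 4096, 0);
            printf("num:%zd.\n", num);
            free(buf);
            close(fd);
            return 0;
    }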
Yes, I see what you mean. There is no doubt that the loop device is not a zone device. I simulated the zone device with the null_blk module and tested mkfs.btrfs, but an error was reported. In addition, Not only mkfs.btrfs does not work on null_blk zoned devices, mkfs.xfs and mkfs.ext2 also do not work on null_blk zoned devices, here is the test log. My first instinct is the null_blk problem . But I didn't test tcmu-runner, I'll dig into it later anyway. #emulate zoned device using null_blk $ sudo modprobe null_blk nr_devices=4 zoned=1 #mkfs.xfs failed $ sudo mkfs.xfs -V mkfs.xfs version 5.18.0 $ sudo mkfs.xfs /dev/nullb0 -f meta-data=/dev/nullb0 isize=512 agcount=4, agsize=16384000 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 bigtime=1 inobtcount=1 data = bsize=4096 blocks=65536000, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=32000, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 mkfs.xfs: pwrite failed: Input/output error libxfs_bwrite: write failed on (unknown) bno 0x1f3fff00/0x100, err=5 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! mkfs.xfs: pwrite failed: Input/output error libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! mkfs.xfs: pwrite failed: Input/output error libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5 mkfs.xfs: Releasing dirty buffer to free list! mkfs.xfs: libxfs_device_zero write failed: Input/output error #mkfs.btrfs failed $ sudo mkfs.btrfs --version mkfs.btrfs, part of btrfs-progs v5.19 $ sudo mkfs.btrfs -d single -m single -O zoned /dev/nullb0 /dev/nullb1 /dev/nullb2 -f btrfs-progs v5.19 See http://btrfs.wiki.kernel.org for more information. Resetting device zones /dev/nullb0 (1000 zones) ... Resetting device zones /dev/nullb2 (1000 zones) ... Resetting device zones /dev/nullb1 (1000 zones) ... NOTE: several default settings have changed in version 5.15, please make sure this does not affect your deployments: - DUP for metadata (-m dup) - enabled no-holes (-O no-holes) - enabled free-space-tree (-R free-space-tree) No valid Btrfs found on /dev/nullb0 ERROR: open ctree failed #mkfs.ext2 failed $ sudo mke2fs -V mke2fs 1.46.5 (30-Dec-2021) Using EXT2FS Library version 1.46.5 $ sudo mke2fs /dev/nullb0 mke2fs 1.46.5 (30-Dec-2021) Creating filesystem with 65536000 4k blocks and 16384000 inodes Filesystem UUID: 747350a2-a1d5-4944-9f46-0fe4ca76df9d Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872 Allocating group tables: done Writing inode tables: done Writing superblocks and filesystem accounting information: mke2fs: Input/output error while writing out and closing file system thanks, Li Zhang Qu Wenruo <quwenruo.btrfs@gmx.com> 于2022年8月28日周日 17:54写道: > > > > On 2022/8/28 16:53, li zhang wrote: > > Hi, I'm a bit confused, do you mean if you open a zoned device > > without O_DIRECT it will fail? > > Not a zoned device expert, but to my understanding, if we write into > zoned device, without O_DIRECT, there is no guarantee that the data you > submitted will end at the same bytenr you specified. > > E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. 
> > Without O_DIRECT, the zoned code can re-locate the bytenr to any range > after the write pointer inside the same zone. > > AKA, for zoned device, without O_DIRECT (queue length 1), you can only > known the real physical bytenr after the write has fully finished. > > (The final physical bytenr is determined by the zoned device, no longer > the write initiator). > > > > > I tested and found that if I open a device with the O_DIRECT flag > > on a virtual device like a loop device, the device cannot be written > > to, but with or without O_DIRECT, it works fine on a real > > device (for me, I only test A normal block device since I don't have > > any zoned devices) > > IIRC currently there is no zoned emulation for loop device. > > If you want to test zoned device, you can use null block kernel module, > with fully memory backed storage: > > https://zonedstorage.io/docs/getting-started/nullblk > > > Or go a little further, using tcmu-runner to create file backed zoned > device: > > https://zonedstorage.io/docs/tools/tcmu-runner > > > > > If we use the same flags for all devices, > > does that mean we can't use mkfs.btrfs > > on both real and virtual devices at the same time. > > > > > > Below is my test program and test results. > > > > code(main idea): > > printf("filename:%s.\n", argv[1]); > > int fd = open(argv[1], O_RDWR | O_DIRECT); > > if (fd < 0) { > > printf("fd:error.\n"); > > return -1; > > } > > int num = write(fd, "123", 3); > > printf("num:%d.\n", num); > > O_DIRECT requires strict memory alignment, obviously the length 3 is not > properly aligned. > > Please check open(2p) for the full requirement. > > For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT > can work correctly. > > > Back to btrfs-progs work, I'd say before we do anything, let's check all > the devices passed in to determine if we want zoned mode (any zoned > device should make it zoned). > > Then we can determine the open flags for all devices, and for regular > devices, O_DIRECT mostly makes no difference (maybe a little slower, but > may not even be observable). > > Thanks, > Qu > > > > close(fd); > > > > result: > > $ sudo losetup /dev/loop1 loopDev/loop1 > > $ sudo ./a.out /dev/loop1 > > filename:/dev/loop1. > > num:-1. > > # cannot write to loop1 > > > > > > Thanks, > > Li Zhang > > > > Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: > >> > >> On 25.08.22 07:20, Qu Wenruo wrote: > >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > >>> Do we need to treat the initial and other devices differently? > >>> > >>> Can't we use the same flags for all devices? > >>> > >>> > >> > >> Yep we need to have the same flags for all devices. Otherwise only > >> device 0 will be opened with O_DIRECT, in case of a host-managed one and > >> the subsequent will be opened without O_DIRECT causing mkfs to fail.
By the way, my kernel version is 5.19.0, and I also tested the 5.0 version (maybe, I only remember that the version starts with 5), the same error output thanks, Li Zhang li zhang <zhanglikernel@gmail.com> 于2022年8月28日周日 22:26写道: > > Yes, I see what you mean. > > There is no doubt that the loop device is not a zone device. > I simulated the zone device with the null_blk module and tested > mkfs.btrfs, but an error was reported. In addition, Not only > mkfs.btrfs does not work on null_blk zoned devices, mkfs.xfs and mkfs.ext2 also > do not work on null_blk zoned devices, here is the test log. My first > instinct is > the null_blk problem . But I didn't test tcmu-runner, I'll dig into it > later anyway. > > > #emulate zoned device using null_blk > $ sudo modprobe null_blk nr_devices=4 zoned=1 > > #mkfs.xfs failed > $ sudo mkfs.xfs -V > mkfs.xfs version 5.18.0 > $ sudo mkfs.xfs /dev/nullb0 -f > meta-data=/dev/nullb0 isize=512 agcount=4, agsize=16384000 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=0 > = reflink=1 bigtime=1 inobtcount=1 > data = bsize=4096 blocks=65536000, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=32000, version=2 > = sectsz=512 sunit=0 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x1f3fff00/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > mkfs.xfs: libxfs_device_zero write failed: Input/output error > > #mkfs.btrfs failed > $ sudo mkfs.btrfs --version > mkfs.btrfs, part of btrfs-progs v5.19 > $ sudo mkfs.btrfs -d single -m single -O zoned /dev/nullb0 /dev/nullb1 > /dev/nullb2 -f > btrfs-progs v5.19 > See http://btrfs.wiki.kernel.org for more information. > > Resetting device zones /dev/nullb0 (1000 zones) ... > Resetting device zones /dev/nullb2 (1000 zones) ... > Resetting device zones /dev/nullb1 (1000 zones) ... 
> NOTE: several default settings have changed in version 5.15, please make sure > this does not affect your deployments: > - DUP for metadata (-m dup) > - enabled no-holes (-O no-holes) > - enabled free-space-tree (-R free-space-tree) > > No valid Btrfs found on /dev/nullb0 > ERROR: open ctree failed > > #mkfs.ext2 failed > $ sudo mke2fs -V > mke2fs 1.46.5 (30-Dec-2021) > Using EXT2FS Library version 1.46.5 > $ sudo mke2fs /dev/nullb0 > mke2fs 1.46.5 (30-Dec-2021) > Creating filesystem with 65536000 4k blocks and 16384000 inodes > Filesystem UUID: 747350a2-a1d5-4944-9f46-0fe4ca76df9d > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000, 7962624, 11239424, 20480000, 23887872 > > Allocating group tables: done > Writing inode tables: done > Writing superblocks and filesystem accounting information: mke2fs: > Input/output error while writing out and closing file system > > > > thanks, > Li Zhang > > Qu Wenruo <quwenruo.btrfs@gmx.com> 于2022年8月28日周日 17:54写道: > > > > > > > > On 2022/8/28 16:53, li zhang wrote: > > > Hi, I'm a bit confused, do you mean if you open a zoned device > > > without O_DIRECT it will fail? > > > > Not a zoned device expert, but to my understanding, if we write into > > zoned device, without O_DIRECT, there is no guarantee that the data you > > submitted will end at the same bytenr you specified. > > > > E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. > > > > Without O_DIRECT, the zoned code can re-locate the bytenr to any range > > after the write pointer inside the same zone. > > > > AKA, for zoned device, without O_DIRECT (queue length 1), you can only > > known the real physical bytenr after the write has fully finished. > > > > (The final physical bytenr is determined by the zoned device, no longer > > the write initiator). > > > > > > > > I tested and found that if I open a device with the O_DIRECT flag > > > on a virtual device like a loop device, the device cannot be written > > > to, but with or without O_DIRECT, it works fine on a real > > > device (for me, I only test A normal block device since I don't have > > > any zoned devices) > > > > IIRC currently there is no zoned emulation for loop device. > > > > If you want to test zoned device, you can use null block kernel module, > > with fully memory backed storage: > > > > https://zonedstorage.io/docs/getting-started/nullblk > > > > > > Or go a little further, using tcmu-runner to create file backed zoned > > device: > > > > https://zonedstorage.io/docs/tools/tcmu-runner > > > > > > > > If we use the same flags for all devices, > > > does that mean we can't use mkfs.btrfs > > > on both real and virtual devices at the same time. > > > > > > > > > Below is my test program and test results. > > > > > > code(main idea): > > > printf("filename:%s.\n", argv[1]); > > > int fd = open(argv[1], O_RDWR | O_DIRECT); > > > if (fd < 0) { > > > printf("fd:error.\n"); > > > return -1; > > > } > > > int num = write(fd, "123", 3); > > > printf("num:%d.\n", num); > > > > O_DIRECT requires strict memory alignment, obviously the length 3 is not > > properly aligned. > > > > Please check open(2p) for the full requirement. > > > > For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT > > can work correctly. > > > > > > Back to btrfs-progs work, I'd say before we do anything, let's check all > > the devices passed in to determine if we want zoned mode (any zoned > > device should make it zoned). 
> > > > Then we can determine the open flags for all devices, and for regular > > devices, O_DIRECT mostly makes no difference (maybe a little slower, but > > may not even be observable). > > > > Thanks, > > Qu > > > > > > > close(fd); > > > > > > result: > > > $ sudo losetup /dev/loop1 loopDev/loop1 > > > $ sudo ./a.out /dev/loop1 > > > filename:/dev/loop1. > > > num:-1. > > > # cannot write to loop1 > > > > > > > > > Thanks, > > > Li Zhang > > > > > > Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: > > >> > > >> On 25.08.22 07:20, Qu Wenruo wrote: > > >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > > >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > > >>> Do we need to treat the initial and other devices differently? > > >>> > > >>> Can't we use the same flags for all devices? > > >>> > > >>> > > >> > > >> Yep we need to have the same flags for all devices. Otherwise only > > >> device 0 will be opened with O_DIRECT, in case of a host-managed one and > > >> the subsequent will be opened without O_DIRECT causing mkfs to fail.
On 2022/8/28 22:26, li zhang wrote: > Yes, I see what you mean. > > There is no doubt that the loop device is not a zone device. > I simulated the zone device with the null_blk module and tested > mkfs.btrfs, but an error was reported. In addition, Not only > mkfs.btrfs does not work on null_blk zoned devices, mkfs.xfs and mkfs.ext2 also > do not work on null_blk zoned devices, here is the test log. My first > instinct is > the null_blk problem . But I didn't test tcmu-runner, I'll dig into it > later anyway. Please get an overview of what zoned device can and can not in the first place: https://zonedstorage.io/docs/introduction/zoned-storage In short, for zoned device it can not do any overwrite. Johannes, please correct me if I'm wrong, it's only allowed to submit write which bytenr is at (or beyond?) the write pointer inside a zone. Thus that's why there are only very limited filesystems supporting zoned device for now. For current btrfs, we have mandatory metadata COW, thus can ensure all our metadata are allocated in ascending bytenr, and uses queue depth 1 to make sure all our metadata can be written exactly where we specify. For btrfs data, we let the zoned device to decide where the data should be, and record the new bytenr returned by the zoned device into our metadata (and follow above metadata write behavior to write them). For btrfs super blocks, there are two (?) dedicated zones for superblocks, we write super blocks into one zone like a ring buffer. (Thus at mount we need to read the whole zone to find the newest copy) So mkfs.xfs is *supposed* to fail, that's nothing new. There are tons of things which can lead to write before the write pointer, like to update the super block. > > > #emulate zoned device using null_blk > $ sudo modprobe null_blk nr_devices=4 zoned=1 > > #mkfs.xfs failed > $ sudo mkfs.xfs -V > mkfs.xfs version 5.18.0 > $ sudo mkfs.xfs /dev/nullb0 -f > meta-data=/dev/nullb0 isize=512 agcount=4, agsize=16384000 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=0 > = reflink=1 bigtime=1 inobtcount=1 > data = bsize=4096 blocks=65536000, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=32000, version=2 > = sectsz=512 sunit=0 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x1f3fff00/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > mkfs.xfs: libxfs_device_zero write failed: Input/output error That's expected. > > #mkfs.btrfs failed > $ sudo mkfs.btrfs --version > mkfs.btrfs, part of btrfs-progs v5.19 > $ sudo mkfs.btrfs -d single -m single -O zoned /dev/nullb0 /dev/nullb1 > /dev/nullb2 -f > btrfs-progs v5.19 > See http://btrfs.wiki.kernel.org for more information. > > Resetting device zones /dev/nullb0 (1000 zones) ... > Resetting device zones /dev/nullb2 (1000 zones) ... > Resetting device zones /dev/nullb1 (1000 zones) ... 
> NOTE: several default settings have changed in version 5.15, please make sure > this does not affect your deployments: > - DUP for metadata (-m dup) > - enabled no-holes (-O no-holes) > - enabled free-space-tree (-R free-space-tree) > > No valid Btrfs found on /dev/nullb0 This looks like you're using null_blk in discard mode (aka, all writes are just discarded). You need to specify the memory_backed param to let it remember what you have written. To Johannes, maybe you want to update the null_blk page to specify the memory_backed param? With that specified, it works fine in my test env: # modprobe null_blk nr_devices=1 zoned=1 zone_size=128 gb=1 memory_backed=1 # mkfs.btrfs -f /dev/nullb0 -m single -d single btrfs-progs v5.18.1 See http://btrfs.wiki.kernel.org for more information. Zoned: /dev/nullb0: host-managed device detected, setting zoned feature Resetting device zones /dev/nullb0 (8 zones) ... NOTE: several default settings have changed in version 5.15, please make sure this does not affect your deployments: - DUP for metadata (-m dup) - enabled no-holes (-O no-holes) - enabled free-space-tree (-R free-space-tree) Label: (null) UUID: d75978cc-cfff-4acd-abb3-5f8023d4f12f Node size: 16384 Sector size: 4096 Filesystem size: 1.00GiB Block group profiles: Data: single 128.00MiB Metadata: single 128.00MiB System: single 128.00MiB SSD detected: yes Zoned device: yes Zone size: 128.00MiB Incompat features: extref, skinny-metadata, no-holes, zoned Runtime features: free-space-tree Checksum: crc32c Number of devices: 1 Devices: ID SIZE PATH 1 1.00GiB /dev/nullb0 # mount /dev/nullb0 /mnt/btrfs/ Thanks, Qu > ERROR: open ctree failed > > #mkfs.ext2 failed > $ sudo mke2fs -V > mke2fs 1.46.5 (30-Dec-2021) > Using EXT2FS Library version 1.46.5 > $ sudo mke2fs /dev/nullb0 > mke2fs 1.46.5 (30-Dec-2021) > Creating filesystem with 65536000 4k blocks and 16384000 inodes > Filesystem UUID: 747350a2-a1d5-4944-9f46-0fe4ca76df9d > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000, 7962624, 11239424, 20480000, 23887872 > > Allocating group tables: done > Writing inode tables: done > Writing superblocks and filesystem accounting information: mke2fs: > Input/output error while writing out and closing file system > > > > thanks, > Li Zhang > > Qu Wenruo <quwenruo.btrfs@gmx.com> 于2022年8月28日周日 17:54写道: >> >> >> >> On 2022/8/28 16:53, li zhang wrote: >>> Hi, I'm a bit confused, do you mean if you open a zoned device >>> without O_DIRECT it will fail? >> >> Not a zoned device expert, but to my understanding, if we write into >> zoned device, without O_DIRECT, there is no guarantee that the data you >> submitted will end at the same bytenr you specified. >> >> E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. >> >> Without O_DIRECT, the zoned code can re-locate the bytenr to any range >> after the write pointer inside the same zone. >> >> AKA, for zoned device, without O_DIRECT (queue length 1), you can only >> known the real physical bytenr after the write has fully finished. >> >> (The final physical bytenr is determined by the zoned device, no longer >> the write initiator). 
>> >>> >>> I tested and found that if I open a device with the O_DIRECT flag >>> on a virtual device like a loop device, the device cannot be written >>> to, but with or without O_DIRECT, it works fine on a real >>> device (for me, I only test A normal block device since I don't have >>> any zoned devices) >> >> IIRC currently there is no zoned emulation for loop device. >> >> If you want to test zoned device, you can use null block kernel module, >> with fully memory backed storage: >> >> https://zonedstorage.io/docs/getting-started/nullblk >> >> >> Or go a little further, using tcmu-runner to create file backed zoned >> device: >> >> https://zonedstorage.io/docs/tools/tcmu-runner >> >>> >>> If we use the same flags for all devices, >>> does that mean we can't use mkfs.btrfs >>> on both real and virtual devices at the same time. >>> >>> >>> Below is my test program and test results. >>> >>> code(main idea): >>> printf("filename:%s.\n", argv[1]); >>> int fd = open(argv[1], O_RDWR | O_DIRECT); >>> if (fd < 0) { >>> printf("fd:error.\n"); >>> return -1; >>> } >>> int num = write(fd, "123", 3); >>> printf("num:%d.\n", num); >> >> O_DIRECT requires strict memory alignment, obviously the length 3 is not >> properly aligned. >> >> Please check open(2p) for the full requirement. >> >> For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT >> can work correctly. >> >> >> Back to btrfs-progs work, I'd say before we do anything, let's check all >> the devices passed in to determine if we want zoned mode (any zoned >> device should make it zoned). >> >> Then we can determine the open flags for all devices, and for regular >> devices, O_DIRECT mostly makes no difference (maybe a little slower, but >> may not even be observable). >> >> Thanks, >> Qu >> >> >>> close(fd); >>> >>> result: >>> $ sudo losetup /dev/loop1 loopDev/loop1 >>> $ sudo ./a.out /dev/loop1 >>> filename:/dev/loop1. >>> num:-1. >>> # cannot write to loop1 >>> >>> >>> Thanks, >>> Li Zhang >>> >>> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: >>>> >>>> On 25.08.22 07:20, Qu Wenruo wrote: >>>>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>>>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >>>>> Do we need to treat the initial and other devices differently? >>>>> >>>>> Can't we use the same flags for all devices? >>>>> >>>>> >>>> >>>> Yep we need to have the same flags for all devices. Otherwise only >>>> device 0 will be opened with O_DIRECT, in case of a host-managed one and >>>> the subsequent will be opened without O_DIRECT causing mkfs to fail.
diff --git a/mkfs/main.c b/mkfs/main.c index ce096d3..35fefe2 100644 --- a/mkfs/main.c +++ b/mkfs/main.c @@ -31,6 +31,7 @@ #include <uuid/uuid.h> #include <ctype.h> #include <blkid/blkid.h> +#include <pthread.h> #include "kernel-shared/ctree.h" #include "kernel-shared/disk-io.h" #include "kernel-shared/free-space-tree.h" @@ -60,6 +61,18 @@ struct mkfs_allocation { u64 system; }; + +struct prepare_device_progress { + char *file; + u64 dev_block_count; + u64 block_count; + bool zero_end; + bool discard; + bool zoned; + int oflags; + int ret; +}; + static int create_metadata_block_groups(struct btrfs_root *root, bool mixed, struct mkfs_allocation *allocation) { @@ -969,6 +982,28 @@ fail: return ret; } +static void *prepare_one_dev(void *ctx) +{ + struct prepare_device_progress *prepare_ctx = ctx; + int fd; + + fd = open(prepare_ctx->file, prepare_ctx->oflags); + if (fd < 0) { + error("unable to open %s: %m", prepare_ctx->file); + prepare_ctx->ret = fd; + return NULL; + } + prepare_ctx->ret = btrfs_prepare_device(fd, + prepare_ctx->file, &prepare_ctx->dev_block_count, + prepare_ctx->block_count, + (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | + (prepare_ctx->zero_end ? PREP_DEVICE_ZERO_END : 0) | + (prepare_ctx->discard ? PREP_DEVICE_DISCARD : 0) | + (prepare_ctx->zoned ? PREP_DEVICE_ZONED : 0)); + close(fd); + return NULL; +} + int BOX_MAIN(mkfs)(int argc, char **argv) { char *file; @@ -997,7 +1032,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv) bool ssd = false; bool zoned = false; bool force_overwrite = false; - int oflags; char *source_dir = NULL; bool source_dir_set = false; bool shrink_rootdir = false; @@ -1006,6 +1040,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv) u64 shrink_size; int dev_cnt = 0; int saved_optind; + pthread_t *t_prepare = NULL; + struct prepare_device_progress *prepare_ctx = NULL; char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 }; u64 features = BTRFS_MKFS_DEFAULT_FEATURES; u64 runtime_features = BTRFS_MKFS_DEFAULT_RUNTIME_FEATURES; @@ -1428,29 +1464,45 @@ int BOX_MAIN(mkfs)(int argc, char **argv) goto error; } - dev_cnt--; - - oflags = O_RDWR; - if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) - oflags |= O_DIRECT; + t_prepare = malloc(dev_cnt * sizeof(*t_prepare)); + prepare_ctx = malloc(dev_cnt * sizeof(*prepare_ctx)); - /* - * Open without O_EXCL so that the problem should not occur by the - * following operation in kernel: - * (btrfs_register_one_device() fails if O_EXCL is on) - */ - fd = open(file, oflags); - if (fd < 0) { - error("unable to open %s: %m", file); + if (!t_prepare || !prepare_ctx) { + error("unable to prepare dev"); goto error; } - ret = btrfs_prepare_device(fd, file, &dev_block_count, block_count, - (zero_end ? PREP_DEVICE_ZERO_END : 0) | - (discard ? PREP_DEVICE_DISCARD : 0) | - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | - (zoned ? 
PREP_DEVICE_ZONED : 0)); + + for (i = 0; i < dev_cnt; i++) { + prepare_ctx[i].file = argv[optind + i - 1]; + prepare_ctx[i].block_count = block_count; + prepare_ctx[i].dev_block_count = block_count; + prepare_ctx[i].zero_end = zero_end; + prepare_ctx[i].discard = discard; + prepare_ctx[i].zoned = zoned; + if (i == 0) { + prepare_ctx[i].oflags = O_RDWR; + /* + * Open without O_EXCL so that the problem should + * not occur by the following operation in kernel: + * (btrfs_register_one_device() fails if O_EXCL is on) + */ + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; + } else { + prepare_ctx[i].oflags = O_RDWR; + } + ret = pthread_create(&t_prepare[i], NULL, + prepare_one_dev, &prepare_ctx[i]); + } + pthread_join(t_prepare[0], NULL); + ret = prepare_ctx[0].ret; + if (ret) goto error; + + dev_cnt--; + fd = open(file, prepare_ctx[0].oflags); + dev_block_count = prepare_ctx[0].dev_block_count; if (block_count && block_count > dev_block_count) { error("%s is smaller than requested size, expected %llu, found %llu", file, (unsigned long long)block_count, @@ -1459,7 +1511,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv) } /* To create the first block group and chunk 0 in make_btrfs */ - system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; + system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; if (dev_block_count < system_group_size) { error("device is too small to make filesystem, must be at least %llu", (unsigned long long)system_group_size); @@ -1557,6 +1609,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv) if (dev_cnt == 0) goto raid_groups; + for (i = 0 ; i < dev_cnt; i++) { + pthread_join(t_prepare[i+1], NULL); + if (prepare_ctx[i+1].ret) { + goto error; + } + } while (dev_cnt-- > 0) { file = argv[optind++]; @@ -1578,12 +1636,9 @@ int BOX_MAIN(mkfs)(int argc, char **argv) close(fd); continue; } - ret = btrfs_prepare_device(fd, file, &dev_block_count, - block_count, - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | - (zero_end ? PREP_DEVICE_ZERO_END : 0) | - (discard ? PREP_DEVICE_DISCARD : 0) | - (zoned ? PREP_DEVICE_ZONED : 0)); + dev_block_count = prepare_ctx[argc - saved_optind - dev_cnt - 1] + .dev_block_count; + if (ret) { goto error; } @@ -1763,12 +1818,16 @@ out: btrfs_close_all_devices(); free(label); - + free(t_prepare); + free(prepare_ctx); return !!ret; + error: if (fd > 0) close(fd); + free(t_prepare); + free(prepare_ctx); free(label); exit(1); success:
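As a reference for the "wait for them all" variant Qu suggests in the review, the two join sites in the diff above could collapse into a single loop, roughly like this (a sketch only, keeping the patch's t_prepare/prepare_ctx arrays and error label):

    ret = 0;
    for (i = 0; i < dev_cnt; i++) {
            pthread_join(t_prepare[i], NULL);
            /* keep the first error but still reap every thread */
            if (prepare_ctx[i].ret && !ret)
                    ret = prepare_ctx[i].ret;
    }
    if (ret)
            goto error;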
[enhancement] When a disk is formatted as btrfs, it calls btrfs_prepare_device for each device, which takes too much time. [implementation] Run each btrfs_prepare_device call in its own thread, wait for the first thread to complete before creating the filesystem, and wait for the other threads to complete before adding the remaining devices to the filesystem. [test] Using the btrfs-progs mkfs-tests test cases, mkfs.btrfs works fine. But I don't have an actual zoned device, so I don't know how much time this saves. If you guys have a way to test it, please let me know. Signed-off-by: Li Zhang <zhanglikernel@gmail.com> --- Issue: 496 mkfs/main.c | 113 +++++++++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 86 insertions(+), 27 deletions(-)