Message ID | 1661357103-22735-1-git-send-email-zhanglikernel@gmail.com (mailing list archive)
---|---
State | New, archived
Series | Make btrfs_prepare_device parallel during mkfs.btrfs
On 2022/8/25 00:05, Li Zhang wrote: > [enhancement] > When a disk is formatted as btrfs, it calls > btrfs_prepare_device for each device, which takes too much time. The idea is awesome. > > [implementation] > Put each btrfs_prepare_device into a thread, > wait for the first thread to complete to mkfs.btrfs, > and wait for other threads to complete before adding > other devices to the file system. > > [test] > Using the btrfs-progs test case mkfs-tests, mkfs.btrfs works fine. > > But I don't have an actual zoed device, > so I don't know how much time it saves, If you guys > have a way to test it, please let me know. > > Signed-off-by: Li Zhang <zhanglikernel@gmail.com> > --- > Issue: 496 > > mkfs/main.c | 113 +++++++++++++++++++++++++++++++++++++++++++++--------------- > 1 file changed, 86 insertions(+), 27 deletions(-) > > diff --git a/mkfs/main.c b/mkfs/main.c > index ce096d3..35fefe2 100644 > --- a/mkfs/main.c > +++ b/mkfs/main.c > @@ -31,6 +31,7 @@ > #include <uuid/uuid.h> > #include <ctype.h> > #include <blkid/blkid.h> > +#include <pthread.h> > #include "kernel-shared/ctree.h" > #include "kernel-shared/disk-io.h" > #include "kernel-shared/free-space-tree.h" > @@ -60,6 +61,18 @@ struct mkfs_allocation { > u64 system; > }; > > + > +struct prepare_device_progress { > + char *file; > + u64 dev_block_count; > + u64 block_count; > + bool zero_end; > + bool discard; > + bool zoned; > + int oflags; A small nitpick. Aren't those 4 values the same shared by all devices? Thus I'm not sure if they need to be put into prepare_device_progress at all. IIRC, we may want some shared memory between all the threads: - A pthread_mutex Will be explained later - All the other shared infos like above flags/oflags It can be global or passed by some pointers. > + int ret; > +}; > + > static int create_metadata_block_groups(struct btrfs_root *root, bool mixed, > struct mkfs_allocation *allocation) > { > @@ -969,6 +982,28 @@ fail: > return ret; > } > > +static void *prepare_one_dev(void *ctx) > +{ > + struct prepare_device_progress *prepare_ctx = ctx; > + int fd; > + > + fd = open(prepare_ctx->file, prepare_ctx->oflags); > + if (fd < 0) { > + error("unable to open %s: %m", prepare_ctx->file); If we have no permission for all devices (pretty common in fact, e.g. forgot to use sudo), we will have multiple threads printing out the same time. Without a lock, the output will be a mess. Thus we may want a mutex, even it's just for synchronizing the output. > + prepare_ctx->ret = fd; > + return NULL; > + } > + prepare_ctx->ret = btrfs_prepare_device(fd, > + prepare_ctx->file, &prepare_ctx->dev_block_count, > + prepare_ctx->block_count, > + (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | > + (prepare_ctx->zero_end ? PREP_DEVICE_ZERO_END : 0) | > + (prepare_ctx->discard ? PREP_DEVICE_DISCARD : 0) | > + (prepare_ctx->zoned ? 
PREP_DEVICE_ZONED : 0)); > + close(fd); > + return NULL; > +} > + > int BOX_MAIN(mkfs)(int argc, char **argv) > { > char *file; > @@ -997,7 +1032,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > bool ssd = false; > bool zoned = false; > bool force_overwrite = false; > - int oflags; > char *source_dir = NULL; > bool source_dir_set = false; > bool shrink_rootdir = false; > @@ -1006,6 +1040,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > u64 shrink_size; > int dev_cnt = 0; > int saved_optind; > + pthread_t *t_prepare = NULL; > + struct prepare_device_progress *prepare_ctx = NULL; > char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 }; > u64 features = BTRFS_MKFS_DEFAULT_FEATURES; > u64 runtime_features = BTRFS_MKFS_DEFAULT_RUNTIME_FEATURES; > @@ -1428,29 +1464,45 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > goto error; > } > > - dev_cnt--; > - > - oflags = O_RDWR; > - if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > - oflags |= O_DIRECT; > + t_prepare = malloc(dev_cnt * sizeof(*t_prepare)); > + prepare_ctx = malloc(dev_cnt * sizeof(*prepare_ctx)); > > - /* > - * Open without O_EXCL so that the problem should not occur by the > - * following operation in kernel: > - * (btrfs_register_one_device() fails if O_EXCL is on) > - */ > - fd = open(file, oflags); > - if (fd < 0) { > - error("unable to open %s: %m", file); > + if (!t_prepare || !prepare_ctx) { > + error("unable to prepare dev"); Isn't this ENOMEM? The message doesn't seem to match the situation. > goto error; > } > - ret = btrfs_prepare_device(fd, file, &dev_block_count, block_count, > - (zero_end ? PREP_DEVICE_ZERO_END : 0) | > - (discard ? PREP_DEVICE_DISCARD : 0) | > - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | > - (zoned ? PREP_DEVICE_ZONED : 0)); > + > + for (i = 0; i < dev_cnt; i++) { > + prepare_ctx[i].file = argv[optind + i - 1]; > + prepare_ctx[i].block_count = block_count; > + prepare_ctx[i].dev_block_count = block_count; > + prepare_ctx[i].zero_end = zero_end; > + prepare_ctx[i].discard = discard; > + prepare_ctx[i].zoned = zoned; > + if (i == 0) { > + prepare_ctx[i].oflags = O_RDWR; > + /* > + * Open without O_EXCL so that the problem should > + * not occur by the following operation in kernel: > + * (btrfs_register_one_device() fails if O_EXCL is on) > + */ The comment seems out-dated, no O_EXCL involved anywhere. > + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; Do we need to treat the initial and other devices differently? Can't we use the same flags for all devices? > + } else { > + prepare_ctx[i].oflags = O_RDWR; > + } > + ret = pthread_create(&t_prepare[i], NULL, > + prepare_one_dev, &prepare_ctx[i]); > + } > + pthread_join(t_prepare[0], NULL); > + ret = prepare_ctx[0].ret; > + Can't we just wait for all devices? I don't think treating them different could have much benefit. Yes, we can have multiple-devices with different performance characteristics, thus if the first device is the fastest one, it may finish before all the others. But this also means, the first one can be the slowest. To me, parallel initialization is already a big enough improvement, and for the most common case, all the devices should have the same or similar performance characteristics, thus waiting for them all shouldn't cause much difference. 
> if (ret) > goto error; > + > + dev_cnt--; > + fd = open(file, prepare_ctx[0].oflags); > + dev_block_count = prepare_ctx[0].dev_block_count; > if (block_count && block_count > dev_block_count) { > error("%s is smaller than requested size, expected %llu, found %llu", > file, (unsigned long long)block_count, > @@ -1459,7 +1511,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > } > > /* To create the first block group and chunk 0 in make_btrfs */ > - system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; > + system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; > if (dev_block_count < system_group_size) { > error("device is too small to make filesystem, must be at least %llu", > (unsigned long long)system_group_size); > @@ -1557,6 +1609,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > if (dev_cnt == 0) > goto raid_groups; > > + for (i = 0 ; i < dev_cnt; i++) { > + pthread_join(t_prepare[i+1], NULL); > + if (prepare_ctx[i+1].ret) { > + goto error; > + } > + } > while (dev_cnt-- > 0) { > file = argv[optind++]; > > @@ -1578,12 +1636,9 @@ int BOX_MAIN(mkfs)(int argc, char **argv) > close(fd); > continue; > } > - ret = btrfs_prepare_device(fd, file, &dev_block_count, > - block_count, > - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | > - (zero_end ? PREP_DEVICE_ZERO_END : 0) | > - (discard ? PREP_DEVICE_DISCARD : 0) | > - (zoned ? PREP_DEVICE_ZONED : 0)); > + dev_block_count = prepare_ctx[argc - saved_optind - dev_cnt - 1] > + .dev_block_count; > + > if (ret) { > goto error; > } This goto error is a dead code now. Thanks for the great idea on reducing the preparation time! Qu > @@ -1763,12 +1818,16 @@ out: > > btrfs_close_all_devices(); > free(label); > - > + free(t_prepare); > + free(prepare_ctx); > return !!ret; > + > error: > if (fd > 0) > close(fd); > > + free(t_prepare); > + free(prepare_ctx); > free(label); > exit(1); > success:
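For illustration, a minimal sketch of the output mutex Qu suggests above, assuming a file-scope lock (the name output_mutex is made up here, it is not in the submitted patch); error() and struct prepare_device_progress are the ones from the patch, and only the error path of prepare_one_dev() is shown:

    static pthread_mutex_t output_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void *prepare_one_dev(void *ctx)
    {
            struct prepare_device_progress *prepare_ctx = ctx;
            int fd;

            fd = open(prepare_ctx->file, prepare_ctx->oflags);
            if (fd < 0) {
                    /* serialize the message so parallel failures do not interleave */
                    pthread_mutex_lock(&output_mutex);
                    error("unable to open %s: %m", prepare_ctx->file);
                    pthread_mutex_unlock(&output_mutex);
                    prepare_ctx->ret = fd;
                    return NULL;
            }
            /* btrfs_prepare_device() call and close(fd) unchanged from the patch */
            close(fd);
            return NULL;
    }

The same lock could also serialize the verbose progress output from btrfs_prepare_device() itself if that turns out to interleave.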
On 25.08.22 07:20, Qu Wenruo wrote: >> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > Do we need to treat the initial and other devices differently? > > Can't we use the same flags for all devices? > > Yep, we need to have the same flags for all devices. Otherwise, in the case of a host-managed device, only device 0 will be opened with O_DIRECT and the subsequent devices will be opened without O_DIRECT, causing mkfs to fail.
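A rough sketch of that direction, reusing the patch's loop variables and the existing zoned_model() helper (the single-pass flag selection below is illustrative, not code from the patch): decide the open flags once, then hand the same value to every prepare_ctx entry.

    /* Pick one set of open flags up front and reuse it for every device:
     * if any device is host-managed zoned, open all of them with O_DIRECT. */
    int oflags = O_RDWR;

    for (i = 0; i < dev_cnt; i++) {
            if (zoned && zoned_model(argv[optind + i - 1]) == ZONED_HOST_MANAGED) {
                    oflags |= O_DIRECT;
                    break;
            }
    }
    for (i = 0; i < dev_cnt; i++)
            prepare_ctx[i].oflags = oflags;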
On 24.08.22 18:06, Li Zhang wrote: > [enhancement] > When a disk is formatted as btrfs, it calls > btrfs_prepare_device for each device, which takes too much time. That really is awesome. I'll throw it onto my 60 zoned HDD test box, once all devices have the same open flags. [...] > + t_prepare = malloc(dev_cnt * sizeof(*t_prepare)); > + prepare_ctx = malloc(dev_cnt * sizeof(*prepare_ctx)); That really should be calloc().
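For reference, the calloc() version could look like the following (a sketch of the suggested change; the error message wording is made up). Zero-filled memory gives every .ret and .dev_block_count field a defined initial value even if a later pthread_create() fails.

    t_prepare = calloc(dev_cnt, sizeof(*t_prepare));
    prepare_ctx = calloc(dev_cnt, sizeof(*prepare_ctx));
    if (!t_prepare || !prepare_ctx) {
            error("not enough memory for device preparation contexts");
            goto error;
    }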
On 2022/8/25 16:31, Johannes Thumshirn wrote: > On 25.08.22 07:20, Qu Wenruo wrote: >>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >> Do we need to treat the initial and other devices differently? >> >> Can't we use the same flags for all devices? >> >> > > Yep we need to have the same flags for all devices. Otherwise only > device 0 will be opened with O_DIRECT, in case of a host-managed one and > the subsequent will be opened without O_DIRECT causing mkfs to fail. Just a little curious, currently btrfs doesn't support mixed traditional/zoned devices, right? So that O_DIRECT for all devices are for future mixed zoned mode? Anyway I'm completely fine if we can use the same oflags for all devices. Thanks, Qu
On 25.08.22 10:36, Qu Wenruo wrote: > > > On 2022/8/25 16:31, Johannes Thumshirn wrote: >> On 25.08.22 07:20, Qu Wenruo wrote: >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >>> Do we need to treat the initial and other devices differently? >>> >>> Can't we use the same flags for all devices? >>> >>> >> >> Yep we need to have the same flags for all devices. Otherwise only >> device 0 will be opened with O_DIRECT, in case of a host-managed one and >> the subsequent will be opened without O_DIRECT causing mkfs to fail. > > Just a little curious, currently btrfs doesn't support mixed > traditional/zoned devices, right? > > So that O_DIRECT for all devices are for future mixed zoned mode? We need it in case of multiple zoned devices as well. The mixed mode you describe above could actually work thanks to the zoned emulation we have in place. But I've never actually tried to be honest.
Hi, I'm a bit confused, do you mean if you open a zoned device without O_DIRECT it will fail? I tested and found that if I open a device with the O_DIRECT flag on a virtual device like a loop device, the device cannot be written to, but with or without O_DIRECT, it works fine on a real device (for me, I only test A normal block device since I don't have any zoned devices) If we use the same flags for all devices, does that mean we can't use mkfs.btrfs on both real and virtual devices at the same time. Below is my test program and test results. code(main idea): printf("filename:%s.\n", argv[1]); int fd = open(argv[1], O_RDWR | O_DIRECT); if (fd < 0) { printf("fd:error.\n"); return -1; } int num = write(fd, "123", 3); printf("num:%d.\n", num); close(fd); result: $ sudo losetup /dev/loop1 loopDev/loop1 $ sudo ./a.out /dev/loop1 filename:/dev/loop1. num:-1. # cannot write to loop1 Thanks, Li Zhang Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: > > On 25.08.22 07:20, Qu Wenruo wrote: > >> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > >> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > > Do we need to treat the initial and other devices differently? > > > > Can't we use the same flags for all devices? > > > > > > Yep we need to have the same flags for all devices. Otherwise only > device 0 will be opened with O_DIRECT, in case of a host-managed one and > the subsequent will be opened without O_DIRECT causing mkfs to fail.
On 2022/8/28 16:53, li zhang wrote: > Hi, I'm a bit confused, do you mean if you open a zoned device > without O_DIRECT it will fail? Not a zoned device expert, but to my understanding, if we write into zoned device, without O_DIRECT, there is no guarantee that the data you submitted will end at the same bytenr you specified. E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. Without O_DIRECT, the zoned code can re-locate the bytenr to any range after the write pointer inside the same zone. AKA, for zoned device, without O_DIRECT (queue length 1), you can only known the real physical bytenr after the write has fully finished. (The final physical bytenr is determined by the zoned device, no longer the write initiator). > > I tested and found that if I open a device with the O_DIRECT flag > on a virtual device like a loop device, the device cannot be written > to, but with or without O_DIRECT, it works fine on a real > device (for me, I only test A normal block device since I don't have > any zoned devices) IIRC currently there is no zoned emulation for loop device. If you want to test zoned device, you can use null block kernel module, with fully memory backed storage: https://zonedstorage.io/docs/getting-started/nullblk Or go a little further, using tcmu-runner to create file backed zoned device: https://zonedstorage.io/docs/tools/tcmu-runner > > If we use the same flags for all devices, > does that mean we can't use mkfs.btrfs > on both real and virtual devices at the same time. > > > Below is my test program and test results. > > code(main idea): > printf("filename:%s.\n", argv[1]); > int fd = open(argv[1], O_RDWR | O_DIRECT); > if (fd < 0) { > printf("fd:error.\n"); > return -1; > } > int num = write(fd, "123", 3); > printf("num:%d.\n", num); O_DIRECT requires strict memory alignment, obviously the length 3 is not properly aligned. Please check open(2p) for the full requirement. For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT can work correctly. Back to btrfs-progs work, I'd say before we do anything, let's check all the devices passed in to determine if we want zoned mode (any zoned device should make it zoned). Then we can determine the open flags for all devices, and for regular devices, O_DIRECT mostly makes no difference (maybe a little slower, but may not even be observable). Thanks, Qu > close(fd); > > result: > $ sudo losetup /dev/loop1 loopDev/loop1 > $ sudo ./a.out /dev/loop1 > filename:/dev/loop1. > num:-1. > # cannot write to loop1 > > > Thanks, > Li Zhang > > Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: >> >> On 25.08.22 07:20, Qu Wenruo wrote: >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >>> Do we need to treat the initial and other devices differently? >>> >>> Can't we use the same flags for all devices? >>> >>> >> >> Yep we need to have the same flags for all devices. Otherwise only >> device 0 will be opened with O_DIRECT, in case of a host-managed one and >> the subsequent will be opened without O_DIRECT causing mkfs to fail.
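To illustrate the alignment requirement against Li's loop-device test, a version of that test program that should work with O_DIRECT might look like this (a standalone sketch, not btrfs-progs code; buffer address, length and file offset are all 4K-aligned, and on a zoned device the offset would additionally have to sit at a zone's write pointer):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            void *buf;
            ssize_t num;
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <device>\n", argv[0]);
                    return 1;
            }
            printf("filename:%s.\n", argv[1]);
            fd = open(argv[1], O_RDWR | O_DIRECT);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* O_DIRECT needs an aligned buffer address, length and file offset */
            if (posix_memalign(&buf, 4096, 4096)) {
                    close(fd);
                    return 1;
            }
            memset(buf, 0, 4096);
            memcpy(buf, "123", 3);
            num = pwrite(fd, buf, 4096, 0);
            printf("num:%zd.\n", num);
            free(buf);
            close(fd);
            return 0;
    }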
Yes, I see what you mean. There is no doubt that the loop device is not a zone device. I simulated the zone device with the null_blk module and tested mkfs.btrfs, but an error was reported. In addition, Not only mkfs.btrfs does not work on null_blk zoned devices, mkfs.xfs and mkfs.ext2 also do not work on null_blk zoned devices, here is the test log. My first instinct is the null_blk problem . But I didn't test tcmu-runner, I'll dig into it later anyway. #emulate zoned device using null_blk $ sudo modprobe null_blk nr_devices=4 zoned=1 #mkfs.xfs failed $ sudo mkfs.xfs -V mkfs.xfs version 5.18.0 $ sudo mkfs.xfs /dev/nullb0 -f meta-data=/dev/nullb0 isize=512 agcount=4, agsize=16384000 blks = sectsz=512 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=1 bigtime=1 inobtcount=1 data = bsize=4096 blocks=65536000, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=32000, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 mkfs.xfs: pwrite failed: Input/output error libxfs_bwrite: write failed on (unknown) bno 0x1f3fff00/0x100, err=5 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! mkfs.xfs: pwrite failed: Input/output error libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5 mkfs.xfs: Releasing dirty buffer to free list! found dirty buffer (bulk) on free list! mkfs.xfs: pwrite failed: Input/output error libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5 mkfs.xfs: Releasing dirty buffer to free list! mkfs.xfs: libxfs_device_zero write failed: Input/output error #mkfs.btrfs failed $ sudo mkfs.btrfs --version mkfs.btrfs, part of btrfs-progs v5.19 $ sudo mkfs.btrfs -d single -m single -O zoned /dev/nullb0 /dev/nullb1 /dev/nullb2 -f btrfs-progs v5.19 See http://btrfs.wiki.kernel.org for more information. Resetting device zones /dev/nullb0 (1000 zones) ... Resetting device zones /dev/nullb2 (1000 zones) ... Resetting device zones /dev/nullb1 (1000 zones) ... NOTE: several default settings have changed in version 5.15, please make sure this does not affect your deployments: - DUP for metadata (-m dup) - enabled no-holes (-O no-holes) - enabled free-space-tree (-R free-space-tree) No valid Btrfs found on /dev/nullb0 ERROR: open ctree failed #mkfs.ext2 failed $ sudo mke2fs -V mke2fs 1.46.5 (30-Dec-2021) Using EXT2FS Library version 1.46.5 $ sudo mke2fs /dev/nullb0 mke2fs 1.46.5 (30-Dec-2021) Creating filesystem with 65536000 4k blocks and 16384000 inodes Filesystem UUID: 747350a2-a1d5-4944-9f46-0fe4ca76df9d Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872 Allocating group tables: done Writing inode tables: done Writing superblocks and filesystem accounting information: mke2fs: Input/output error while writing out and closing file system thanks, Li Zhang Qu Wenruo <quwenruo.btrfs@gmx.com> 于2022年8月28日周日 17:54写道: > > > > On 2022/8/28 16:53, li zhang wrote: > > Hi, I'm a bit confused, do you mean if you open a zoned device > > without O_DIRECT it will fail? > > Not a zoned device expert, but to my understanding, if we write into > zoned device, without O_DIRECT, there is no guarantee that the data you > submitted will end at the same bytenr you specified. > > E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. 
> > Without O_DIRECT, the zoned code can re-locate the bytenr to any range > after the write pointer inside the same zone. > > AKA, for zoned device, without O_DIRECT (queue length 1), you can only > known the real physical bytenr after the write has fully finished. > > (The final physical bytenr is determined by the zoned device, no longer > the write initiator). > > > > > I tested and found that if I open a device with the O_DIRECT flag > > on a virtual device like a loop device, the device cannot be written > > to, but with or without O_DIRECT, it works fine on a real > > device (for me, I only test A normal block device since I don't have > > any zoned devices) > > IIRC currently there is no zoned emulation for loop device. > > If you want to test zoned device, you can use null block kernel module, > with fully memory backed storage: > > https://zonedstorage.io/docs/getting-started/nullblk > > > Or go a little further, using tcmu-runner to create file backed zoned > device: > > https://zonedstorage.io/docs/tools/tcmu-runner > > > > > If we use the same flags for all devices, > > does that mean we can't use mkfs.btrfs > > on both real and virtual devices at the same time. > > > > > > Below is my test program and test results. > > > > code(main idea): > > printf("filename:%s.\n", argv[1]); > > int fd = open(argv[1], O_RDWR | O_DIRECT); > > if (fd < 0) { > > printf("fd:error.\n"); > > return -1; > > } > > int num = write(fd, "123", 3); > > printf("num:%d.\n", num); > > O_DIRECT requires strict memory alignment, obviously the length 3 is not > properly aligned. > > Please check open(2p) for the full requirement. > > For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT > can work correctly. > > > Back to btrfs-progs work, I'd say before we do anything, let's check all > the devices passed in to determine if we want zoned mode (any zoned > device should make it zoned). > > Then we can determine the open flags for all devices, and for regular > devices, O_DIRECT mostly makes no difference (maybe a little slower, but > may not even be observable). > > Thanks, > Qu > > > > close(fd); > > > > result: > > $ sudo losetup /dev/loop1 loopDev/loop1 > > $ sudo ./a.out /dev/loop1 > > filename:/dev/loop1. > > num:-1. > > # cannot write to loop1 > > > > > > Thanks, > > Li Zhang > > > > Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: > >> > >> On 25.08.22 07:20, Qu Wenruo wrote: > >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > >>> Do we need to treat the initial and other devices differently? > >>> > >>> Can't we use the same flags for all devices? > >>> > >>> > >> > >> Yep we need to have the same flags for all devices. Otherwise only > >> device 0 will be opened with O_DIRECT, in case of a host-managed one and > >> the subsequent will be opened without O_DIRECT causing mkfs to fail.
By the way, my kernel version is 5.19.0, and I also tested the 5.0 version (maybe, I only remember that the version starts with 5), the same error output thanks, Li Zhang li zhang <zhanglikernel@gmail.com> 于2022年8月28日周日 22:26写道: > > Yes, I see what you mean. > > There is no doubt that the loop device is not a zone device. > I simulated the zone device with the null_blk module and tested > mkfs.btrfs, but an error was reported. In addition, Not only > mkfs.btrfs does not work on null_blk zoned devices, mkfs.xfs and mkfs.ext2 also > do not work on null_blk zoned devices, here is the test log. My first > instinct is > the null_blk problem . But I didn't test tcmu-runner, I'll dig into it > later anyway. > > > #emulate zoned device using null_blk > $ sudo modprobe null_blk nr_devices=4 zoned=1 > > #mkfs.xfs failed > $ sudo mkfs.xfs -V > mkfs.xfs version 5.18.0 > $ sudo mkfs.xfs /dev/nullb0 -f > meta-data=/dev/nullb0 isize=512 agcount=4, agsize=16384000 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=0 > = reflink=1 bigtime=1 inobtcount=1 > data = bsize=4096 blocks=65536000, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=32000, version=2 > = sectsz=512 sunit=0 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x1f3fff00/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > mkfs.xfs: libxfs_device_zero write failed: Input/output error > > #mkfs.btrfs failed > $ sudo mkfs.btrfs --version > mkfs.btrfs, part of btrfs-progs v5.19 > $ sudo mkfs.btrfs -d single -m single -O zoned /dev/nullb0 /dev/nullb1 > /dev/nullb2 -f > btrfs-progs v5.19 > See http://btrfs.wiki.kernel.org for more information. > > Resetting device zones /dev/nullb0 (1000 zones) ... > Resetting device zones /dev/nullb2 (1000 zones) ... > Resetting device zones /dev/nullb1 (1000 zones) ... 
> NOTE: several default settings have changed in version 5.15, please make sure > this does not affect your deployments: > - DUP for metadata (-m dup) > - enabled no-holes (-O no-holes) > - enabled free-space-tree (-R free-space-tree) > > No valid Btrfs found on /dev/nullb0 > ERROR: open ctree failed > > #mkfs.ext2 failed > $ sudo mke2fs -V > mke2fs 1.46.5 (30-Dec-2021) > Using EXT2FS Library version 1.46.5 > $ sudo mke2fs /dev/nullb0 > mke2fs 1.46.5 (30-Dec-2021) > Creating filesystem with 65536000 4k blocks and 16384000 inodes > Filesystem UUID: 747350a2-a1d5-4944-9f46-0fe4ca76df9d > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000, 7962624, 11239424, 20480000, 23887872 > > Allocating group tables: done > Writing inode tables: done > Writing superblocks and filesystem accounting information: mke2fs: > Input/output error while writing out and closing file system > > > > thanks, > Li Zhang > > Qu Wenruo <quwenruo.btrfs@gmx.com> 于2022年8月28日周日 17:54写道: > > > > > > > > On 2022/8/28 16:53, li zhang wrote: > > > Hi, I'm a bit confused, do you mean if you open a zoned device > > > without O_DIRECT it will fail? > > > > Not a zoned device expert, but to my understanding, if we write into > > zoned device, without O_DIRECT, there is no guarantee that the data you > > submitted will end at the same bytenr you specified. > > > > E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. > > > > Without O_DIRECT, the zoned code can re-locate the bytenr to any range > > after the write pointer inside the same zone. > > > > AKA, for zoned device, without O_DIRECT (queue length 1), you can only > > known the real physical bytenr after the write has fully finished. > > > > (The final physical bytenr is determined by the zoned device, no longer > > the write initiator). > > > > > > > > I tested and found that if I open a device with the O_DIRECT flag > > > on a virtual device like a loop device, the device cannot be written > > > to, but with or without O_DIRECT, it works fine on a real > > > device (for me, I only test A normal block device since I don't have > > > any zoned devices) > > > > IIRC currently there is no zoned emulation for loop device. > > > > If you want to test zoned device, you can use null block kernel module, > > with fully memory backed storage: > > > > https://zonedstorage.io/docs/getting-started/nullblk > > > > > > Or go a little further, using tcmu-runner to create file backed zoned > > device: > > > > https://zonedstorage.io/docs/tools/tcmu-runner > > > > > > > > If we use the same flags for all devices, > > > does that mean we can't use mkfs.btrfs > > > on both real and virtual devices at the same time. > > > > > > > > > Below is my test program and test results. > > > > > > code(main idea): > > > printf("filename:%s.\n", argv[1]); > > > int fd = open(argv[1], O_RDWR | O_DIRECT); > > > if (fd < 0) { > > > printf("fd:error.\n"); > > > return -1; > > > } > > > int num = write(fd, "123", 3); > > > printf("num:%d.\n", num); > > > > O_DIRECT requires strict memory alignment, obviously the length 3 is not > > properly aligned. > > > > Please check open(2p) for the full requirement. > > > > For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT > > can work correctly. > > > > > > Back to btrfs-progs work, I'd say before we do anything, let's check all > > the devices passed in to determine if we want zoned mode (any zoned > > device should make it zoned). 
> > > > Then we can determine the open flags for all devices, and for regular > > devices, O_DIRECT mostly makes no difference (maybe a little slower, but > > may not even be observable). > > > > Thanks, > > Qu > > > > > > > close(fd); > > > > > > result: > > > $ sudo losetup /dev/loop1 loopDev/loop1 > > > $ sudo ./a.out /dev/loop1 > > > filename:/dev/loop1. > > > num:-1. > > > # cannot write to loop1 > > > > > > > > > Thanks, > > > Li Zhang > > > > > > Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: > > >> > > >> On 25.08.22 07:20, Qu Wenruo wrote: > > >>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) > > >>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; > > >>> Do we need to treat the initial and other devices differently? > > >>> > > >>> Can't we use the same flags for all devices? > > >>> > > >>> > > >> > > >> Yep we need to have the same flags for all devices. Otherwise only > > >> device 0 will be opened with O_DIRECT, in case of a host-managed one and > > >> the subsequent will be opened without O_DIRECT causing mkfs to fail.
On 2022/8/28 22:26, li zhang wrote: > Yes, I see what you mean. > > There is no doubt that the loop device is not a zone device. > I simulated the zone device with the null_blk module and tested > mkfs.btrfs, but an error was reported. In addition, Not only > mkfs.btrfs does not work on null_blk zoned devices, mkfs.xfs and mkfs.ext2 also > do not work on null_blk zoned devices, here is the test log. My first > instinct is > the null_blk problem . But I didn't test tcmu-runner, I'll dig into it > later anyway. Please get an overview of what zoned device can and can not in the first place: https://zonedstorage.io/docs/introduction/zoned-storage In short, for zoned device it can not do any overwrite. Johannes, please correct me if I'm wrong, it's only allowed to submit write which bytenr is at (or beyond?) the write pointer inside a zone. Thus that's why there are only very limited filesystems supporting zoned device for now. For current btrfs, we have mandatory metadata COW, thus can ensure all our metadata are allocated in ascending bytenr, and uses queue depth 1 to make sure all our metadata can be written exactly where we specify. For btrfs data, we let the zoned device to decide where the data should be, and record the new bytenr returned by the zoned device into our metadata (and follow above metadata write behavior to write them). For btrfs super blocks, there are two (?) dedicated zones for superblocks, we write super blocks into one zone like a ring buffer. (Thus at mount we need to read the whole zone to find the newest copy) So mkfs.xfs is *supposed* to fail, that's nothing new. There are tons of things which can lead to write before the write pointer, like to update the super block. > > > #emulate zoned device using null_blk > $ sudo modprobe null_blk nr_devices=4 zoned=1 > > #mkfs.xfs failed > $ sudo mkfs.xfs -V > mkfs.xfs version 5.18.0 > $ sudo mkfs.xfs /dev/nullb0 -f > meta-data=/dev/nullb0 isize=512 agcount=4, agsize=16384000 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=1 finobt=1, sparse=1, rmapbt=0 > = reflink=1 bigtime=1 inobtcount=1 > data = bsize=4096 blocks=65536000, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0, ftype=1 > log =internal log bsize=4096 blocks=32000, version=2 > = sectsz=512 sunit=0 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x1f3fff00/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on (unknown) bno 0x0/0x100, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > found dirty buffer (bulk) on free list! > mkfs.xfs: pwrite failed: Input/output error > libxfs_bwrite: write failed on xfs_sb bno 0x0/0x1, err=5 > mkfs.xfs: Releasing dirty buffer to free list! > mkfs.xfs: libxfs_device_zero write failed: Input/output error That's expected. > > #mkfs.btrfs failed > $ sudo mkfs.btrfs --version > mkfs.btrfs, part of btrfs-progs v5.19 > $ sudo mkfs.btrfs -d single -m single -O zoned /dev/nullb0 /dev/nullb1 > /dev/nullb2 -f > btrfs-progs v5.19 > See http://btrfs.wiki.kernel.org for more information. > > Resetting device zones /dev/nullb0 (1000 zones) ... > Resetting device zones /dev/nullb2 (1000 zones) ... > Resetting device zones /dev/nullb1 (1000 zones) ... 
> NOTE: several default settings have changed in version 5.15, please make sure > this does not affect your deployments: > - DUP for metadata (-m dup) > - enabled no-holes (-O no-holes) > - enabled free-space-tree (-R free-space-tree) > > No valid Btrfs found on /dev/nullb0 This looks like you're using null_blk in discard mode (aka, all writes are just discarded). You need to specify the memory_backed param to let it remember what you have written. To Johannes, maybe you want to update the null_blk page to specify the memory_backed param? With that specified, it works fine in my test env: # modprobe null_blk nr_devices=1 zoned=1 zone_size=128 gb=1 memory_backed=1 # mkfs.btrfs -f /dev/nullb0 -m single -d single btrfs-progs v5.18.1 See http://btrfs.wiki.kernel.org for more information. Zoned: /dev/nullb0: host-managed device detected, setting zoned feature Resetting device zones /dev/nullb0 (8 zones) ... NOTE: several default settings have changed in version 5.15, please make sure this does not affect your deployments: - DUP for metadata (-m dup) - enabled no-holes (-O no-holes) - enabled free-space-tree (-R free-space-tree) Label: (null) UUID: d75978cc-cfff-4acd-abb3-5f8023d4f12f Node size: 16384 Sector size: 4096 Filesystem size: 1.00GiB Block group profiles: Data: single 128.00MiB Metadata: single 128.00MiB System: single 128.00MiB SSD detected: yes Zoned device: yes Zone size: 128.00MiB Incompat features: extref, skinny-metadata, no-holes, zoned Runtime features: free-space-tree Checksum: crc32c Number of devices: 1 Devices: ID SIZE PATH 1 1.00GiB /dev/nullb0 # mount /dev/nullb0 /mnt/btrfs/ Thanks, Qu > ERROR: open ctree failed > > #mkfs.ext2 failed > $ sudo mke2fs -V > mke2fs 1.46.5 (30-Dec-2021) > Using EXT2FS Library version 1.46.5 > $ sudo mke2fs /dev/nullb0 > mke2fs 1.46.5 (30-Dec-2021) > Creating filesystem with 65536000 4k blocks and 16384000 inodes > Filesystem UUID: 747350a2-a1d5-4944-9f46-0fe4ca76df9d > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000, 7962624, 11239424, 20480000, 23887872 > > Allocating group tables: done > Writing inode tables: done > Writing superblocks and filesystem accounting information: mke2fs: > Input/output error while writing out and closing file system > > > > thanks, > Li Zhang > > Qu Wenruo <quwenruo.btrfs@gmx.com> 于2022年8月28日周日 17:54写道: >> >> >> >> On 2022/8/28 16:53, li zhang wrote: >>> Hi, I'm a bit confused, do you mean if you open a zoned device >>> without O_DIRECT it will fail? >> >> Not a zoned device expert, but to my understanding, if we write into >> zoned device, without O_DIRECT, there is no guarantee that the data you >> submitted will end at the same bytenr you specified. >> >> E.g. if you do a pwrite() with a 1M buffer, at device bytenr 4M. >> >> Without O_DIRECT, the zoned code can re-locate the bytenr to any range >> after the write pointer inside the same zone. >> >> AKA, for zoned device, without O_DIRECT (queue length 1), you can only >> known the real physical bytenr after the write has fully finished. >> >> (The final physical bytenr is determined by the zoned device, no longer >> the write initiator). 
>> >>> >>> I tested and found that if I open a device with the O_DIRECT flag >>> on a virtual device like a loop device, the device cannot be written >>> to, but with or without O_DIRECT, it works fine on a real >>> device (for me, I only test A normal block device since I don't have >>> any zoned devices) >> >> IIRC currently there is no zoned emulation for loop device. >> >> If you want to test zoned device, you can use null block kernel module, >> with fully memory backed storage: >> >> https://zonedstorage.io/docs/getting-started/nullblk >> >> >> Or go a little further, using tcmu-runner to create file backed zoned >> device: >> >> https://zonedstorage.io/docs/tools/tcmu-runner >> >>> >>> If we use the same flags for all devices, >>> does that mean we can't use mkfs.btrfs >>> on both real and virtual devices at the same time. >>> >>> >>> Below is my test program and test results. >>> >>> code(main idea): >>> printf("filename:%s.\n", argv[1]); >>> int fd = open(argv[1], O_RDWR | O_DIRECT); >>> if (fd < 0) { >>> printf("fd:error.\n"); >>> return -1; >>> } >>> int num = write(fd, "123", 3); >>> printf("num:%d.\n", num); >> >> O_DIRECT requires strict memory alignment, obviously the length 3 is not >> properly aligned. >> >> Please check open(2p) for the full requirement. >> >> For mkfs usage, all of our write is at least 4K aligned, thus O_DIRECT >> can work correctly. >> >> >> Back to btrfs-progs work, I'd say before we do anything, let's check all >> the devices passed in to determine if we want zoned mode (any zoned >> device should make it zoned). >> >> Then we can determine the open flags for all devices, and for regular >> devices, O_DIRECT mostly makes no difference (maybe a little slower, but >> may not even be observable). >> >> Thanks, >> Qu >> >> >>> close(fd); >>> >>> result: >>> $ sudo losetup /dev/loop1 loopDev/loop1 >>> $ sudo ./a.out /dev/loop1 >>> filename:/dev/loop1. >>> num:-1. >>> # cannot write to loop1 >>> >>> >>> Thanks, >>> Li Zhang >>> >>> Johannes Thumshirn <Johannes.Thumshirn@wdc.com> 于2022年8月25日周四 16:31写道: >>>> >>>> On 25.08.22 07:20, Qu Wenruo wrote: >>>>>> + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) >>>>>> + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; >>>>> Do we need to treat the initial and other devices differently? >>>>> >>>>> Can't we use the same flags for all devices? >>>>> >>>>> >>>> >>>> Yep we need to have the same flags for all devices. Otherwise only >>>> device 0 will be opened with O_DIRECT, in case of a host-managed one and >>>> the subsequent will be opened without O_DIRECT causing mkfs to fail.
diff --git a/mkfs/main.c b/mkfs/main.c index ce096d3..35fefe2 100644 --- a/mkfs/main.c +++ b/mkfs/main.c @@ -31,6 +31,7 @@ #include <uuid/uuid.h> #include <ctype.h> #include <blkid/blkid.h> +#include <pthread.h> #include "kernel-shared/ctree.h" #include "kernel-shared/disk-io.h" #include "kernel-shared/free-space-tree.h" @@ -60,6 +61,18 @@ struct mkfs_allocation { u64 system; }; + +struct prepare_device_progress { + char *file; + u64 dev_block_count; + u64 block_count; + bool zero_end; + bool discard; + bool zoned; + int oflags; + int ret; +}; + static int create_metadata_block_groups(struct btrfs_root *root, bool mixed, struct mkfs_allocation *allocation) { @@ -969,6 +982,28 @@ fail: return ret; } +static void *prepare_one_dev(void *ctx) +{ + struct prepare_device_progress *prepare_ctx = ctx; + int fd; + + fd = open(prepare_ctx->file, prepare_ctx->oflags); + if (fd < 0) { + error("unable to open %s: %m", prepare_ctx->file); + prepare_ctx->ret = fd; + return NULL; + } + prepare_ctx->ret = btrfs_prepare_device(fd, + prepare_ctx->file, &prepare_ctx->dev_block_count, + prepare_ctx->block_count, + (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | + (prepare_ctx->zero_end ? PREP_DEVICE_ZERO_END : 0) | + (prepare_ctx->discard ? PREP_DEVICE_DISCARD : 0) | + (prepare_ctx->zoned ? PREP_DEVICE_ZONED : 0)); + close(fd); + return NULL; +} + int BOX_MAIN(mkfs)(int argc, char **argv) { char *file; @@ -997,7 +1032,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv) bool ssd = false; bool zoned = false; bool force_overwrite = false; - int oflags; char *source_dir = NULL; bool source_dir_set = false; bool shrink_rootdir = false; @@ -1006,6 +1040,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv) u64 shrink_size; int dev_cnt = 0; int saved_optind; + pthread_t *t_prepare = NULL; + struct prepare_device_progress *prepare_ctx = NULL; char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 }; u64 features = BTRFS_MKFS_DEFAULT_FEATURES; u64 runtime_features = BTRFS_MKFS_DEFAULT_RUNTIME_FEATURES; @@ -1428,29 +1464,45 @@ int BOX_MAIN(mkfs)(int argc, char **argv) goto error; } - dev_cnt--; - - oflags = O_RDWR; - if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) - oflags |= O_DIRECT; + t_prepare = malloc(dev_cnt * sizeof(*t_prepare)); + prepare_ctx = malloc(dev_cnt * sizeof(*prepare_ctx)); - /* - * Open without O_EXCL so that the problem should not occur by the - * following operation in kernel: - * (btrfs_register_one_device() fails if O_EXCL is on) - */ - fd = open(file, oflags); - if (fd < 0) { - error("unable to open %s: %m", file); + if (!t_prepare || !prepare_ctx) { + error("unable to prepare dev"); goto error; } - ret = btrfs_prepare_device(fd, file, &dev_block_count, block_count, - (zero_end ? PREP_DEVICE_ZERO_END : 0) | - (discard ? PREP_DEVICE_DISCARD : 0) | - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | - (zoned ? 
PREP_DEVICE_ZONED : 0)); + + for (i = 0; i < dev_cnt; i++) { + prepare_ctx[i].file = argv[optind + i - 1]; + prepare_ctx[i].block_count = block_count; + prepare_ctx[i].dev_block_count = block_count; + prepare_ctx[i].zero_end = zero_end; + prepare_ctx[i].discard = discard; + prepare_ctx[i].zoned = zoned; + if (i == 0) { + prepare_ctx[i].oflags = O_RDWR; + /* + * Open without O_EXCL so that the problem should + * not occur by the following operation in kernel: + * (btrfs_register_one_device() fails if O_EXCL is on) + */ + if (zoned && zoned_model(file) == ZONED_HOST_MANAGED) + prepare_ctx[i].oflags = O_RDWR | O_DIRECT; + } else { + prepare_ctx[i].oflags = O_RDWR; + } + ret = pthread_create(&t_prepare[i], NULL, + prepare_one_dev, &prepare_ctx[i]); + } + pthread_join(t_prepare[0], NULL); + ret = prepare_ctx[0].ret; + if (ret) goto error; + + dev_cnt--; + fd = open(file, prepare_ctx[0].oflags); + dev_block_count = prepare_ctx[0].dev_block_count; if (block_count && block_count > dev_block_count) { error("%s is smaller than requested size, expected %llu, found %llu", file, (unsigned long long)block_count, @@ -1459,7 +1511,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv) } /* To create the first block group and chunk 0 in make_btrfs */ - system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; + system_group_size = zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE; if (dev_block_count < system_group_size) { error("device is too small to make filesystem, must be at least %llu", (unsigned long long)system_group_size); @@ -1557,6 +1609,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv) if (dev_cnt == 0) goto raid_groups; + for (i = 0 ; i < dev_cnt; i++) { + pthread_join(t_prepare[i+1], NULL); + if (prepare_ctx[i+1].ret) { + goto error; + } + } while (dev_cnt-- > 0) { file = argv[optind++]; @@ -1578,12 +1636,9 @@ int BOX_MAIN(mkfs)(int argc, char **argv) close(fd); continue; } - ret = btrfs_prepare_device(fd, file, &dev_block_count, - block_count, - (bconf.verbose ? PREP_DEVICE_VERBOSE : 0) | - (zero_end ? PREP_DEVICE_ZERO_END : 0) | - (discard ? PREP_DEVICE_DISCARD : 0) | - (zoned ? PREP_DEVICE_ZONED : 0)); + dev_block_count = prepare_ctx[argc - saved_optind - dev_cnt - 1] + .dev_block_count; + if (ret) { goto error; } @@ -1763,12 +1818,16 @@ out: btrfs_close_all_devices(); free(label); - + free(t_prepare); + free(prepare_ctx); return !!ret; + error: if (fd > 0) close(fd); + free(t_prepare); + free(prepare_ctx); free(label); exit(1); success:
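As a reference for the "wait for them all" variant Qu suggests in the review, the two join sites in the diff above could collapse into a single loop, roughly like this (a sketch only, keeping the patch's t_prepare/prepare_ctx arrays and error label):

    ret = 0;
    for (i = 0; i < dev_cnt; i++) {
            pthread_join(t_prepare[i], NULL);
            /* keep the first error but still reap every thread */
            if (prepare_ctx[i].ret && !ret)
                    ret = prepare_ctx[i].ret;
    }
    if (ret)
            goto error;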
[enhancement] When a disk is formatted as btrfs, it calls btrfs_prepare_device for each device, which takes too much time. [implementation] Run each btrfs_prepare_device call in its own thread, wait for the first thread to complete before creating the filesystem, and wait for the other threads to complete before adding the remaining devices to the filesystem. [test] Using the btrfs-progs mkfs-tests test cases, mkfs.btrfs works fine. But I don't have an actual zoned device, so I don't know how much time this saves. If you guys have a way to test it, please let me know. Signed-off-by: Li Zhang <zhanglikernel@gmail.com> --- Issue: 496 mkfs/main.c | 113 +++++++++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 86 insertions(+), 27 deletions(-)