diff mbox series

btrfs-progs: mkfs: use flock to prevent udev race

Message ID 61a44fa98c640d93a77c14a7f617f5d68f166002.1678755426.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs-progs: mkfs: use flock to prevent udev race | expand

Commit Message

Qu Wenruo March 14, 2023, 12:57 a.m. UTC
[BUG]
There is an internal bug report that, after mkfs.btrfs there is some
random cases where the uuid soft link (/dev/disk/by-uuid) is not
created.

[CAUSE]
The soft link is created by udev, which monitor those block devices by
listening to the inotify.

But during the scan for filesystems, inotify would be temporarily
disabled until the scan is done.

However btrfs split the mkfs into several parts, leaving a window for
udev to got half backed result:

- Disk prepare
  This involves discarding the whole disk (by defualt) or previous
  superblock (-k option).

  After the prepare is done, we close the fd of each device, which would
  trigger udev scan on the block device, without any btrfs superblock
  signature.

- Temporary btrfs creation
  Here we create an initial btrfs image on the first device, using
  the first device, and then close the fd.

  This would also trigger udev scan, but the temporary superblock
  contains an invalid magic number.

- Real btrfs creation
  Here we call open_ctree() on the initial btrfs, and make it to be
  a proper btrfs.
  Then we call close_ctree() to finalize the fs, writing back the real
  super blocks and close the devices.

  This is the normal even triggering udev to detect new btrfs and create
  the soft link.

However if the first two steps triggered a long enough scan that covers
the last step, the last step itself can not trigger a scan, thus udev
only got an empty or invalid btrfs super block, and won't create the
uuid soft link.

[FIX]
To address the problem, follow the advice from systemd
[BLOCK_DEVICE_LOCKING] section, using flock() to lock an fd of each
device.

And only close the locked fd after all operation is done.

Here we re-use the prepare_ctx[] array, and saves the fd until the end
of mkfs.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 mkfs/main.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

Comments

Qu Wenruo March 15, 2023, 5 a.m. UTC | #1
Please discard this patch.

There is some extra limits on the flock() usage, especially for dm devices.

- One has to lock the root block device, not the partition one

- flock() on dm devices are mostly ignored

I'll update the patch to a new method to address the problem.

Thanks,
Qu

On 2023/3/14 08:57, Qu Wenruo wrote:
> [BUG]
> There is an internal bug report that, after mkfs.btrfs there is some
> random cases where the uuid soft link (/dev/disk/by-uuid) is not
> created.
> 
> [CAUSE]
> The soft link is created by udev, which monitor those block devices by
> listening to the inotify.
> 
> But during the scan for filesystems, inotify would be temporarily
> disabled until the scan is done.
> 
> However btrfs split the mkfs into several parts, leaving a window for
> udev to got half backed result:
> 
> - Disk prepare
>    This involves discarding the whole disk (by defualt) or previous
>    superblock (-k option).
> 
>    After the prepare is done, we close the fd of each device, which would
>    trigger udev scan on the block device, without any btrfs superblock
>    signature.
> 
> - Temporary btrfs creation
>    Here we create an initial btrfs image on the first device, using
>    the first device, and then close the fd.
> 
>    This would also trigger udev scan, but the temporary superblock
>    contains an invalid magic number.
> 
> - Real btrfs creation
>    Here we call open_ctree() on the initial btrfs, and make it to be
>    a proper btrfs.
>    Then we call close_ctree() to finalize the fs, writing back the real
>    super blocks and close the devices.
> 
>    This is the normal even triggering udev to detect new btrfs and create
>    the soft link.
> 
> However if the first two steps triggered a long enough scan that covers
> the last step, the last step itself can not trigger a scan, thus udev
> only got an empty or invalid btrfs super block, and won't create the
> uuid soft link.
> 
> [FIX]
> To address the problem, follow the advice from systemd
> [BLOCK_DEVICE_LOCKING] section, using flock() to lock an fd of each
> device.
> 
> And only close the locked fd after all operation is done.
> 
> Here we re-use the prepare_ctx[] array, and saves the fd until the end
> of mkfs.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   mkfs/main.c | 24 +++++++++++++++++++-----
>   1 file changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/mkfs/main.c b/mkfs/main.c
> index f5e34cbda612..e91311bf6eb4 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -18,6 +18,7 @@
>   
>   #include "kerncompat.h"
>   #include <sys/stat.h>
> +#include <sys/file.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <fcntl.h>
> @@ -75,6 +76,7 @@ struct prepare_device_progress {
>   	char *file;
>   	u64 dev_block_count;
>   	u64 block_count;
> +	int fd;
>   	int ret;
>   };
>   
> @@ -965,22 +967,26 @@ fail:
>   static void *prepare_one_device(void *ctx)
>   {
>   	struct prepare_device_progress *prepare_ctx = ctx;
> -	int fd;
>   
> -	fd = open(prepare_ctx->file, opt_oflags);
> -	if (fd < 0) {
> +	prepare_ctx->fd = open(prepare_ctx->file, opt_oflags);
> +	if (prepare_ctx->fd < 0) {
>   		error("unable to open %s: %m", prepare_ctx->file);
>   		prepare_ctx->ret = -errno;
>   		return NULL;
>   	}
> -	prepare_ctx->ret = btrfs_prepare_device(fd, prepare_ctx->file,
> +	/*
> +	 * Take flock() to prevent udev got a half backed scan.
> +	 * This is only an advisory operation, thus no need to handle
> +	 * its failure.
> +	 */
> +	flock(prepare_ctx->fd, LOCK_EX | LOCK_NB);
> +	prepare_ctx->ret = btrfs_prepare_device(prepare_ctx->fd, prepare_ctx->file,
>   				&prepare_ctx->dev_block_count,
>   				prepare_ctx->block_count,
>   				(bconf.verbose ? PREP_DEVICE_VERBOSE : 0) |
>   				(opt_zero_end ? PREP_DEVICE_ZERO_END : 0) |
>   				(opt_discard ? PREP_DEVICE_DISCARD : 0) |
>   				(opt_zoned ? PREP_DEVICE_ZONED : 0));
> -	close(fd);
>   
>   	return NULL;
>   }
> @@ -1002,6 +1008,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>   	u64 min_dev_size;
>   	u64 shrink_size;
>   	int device_count = 0;
> +	int orig_device_count;
>   	int saved_optind;
>   	pthread_t *t_prepare = NULL;
>   	struct prepare_device_progress *prepare_ctx = NULL;
> @@ -1217,6 +1224,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>   	stripesize = sectorsize;
>   	saved_optind = optind;
>   	device_count = argc - optind;
> +	orig_device_count = device_count;
>   	if (device_count == 0)
>   		usage(&mkfs_cmd, 1);
>   
> @@ -1498,6 +1506,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>   
>   	/* Start threads */
>   	for (i = 0; i < device_count; i++) {
> +		prepare_ctx[i].fd = -1;
>   		prepare_ctx[i].file = argv[optind + i - 1];
>   		prepare_ctx[i].block_count = block_count;
>   		prepare_ctx[i].dev_block_count = block_count;
> @@ -1838,6 +1847,8 @@ out:
>   	}
>   
>   	btrfs_close_all_devices();
> +	for (i = 0; i < orig_device_count; i++)
> +		close(prepare_ctx[i].fd);
>   	free(t_prepare);
>   	free(prepare_ctx);
>   	free(label);
> @@ -1849,6 +1860,9 @@ error:
>   	if (fd > 0)
>   		close(fd);
>   
> +	if (prepare_ctx)
> +		for (i = 0; i < orig_device_count; i++)
> +			close(prepare_ctx[i].fd);
>   	free(t_prepare);
>   	free(prepare_ctx);
>   	free(label);
diff mbox series

Patch

diff --git a/mkfs/main.c b/mkfs/main.c
index f5e34cbda612..e91311bf6eb4 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -18,6 +18,7 @@ 
 
 #include "kerncompat.h"
 #include <sys/stat.h>
+#include <sys/file.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <fcntl.h>
@@ -75,6 +76,7 @@  struct prepare_device_progress {
 	char *file;
 	u64 dev_block_count;
 	u64 block_count;
+	int fd;
 	int ret;
 };
 
@@ -965,22 +967,26 @@  fail:
 static void *prepare_one_device(void *ctx)
 {
 	struct prepare_device_progress *prepare_ctx = ctx;
-	int fd;
 
-	fd = open(prepare_ctx->file, opt_oflags);
-	if (fd < 0) {
+	prepare_ctx->fd = open(prepare_ctx->file, opt_oflags);
+	if (prepare_ctx->fd < 0) {
 		error("unable to open %s: %m", prepare_ctx->file);
 		prepare_ctx->ret = -errno;
 		return NULL;
 	}
-	prepare_ctx->ret = btrfs_prepare_device(fd, prepare_ctx->file,
+	/*
+	 * Take flock() to prevent udev got a half backed scan.
+	 * This is only an advisory operation, thus no need to handle
+	 * its failure.
+	 */
+	flock(prepare_ctx->fd, LOCK_EX | LOCK_NB);
+	prepare_ctx->ret = btrfs_prepare_device(prepare_ctx->fd, prepare_ctx->file,
 				&prepare_ctx->dev_block_count,
 				prepare_ctx->block_count,
 				(bconf.verbose ? PREP_DEVICE_VERBOSE : 0) |
 				(opt_zero_end ? PREP_DEVICE_ZERO_END : 0) |
 				(opt_discard ? PREP_DEVICE_DISCARD : 0) |
 				(opt_zoned ? PREP_DEVICE_ZONED : 0));
-	close(fd);
 
 	return NULL;
 }
@@ -1002,6 +1008,7 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 	u64 min_dev_size;
 	u64 shrink_size;
 	int device_count = 0;
+	int orig_device_count;
 	int saved_optind;
 	pthread_t *t_prepare = NULL;
 	struct prepare_device_progress *prepare_ctx = NULL;
@@ -1217,6 +1224,7 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 	stripesize = sectorsize;
 	saved_optind = optind;
 	device_count = argc - optind;
+	orig_device_count = device_count;
 	if (device_count == 0)
 		usage(&mkfs_cmd, 1);
 
@@ -1498,6 +1506,7 @@  int BOX_MAIN(mkfs)(int argc, char **argv)
 
 	/* Start threads */
 	for (i = 0; i < device_count; i++) {
+		prepare_ctx[i].fd = -1;
 		prepare_ctx[i].file = argv[optind + i - 1];
 		prepare_ctx[i].block_count = block_count;
 		prepare_ctx[i].dev_block_count = block_count;
@@ -1838,6 +1847,8 @@  out:
 	}
 
 	btrfs_close_all_devices();
+	for (i = 0; i < orig_device_count; i++)
+		close(prepare_ctx[i].fd);
 	free(t_prepare);
 	free(prepare_ctx);
 	free(label);
@@ -1849,6 +1860,9 @@  error:
 	if (fd > 0)
 		close(fd);
 
+	if (prepare_ctx)
+		for (i = 0; i < orig_device_count; i++)
+			close(prepare_ctx[i].fd);
 	free(t_prepare);
 	free(prepare_ctx);
 	free(label);