[V2] mkfs: increase the minimum log size to 64MB when possible

Message ID a8bc42f2-98db-2f16-2879-9ed62415ba01@redhat.com (mailing list archive)
State Accepted, archived

Commit Message

Eric Sandeen April 4, 2022, 11:08 p.m. UTC
Recently, the upstream maintainers have been taking a lot of heat on
account of writer threads encountering high latency when asking for log
grant space when the log is small.  The reported use case is a heavily
threaded indexing product logging trace information to a filesystem
ranging in size between 20 and 250GB.  The meetings that result from the
complaints about latency and stall warnings in dmesg both from this use
case and also a large well known cloud product are now consuming 25% of
the maintainer's weekly time and have been for months.

For small filesystems, the log is small by default because we have
defaulted to a ratio of 1:2048 (or even less).  For grown filesystems,
this is even worse, because big filesystems generate big metadata.
However, the log size is still insufficient even if it is formatted at
the larger size.

On a 220GB filesystem, the 99.95% latencies observed with a 200-writer
file synchronous append workload running on a 44-AG filesystem (with 44
CPUs) spread across 4 hard disks showed:

	99.5%
Log(MB)	Latency(ms)	BW (MB/s)	xlog_grant_head_wait
10	520		243		1875
20	220		308		540
40	140		360		6
80	92		363		0
160	86		364		0

For 4 NVME, the results were:

10	201		409		898
20	177		488		144
40	122		550		0
80	120		549		0
160	121		545		0

This shows pretty clearly that we could reduce the amount of time that
threads spend waiting on the XFS log by increasing the log size to at
least 40MB regardless of size.  We then repeated the benchmark with a
cloud system and an old machine to see if there were any ill effects on
less stable hardware.

For cloudy iscsi block storage, the results were:

10	390		176		2584
20	173		186		357
40	37		187		0
80	40		183		0
160	37		183		0

A decade-old machine w/ 24 CPUs and a giant spinning disk RAID6 array
produced this:

10	55		5.4		0
20	40		5.9		0
40	62		5.7		0
80	66		5.7		0
160	25		5.4		0

From the first three scenarios, it is clear that there are gains to be
had by sizing the log somewhere between 40 and 80MB -- the long tail
latency drops quite a bit, and programs are no longer blocking on the
log's transaction space grant heads.  Split the difference and set the
log size floor to 64MB.

Inspired-by: Darrick J. Wong <djwong@kernel.org>
Commit-log-stolen-from: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

This is reworked, with dependencies on other patches removed; details in
followup emails.
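
As a rough illustration of the 1:2048 ratio mentioned in the commit
message, the arithmetic can be sketched as below. This is only the ratio
itself; the helper name ratio_log_bytes() is made up for this example and
none of mkfs's other limits are modeled.

```c
#include <stdint.h>

/*
 * Log size, in bytes, implied by a 2048:1 filesystem-to-log ratio.
 * Hypothetical helper for illustration; not part of xfsprogs.
 */
static uint64_t
ratio_log_bytes(uint64_t fs_bytes)
{
	return fs_bytes / 2048;
}
```

With this ratio alone, a 20GB filesystem gets a 10MB log and a 250GB
filesystem gets a 125MB log, so the low end of the reported 20-250GB range
sits at the 10MB row of the benchmark tables.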

Comments

Eric Sandeen April 4, 2022, 11:31 p.m. UTC | #1
For starters, I know the lack of if / else if in the code is a little
ugly, but smashing into 80cols was uglier...

Here are the changes in log size for various filesystem geometries
(differing block sizes and filesystem sizes, with and without stripe
geometry to increase AG count). "--" means mkfs failed.

Blocksize: 4096
	|	orig		|	new
size	|	log	striped	|	log	striped
-------------------------------------------------------
128m	|	5m	m	|	5m	m
256m	|	5m	18m	|	5m	18m
511m	|	5m	18m	|	5m	18m
512m	|	5m	18m	|	64m	18m
513m	|	5m	18m	|	64m	64m
1024m	|	10m	18m	|	64m	64m
2047m	|	10m	18m	|	64m	64m
2048m	|	10m	18m	|	64m	64m
2049m	|	10m	18m	|	64m	64m
4g	|	10m	20m	|	64m	64m
8g	|	10m	20m	|	64m	64m
15g	|	10m	20m	|	64m	64m
16g	|	10m	20m	|	64m	64m
17g	|	10m	20m	|	64m	64m
32g	|	16m	20m	|	64m	64m
64g	|	32m	32m	|	64m	64m
256g	|	128m	128m	|	128m	128m
512g	|	256m	256m	|	256m	256m
1t	|	512m	512m	|	512m	512m
2t	|	1024m	1024m	|	1024m	1024m
4t	|	2038m	2038m	|	2038m	2038m
8t	|	2038m	2038m	|	2038m	2038m

Blocksize: 1024
	|	orig		|	new
size	|	log	striped	|	log	striped
------------------------------------------------------------------------------
128m	|	3m	15m	|	3m	15m
256m	|	3m	15m	|	3m	15m
511m	|	3m	15m	|	3m	15m
512m	|	3m	15m	|	64m	15m
513m	|	3m	15m	|	64m	64m
1024m	|	10m	15m	|	64m	64m
2047m	|	10m	16m	|	64m	64m
2048m	|	10m	16m	|	64m	64m
2049m	|	10m	16m	|	64m	64m
4g	|	10m	16m	|	64m	64m
8g	|	10m	16m	|	64m	64m
15g	|	10m	16m	|	64m	64m
16g	|	10m	16m	|	64m	64m
17g	|	10m	16m	|	64m	64m
32g	|	16m	16m	|	64m	64m
64g	|	32m	32m	|	64m	64m
256g	|	128m	128m	|	128m	128m
512g	|	256m	256m	|	256m	256m
1t	|	512m	512m	|	512m	512m
2t	|	1024m	1024m	|	1024m	1024m
4t	|	1024m	1024m	|	1024m	1024m
8t	|	1024m	1024m	|	1024m	1024m

Blocksize: 65536
	|	orig		|	new
size	|	log	striped	|	log	striped
------------------------------------------------------------------------------
128m	|	--	--	|	--	--
256m	|	32m	--	|	32m	--
511m	|	32m	32m	|	32m	32m
512m	|	32m	32m	|	64m	32m
513m	|	32m	32m	|	64m	63m
1024m	|	32m	32m	|	64m	64m
2047m	|	56m	45m	|	64m	64m
2048m	|	56m	45m	|	64m	64m
2049m	|	56m	45m	|	64m	64m
4g	|	56m	69m	|	64m	69m
8g	|	56m	69m	|	64m	69m
15g	|	56m	69m	|	64m	69m
16g	|	56m	69m	|	64m	69m
17g	|	56m	69m	|	64m	69m
32g	|	56m	69m	|	64m	69m
64g	|	56m	69m	|	64m	69m
256g	|	128m	128m	|	128m	128m
512g	|	256m	256m	|	256m	256m
1t	|	512m	512m	|	512m	512m
2t	|	1024m	1024m	|	1024m	1024m
4t	|	2038m	2038m	|	2038m	2038m
8t	|	2038m	2038m	|	2038m	2038m
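
For reference, the unstriped "orig" column of the 4096-byte table can be
approximated from the sizing rules that the patch removes. The following is
a minimal sketch, not the mkfs code itself. Assumptions: min_logblocks is
passed in as a plain number (mkfs derives it from the transaction
reservations), ASSUMED_MIN_LOG_BYTES stands in for XFS_MIN_LOG_BYTES and is
taken to be 10MB based on the 10m entries above, and any other clamping
mkfs applies is left out. It does not attempt to reproduce the striped
columns or the 65536-byte table.

```c
#include <assert.h>
#include <stdint.h>

#define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
#define XFS_DFL_LOG_FACTOR	5
/* Assumption: 10MB, inferred from the 10m entries in the tables. */
#define ASSUMED_MIN_LOG_BYTES	(10ULL * 1024 * 1024)

/* Sketch of the old default log size, in filesystem blocks. */
static uint64_t
old_default_logblocks(uint64_t dblocks, int blocklog, uint64_t min_logblocks)
{
	uint64_t	logblocks;

	/* Block sizes from 512 bytes to 64k keep the shifts in range. */
	assert(blocklog >= 9 && blocklog <= 16);

	if (dblocks < GIGABYTES(1, blocklog)) {
		/* Tiny filesystems got the minimum sized log. */
		logblocks = min_logblocks;
	} else if (dblocks < GIGABYTES(16, blocklog)) {
		/* Below 16GB: smaller of 10MB and 5x the minimum. */
		uint64_t	by_bytes = ASSUMED_MIN_LOG_BYTES >> blocklog;
		uint64_t	by_factor = min_logblocks * XFS_DFL_LOG_FACTOR;

		logblocks = by_bytes < by_factor ? by_bytes : by_factor;
	} else {
		/* 16GB and up: 2048:1 fs:log ratio. */
		logblocks = ((dblocks << blocklog) / 2048) >> blocklog;
	}

	/* Never go below the minimum. */
	return logblocks > min_logblocks ? logblocks : min_logblocks;
}
```

With blocklog 12 and min_logblocks set to 1280 blocks (5MB, taken from the
smallest rows of the table), this gives 5MB at 512m, 10MB at 4g and 16MB at
32g.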

Darrick J. Wong April 5, 2022, 12:55 a.m. UTC | #2
On Mon, Apr 04, 2022 at 06:08:28PM -0500, Eric Sandeen wrote:
> Recently, the upstream maintainers have been taking a lot of heat on
> account of writer threads encountering high latency when asking for log
> grant space when the log is small.  The reported use case is a heavily
> threaded indexing product logging trace information to a filesystem
> ranging in size between 20 and 250GB.  The meetings that result from the
> complaints about latency and stall warnings in dmesg both from this use
> case and also a large well known cloud product are now consuming 25% of
> the maintainer's weekly time and have been for months.
> 
> For small filesystems, the log is small by default because we have
> defaulted to a ratio of 1:2048 (or even less).  For grown filesystems,
> this is even worse, because big filesystems generate big metadata.
> However, the log size is still insufficient even if it is formatted at
> the larger size.
> 
> On a 220GB filesystem, the 99.95% latencies observed with a 200-writer
> file synchronous append workload running on a 44-AG filesystem (with 44
> CPUs) spread across 4 hard disks showed:
> 
> 	99.5%
> Log(MB)	Latency(ms)	BW (MB/s)	xlog_grant_head_wait
> 10	520		243		1875
> 20	220		308		540
> 40	140		360		6
> 80	92		363		0
> 160	86		364		0
> 
> For 4 NVME, the results were:
> 
> 10	201		409		898
> 20	177		488		144
> 40	122		550		0
> 80	120		549		0
> 160	121		545		0
> 
> This shows pretty clearly that we could reduce the amount of time that
> threads spend waiting on the XFS log by increasing the log size to at
> least 40MB regardless of size.  We then repeated the benchmark with a
> cloud system and an old machine to see if there were any ill effects on
> less stable hardware.
> 
> For cloudy iscsi block storage, the results were:
> 
> 10	390		176		2584
> 20	173		186		357
> 40	37		187		0
> 80	40		183		0
> 160	37		183		0
> 
> A decade-old machine w/ 24 CPUs and a giant spinning disk RAID6 array
> produced this:
> 
> 10	55		5.4		0
> 20	40		5.9		0
> 40	62		5.7		0
> 80	66		5.7		0
> 160	25		5.4		0
> 
> From the first three scenarios, it is clear that there are gains to be
> had by sizing the log somewhere between 40 and 80MB -- the long tail
> latency drops quite a bit, and programs are no longer blocking on the
> log's transaction space grant heads.  Split the difference and set the
> log size floor to 64MB.
> 
> Inspired-by: Darrick J. Wong <djwong@kernel.org>
> Commit-log-stolen-from: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> This is reworked, with dependencies on other patches removed; details in
> followup emails.
> 
> diff --git a/include/xfs_multidisk.h b/include/xfs_multidisk.h
> index a16a9fe2..ef4443b0 100644
> --- a/include/xfs_multidisk.h
> +++ b/include/xfs_multidisk.h
> @@ -17,8 +17,6 @@
>  #define	XFS_MIN_INODE_PERBLOCK	2		/* min inodes per block */
>  #define	XFS_DFL_IMAXIMUM_PCT	25		/* max % of space for inodes */
>  #define	XFS_MIN_REC_DIRSIZE	12		/* 4096 byte dirblocks (V2) */
> -#define	XFS_DFL_LOG_FACTOR	5		/* default log size, factor */
> -						/* with max trans reservation */
>  #define XFS_MAX_INODE_SIG_BITS	32		/* most significant bits in an
>  						 * inode number that we'll
>  						 * accept w/o warnings
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 96682f9a..e36c1083 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -18,6 +18,14 @@
>  #define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
>  #define MEGABYTES(count, blog)	((uint64_t)(count) << (20 - (blog)))
>  
> +/*
> + * Realistically, the log should never be smaller than 64MB.  Studies by the
> + * kernel maintainer in early 2022 have shown a dramatic reduction in long tail
> + * latency of the xlog grant head waitqueue when running a heavy metadata
> + * update workload when the log size is at least 64MB.
> + */
> +#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog)	(MEGABYTES(64, (blog)))
> +
>  /*
>   * Use this macro before we have superblock and mount structure to
>   * convert from basic blocks to filesystem blocks.
> @@ -3266,7 +3274,7 @@ calculate_log_size(
>  	struct xfs_mount	*mp)
>  {
>  	struct xfs_sb		*sbp = &mp->m_sb;
> -	int			min_logblocks;
> +	int			min_logblocks;	/* absolute minimum */
>  	struct xfs_mount	mount;
>  
>  	/* we need a temporary mount to calculate the minimum log size. */
> @@ -3308,28 +3316,17 @@ _("external log device size %lld blocks too small, must be at least %lld blocks\
>  
>  	/* internal log - if no size specified, calculate automatically */
>  	if (!cfg->logblocks) {
> -		if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
> -			/* tiny filesystems get minimum sized logs. */
> -			cfg->logblocks = min_logblocks;
> -		} else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
> +		/* Use a 2048:1 fs:log ratio for most filesystems */
> +		cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
> +		cfg->logblocks = cfg->logblocks >> cfg->blocklog;
>  
> -			/*
> -			 * For small filesystems, we want to use the
> -			 * XFS_MIN_LOG_BYTES for filesystems smaller than 16G if
> -			 * at all possible, ramping up to 128MB at 256GB.
> -			 */
> -			cfg->logblocks = min(XFS_MIN_LOG_BYTES >> cfg->blocklog,
> -					min_logblocks * XFS_DFL_LOG_FACTOR);
> -		} else {
> -			/*
> -			 * With a 2GB max log size, default to maximum size
> -			 * at 4TB. This keeps the same ratio from the older
> -			 * max log size of 128M at 256GB fs size. IOWs,
> -			 * the ratio of fs size to log size is 2048:1.
> -			 */
> -			cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
> -			cfg->logblocks = cfg->logblocks >> cfg->blocklog;
> -		}
> +		/* But don't go below a reasonable size */
> +		cfg->logblocks = max(cfg->logblocks,
> +				XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
> +
> +		/* And for a tiny filesystem, use the absolute minimum size */
> +		if (cfg->dblocks < MEGABYTES(512, cfg->blocklog))
> +			cfg->logblocks = min_logblocks;

Heh, I was going to apply this to any filesystem under 300MB (and then
cut everyone off at 300M) but I suppose if you'd rather set that at 512M
then I'm not going to complain... maybe we're better off not creating
absurd things like 20% of a tiny FS used for logs. :D

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  
>  		/* Ensure the chosen size meets minimum log size requirements */
>  		cfg->logblocks = max(min_logblocks, cfg->logblocks);
>

Dave Chinner April 5, 2022, 5:36 a.m. UTC | #3
On Mon, Apr 04, 2022 at 06:31:03PM -0500, Eric Sandeen wrote:
> For starters, I know the lack of if / else if in the code is a little
> ugly, but smashing into 80cols was uglier...
> 
> Here are the changes in log size for various filesystem geometries
> (differing block sizes and filesystem sizes, with and without stripe
> geometry to increase AG count). "--" means mkfs failed.
> 
> Blocksize: 4096
> 	|	orig		|	new
> size	|	log	striped	|	log	striped
> -------------------------------------------------------
> 128m	|	5m	m	|	5m	m
> 256m	|	5m	18m	|	5m	18m
> 511m	|	5m	18m	|	5m	18m
> 512m	|	5m	18m	|	64m	18m
> 513m	|	5m	18m	|	64m	64m
> 1024m	|	10m	18m	|	64m	64m
> 2047m	|	10m	18m	|	64m	64m
> 2048m	|	10m	18m	|	64m	64m
> 2049m	|	10m	18m	|	64m	64m
> 4g	|	10m	20m	|	64m	64m
> 8g	|	10m	20m	|	64m	64m
> 15g	|	10m	20m	|	64m	64m
> 16g	|	10m	20m	|	64m	64m
> 17g	|	10m	20m	|	64m	64m
> 32g	|	16m	20m	|	64m	64m
> 64g	|	32m	32m	|	64m	64m
> 256g	|	128m	128m	|	128m	128m
> 512g	|	256m	256m	|	256m	256m
> 1t	|	512m	512m	|	512m	512m
> 2t	|	1024m	1024m	|	1024m	1024m
> 4t	|	2038m	2038m	|	2038m	2038m
> 8t	|	2038m	2038m	|	2038m	2038m
> 
> Blocksize: 1024
> 	|	orig		|	new
> size	|	log	striped	|	log	striped
> ------------------------------------------------------------------------------
> 128m	|	3m	15m	|	3m	15m
> 256m	|	3m	15m	|	3m	15m
> 511m	|	3m	15m	|	3m	15m
> 512m	|	3m	15m	|	64m	15m
> 513m	|	3m	15m	|	64m	64m
> 1024m	|	10m	15m	|	64m	64m
> 2047m	|	10m	16m	|	64m	64m
> 2048m	|	10m	16m	|	64m	64m
> 2049m	|	10m	16m	|	64m	64m
> 4g	|	10m	16m	|	64m	64m
> 8g	|	10m	16m	|	64m	64m
> 15g	|	10m	16m	|	64m	64m
> 16g	|	10m	16m	|	64m	64m
> 17g	|	10m	16m	|	64m	64m
> 32g	|	16m	16m	|	64m	64m
> 64g	|	32m	32m	|	64m	64m
> 256g	|	128m	128m	|	128m	128m
> 512g	|	256m	256m	|	256m	256m
> 1t	|	512m	512m	|	512m	512m
> 2t	|	1024m	1024m	|	1024m	1024m
> 4t	|	1024m	1024m	|	1024m	1024m
> 8t	|	1024m	1024m	|	1024m	1024m
> 
> Blocksize: 65536
> 	|	orig		|	new
> size	|	log	striped	|	log	striped
> ------------------------------------------------------------------------------
> 128m	|	--	--	|	--	--
> 256m	|	32m	--	|	32m	--
> 511m	|	32m	32m	|	32m	32m
> 512m	|	32m	32m	|	64m	32m
> 513m	|	32m	32m	|	64m	63m
> 1024m	|	32m	32m	|	64m	64m
> 2047m	|	56m	45m	|	64m	64m
> 2048m	|	56m	45m	|	64m	64m
> 2049m	|	56m	45m	|	64m	64m
> 4g	|	56m	69m	|	64m	69m
> 8g	|	56m	69m	|	64m	69m
> 15g	|	56m	69m	|	64m	69m
> 16g	|	56m	69m	|	64m	69m
> 17g	|	56m	69m	|	64m	69m
> 32g	|	56m	69m	|	64m	69m
> 64g	|	56m	69m	|	64m	69m
> 256g	|	128m	128m	|	128m	128m
> 512g	|	256m	256m	|	256m	256m
> 1t	|	512m	512m	|	512m	512m
> 2t	|	1024m	1024m	|	1024m	1024m
> 4t	|	2038m	2038m	|	2038m	2038m
> 8t	|	2038m	2038m	|	2038m	2038m

Those new sizes look good to me.

Acked-by: Dave Chinner <dchinner@redhat.com>

Patch

diff --git a/include/xfs_multidisk.h b/include/xfs_multidisk.h
index a16a9fe2..ef4443b0 100644
--- a/include/xfs_multidisk.h
+++ b/include/xfs_multidisk.h
@@ -17,8 +17,6 @@ 
 #define	XFS_MIN_INODE_PERBLOCK	2		/* min inodes per block */
 #define	XFS_DFL_IMAXIMUM_PCT	25		/* max % of space for inodes */
 #define	XFS_MIN_REC_DIRSIZE	12		/* 4096 byte dirblocks (V2) */
-#define	XFS_DFL_LOG_FACTOR	5		/* default log size, factor */
-						/* with max trans reservation */
 #define XFS_MAX_INODE_SIG_BITS	32		/* most significant bits in an
 						 * inode number that we'll
 						 * accept w/o warnings
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 96682f9a..e36c1083 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -18,6 +18,14 @@ 
 #define GIGABYTES(count, blog)	((uint64_t)(count) << (30 - (blog)))
 #define MEGABYTES(count, blog)	((uint64_t)(count) << (20 - (blog)))
 
+/*
+ * Realistically, the log should never be smaller than 64MB.  Studies by the
+ * kernel maintainer in early 2022 have shown a dramatic reduction in long tail
+ * latency of the xlog grant head waitqueue when running a heavy metadata
+ * update workload when the log size is at least 64MB.
+ */
+#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog)	(MEGABYTES(64, (blog)))
+
 /*
  * Use this macro before we have superblock and mount structure to
  * convert from basic blocks to filesystem blocks.
@@ -3266,7 +3274,7 @@  calculate_log_size(
 	struct xfs_mount	*mp)
 {
 	struct xfs_sb		*sbp = &mp->m_sb;
-	int			min_logblocks;
+	int			min_logblocks;	/* absolute minimum */
 	struct xfs_mount	mount;
 
 	/* we need a temporary mount to calculate the minimum log size. */
@@ -3308,28 +3316,17 @@  _("external log device size %lld blocks too small, must be at least %lld blocks\
 
 	/* internal log - if no size specified, calculate automatically */
 	if (!cfg->logblocks) {
-		if (cfg->dblocks < GIGABYTES(1, cfg->blocklog)) {
-			/* tiny filesystems get minimum sized logs. */
-			cfg->logblocks = min_logblocks;
-		} else if (cfg->dblocks < GIGABYTES(16, cfg->blocklog)) {
+		/* Use a 2048:1 fs:log ratio for most filesystems */
+		cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
+		cfg->logblocks = cfg->logblocks >> cfg->blocklog;
 
-			/*
-			 * For small filesystems, we want to use the
-			 * XFS_MIN_LOG_BYTES for filesystems smaller than 16G if
-			 * at all possible, ramping up to 128MB at 256GB.
-			 */
-			cfg->logblocks = min(XFS_MIN_LOG_BYTES >> cfg->blocklog,
-					min_logblocks * XFS_DFL_LOG_FACTOR);
-		} else {
-			/*
-			 * With a 2GB max log size, default to maximum size
-			 * at 4TB. This keeps the same ratio from the older
-			 * max log size of 128M at 256GB fs size. IOWs,
-			 * the ratio of fs size to log size is 2048:1.
-			 */
-			cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
-			cfg->logblocks = cfg->logblocks >> cfg->blocklog;
-		}
+		/* But don't go below a reasonable size */
+		cfg->logblocks = max(cfg->logblocks,
+				XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
+
+		/* And for a tiny filesystem, use the absolute minimum size */
+		if (cfg->dblocks < MEGABYTES(512, cfg->blocklog))
+			cfg->logblocks = min_logblocks;
 
 		/* Ensure the chosen size meets minimum log size requirements */
 		cfg->logblocks = max(min_logblocks, cfg->logblocks);
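
The new default sizing in the hunk above can be exercised outside of mkfs
with a small standalone sketch. This is an illustration under stated
assumptions rather than the mkfs implementation: new_default_logblocks() is
a made-up name, min_logblocks is passed in as a plain number instead of
being computed from a temporary mount, and any clamping that happens after
this point in calculate_log_size() is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define MEGABYTES(count, blog)	((uint64_t)(count) << (20 - (blog)))
#define XFS_MIN_REALISTIC_LOG_BLOCKS(blog)	(MEGABYTES(64, (blog)))

/* Sketch of the new default log size, in filesystem blocks. */
static uint64_t
new_default_logblocks(uint64_t dblocks, int blocklog, uint64_t min_logblocks)
{
	uint64_t	logblocks;

	/* Block sizes from 512 bytes to 64k keep the shifts in range. */
	assert(blocklog >= 9 && blocklog <= 16);

	/* Use a 2048:1 fs:log ratio for most filesystems */
	logblocks = ((dblocks << blocklog) / 2048) >> blocklog;

	/* But don't go below 64MB */
	if (logblocks < XFS_MIN_REALISTIC_LOG_BLOCKS(blocklog))
		logblocks = XFS_MIN_REALISTIC_LOG_BLOCKS(blocklog);

	/* And for a filesystem under 512MB, use the absolute minimum */
	if (dblocks < MEGABYTES(512, blocklog))
		logblocks = min_logblocks;

	/* Ensure the chosen size meets the minimum log size */
	return logblocks > min_logblocks ? logblocks : min_logblocks;
}
```

For 4096-byte blocks (blocklog 12) and an assumed min_logblocks of 1280
blocks (5MB), this yields the minimum at 511MB, 64MB at exactly 512MB, and
128MB at 256GB, where the 2048:1 ratio takes over from the floor.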