[v3,2/2] xfs: make xfs_log_iovec independent from xfs_log_vec and free it early

Message ID 20240626044909.15060-3-alexjlzheng@tencent.com (mailing list archive)
State Accepted, archived
Series Separate xfs_log_vec/iovec to save memory

Commit Message

Jinliang Zheng June 26, 2024, 4:49 a.m. UTC
From: Jinliang Zheng <alexjlzheng@tencent.com>

When the contents of the xfs_log_vec/xfs_log_iovec combination are
written to iclog, xfs_log_iovec loses its meaning in continuing to exist
in memory, because iclog already has a copy of its contents. We only
need to keep xfs_log_vec that takes up very little memory to find the
xfs_log_item that needs to be added to AIL after we flush the iclog into
the disk log space.

Because xfs_log_iovec dominates most of the memory in the
xfs_log_vec/xfs_log_iovec combination, retaining xfs_log_iovec until
iclog is flushed into the disk log space and releasing together with
xfs_log_vec is a significant waste of memory.

This patch separates the memory of xfs_log_iovec from that of
xfs_log_vec, and releases the memory of xfs_log_iovec in advance to save
memory.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/xfs/xfs_log.c     |  2 ++
 fs/xfs/xfs_log.h     |  8 ++++++--
 fs/xfs/xfs_log_cil.c | 34 ++++++++++++++++++++--------------
 3 files changed, 28 insertions(+), 16 deletions(-)

Comments

Dave Chinner July 1, 2024, 12:51 a.m. UTC | #1
On Wed, Jun 26, 2024 at 12:49:09PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> When the contents of the xfs_log_vec/xfs_log_iovec combination are
> written to iclog, xfs_log_iovec loses its meaning in continuing to exist
> in memory, because iclog already has a copy of its contents. We only
> need to keep xfs_log_vec that takes up very little memory to find the
> xfs_log_item that needs to be added to AIL after we flush the iclog into
> the disk log space.
> 
> Because xfs_log_iovec dominates most of the memory in the
> xfs_log_vec/xfs_log_iovec combination, retaining xfs_log_iovec until
> iclog is flushed into the disk log space and releasing together with
> xfs_log_vec is a significant waste of memory.

Have you measured this? Please provide numbers and the workload that
generates them, because when I did this combined structure the
numbers and performance measured came out decisively on the side of
"almost no difference in memory usage, major performance cost to
doing a second allocation"...

Here's the logic - the iovec array is largely "free" with the larger
data allocation.

------

Look at how the heap is structured - it is in power of 2 slab sizes:

$ grep kmalloc /proc/slabinfo |tail -13
kmalloc-8k           949    976   8192    4    8 : tunables    0    0    0 : slabdata    244    244      0
kmalloc-4k          1706   1768   4096    8    8 : tunables    0    0    0 : slabdata    221    221      0
kmalloc-2k          3252   3312   2048   16    8 : tunables    0    0    0 : slabdata    207    207      0
kmalloc-1k         76110  96192   1024   32    8 : tunables    0    0    0 : slabdata   3006   3006      0
kmalloc-512        71753  98656    512   32    4 : tunables    0    0    0 : slabdata   3083   3083      0
kmalloc-256        71006  71520    256   32    2 : tunables    0    0    0 : slabdata   2235   2235      0
kmalloc-192        10304  10458    192   42    2 : tunables    0    0    0 : slabdata    249    249      0
kmalloc-128         8889   9280    128   32    1 : tunables    0    0    0 : slabdata    290    290      0
kmalloc-96         13583  13902     96   42    1 : tunables    0    0    0 : slabdata    331    331      0
kmalloc-64         63116  64640     64   64    1 : tunables    0    0    0 : slabdata   1010   1010      0
kmalloc-32        552726 582272     32  128    1 : tunables    0    0    0 : slabdata   4549   4549      0
kmalloc-16        444768 445440     16  256    1 : tunables    0    0    0 : slabdata   1740   1740      0
kmalloc-8          18178  18432      8  512    1 : tunables    0    0    0 : slabdata     36     36      0

IOWs, if we do a 260 byte allocation, we get the same sized memory
chunk as a 512 byte allocation as they come from the same slab
cache.
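
The rounding described here can be sketched in userspace C. This is only an illustration, not the kernel's allocator: the helper name kmalloc_bucket() is hypothetical, and the size table is simply copied from the /proc/slabinfo listing above (which also contains the non-power-of-2 96 and 192 byte caches).

```c
#include <assert.h>
#include <stddef.h>

/* Object sizes copied from the slabinfo listing above, smallest first. */
static const size_t kmalloc_sizes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
};

/*
 * Hypothetical helper: return the object size of the smallest listed
 * slab cache that fits @bytes, or 0 if the request is larger than the
 * largest listed cache.
 */
static size_t kmalloc_bucket(size_t bytes)
{
	size_t i;

	assert(bytes > 0);
	for (i = 0; i < sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]); i++) {
		if (bytes <= kmalloc_sizes[i])
			return kmalloc_sizes[i];
	}
	return 0;
}
```

Under that assumption, kmalloc_bucket(260) and kmalloc_bucket(512) return the same value, which is the point being made about the 512 byte cache.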

If we now look at structure sizes - the common ones are buffers
and inodes so we'll look at them.

For an inode, we typically log something like this for an extent
allocation (or free) on a mostly contiguous inode (say less than 10
extents):

vec 1:	inode log format item
vec 2:	inode core
vec 3:	inode data fork

Each of these vectors has a 12 byte log op header built into them,
and some padding to round them out to 8 byte alignment.

vec 1:	inode log format item:	12 + 56 + 4 (pad)
vec 2:	inode core:		12 + 176 + 4 (pad)
vec 3:	inode data fork:	12 + 16 (minimum) + 4 (pad)
				12 + 336 (maximum for 512 byte inode)

If we are just logging the inode core, we are allocating
12 + 56 + 4 + 12 + 176 + 4 = 264 bytes.

It should be obvious now that this must be allocated from the 512
byte slab, and that means we have another 248 bytes of unused space
in that allocated region we can actually use -for free-.

IOWs, the fact that we add 32 bytes for the 2 iovecs to index
this inode log item doesn't matter at all - it's free space on the
heap. Indeed, it's not until the inode data fork gets to a couple of
hundred bytes in length that we overflow the 512 byte slab and have
to use the 1kB slab. Again, we get the iovec array space for free.

If we are logging the entire inode with the data fork, then the
size of the data being logged is 264 + 12 + 336 + 4 = 616 bytes.
This is well over the 512 byte slab, so we are always going to be
allocating from the 1kB slab. We get the iovec array for free the
moment we go over the 512 byte threshold again.

IOWs, all the separation of the iovec array does is slightly change
the data/attr fork size thresholds where we go from using the 512
byte slab to the 1kB slab.
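
The inode arithmetic above can be checked with a small userspace sketch. All names here are hypothetical. The byte counts are the ones quoted in this mail (12 byte op header, 4 byte pad, 16 bytes per iovec, i.e. 32 bytes for 2), and the size of the xfs_log_vec header is left out, as it is in the figures above.

```c
#include <assert.h>
#include <stddef.h>

#define OPHDR_BYTES	12	/* log op header per vector, per the mail */
#define PAD_BYTES	4	/* alignment padding per vector, per the mail */
#define IOVEC_BYTES	16	/* "32 bytes for the 2 iovecs" */

/* Smallest slab cache, out of the sizes listed earlier, that fits @bytes. */
static size_t slab_for(size_t bytes)
{
	static const size_t sizes[] = {
		8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
	};
	size_t i;

	assert(bytes > 0);
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		if (bytes <= sizes[i])
			return sizes[i];
	}
	return 0;	/* larger than any listed cache */
}

/* Inode log format item (56 bytes) plus inode core (176 bytes). */
static size_t inode_core_bytes(void)
{
	return (OPHDR_BYTES + 56 + PAD_BYTES) +
	       (OPHDR_BYTES + 176 + PAD_BYTES);
}

/* As above, plus the maximum data fork for a 512 byte inode (336 bytes). */
static size_t inode_full_bytes(void)
{
	return inode_core_bytes() + OPHDR_BYTES + 336 + PAD_BYTES;
}
```

With these numbers, slab_for(inode_core_bytes()) and slab_for(inode_core_bytes() + 2 * IOVEC_BYTES) land in the same cache, and the same holds for the full inode with 3 iovecs.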

A similar pattern holds for the buffer log items.  The minimum
it will be is:

vec 1:	buf log format item
vec 2:	single 128 byte chunk

This requires 12 + 40B + 4 + 12 + 128B + 4 = 200 bytes. For two
vectors, we need 32 bytes for the iovec array, so a total of 232
bytes is needed, and this will fit in a 256 byte slab with or
without the iovec array attached.

The same situation occurs as we increase the number of logged
regions or the size of the logged regions - in almost all cases we
get the iovec array for free because we log 128 byte regions out of
buffers and they will put us into the next largest size slab
regardless of the memory used by the iovec array.
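
A rough way to explore this for buffer log items is the userspace sketch below. The function names are hypothetical, and it assumes the figures from the minimum case above stay constant as regions are added: a 40 byte format item, 128 byte chunks, a 12 byte op header and 4 byte pad per vector, and 16 bytes per iovec. In particular, any growth of the format item with more logged regions is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest slab cache, out of the sizes listed earlier, that fits @bytes. */
static size_t slab_for(size_t bytes)
{
	static const size_t sizes[] = {
		8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
	};
	size_t i;

	assert(bytes > 0);
	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		if (bytes <= sizes[i])
			return sizes[i];
	}
	return 0;	/* larger than any listed cache */
}

/* Logged data: one format item vector plus @nchunks 128 byte chunks. */
static size_t buf_item_bytes(unsigned int nchunks)
{
	return (12 + 40 + 4) + (size_t)nchunks * (12 + 128 + 4);
}

/* Index array: one iovec for the format item plus one per chunk. */
static size_t buf_iovec_bytes(unsigned int nchunks)
{
	return ((size_t)nchunks + 1) * 16;
}
```

Comparing slab_for(buf_item_bytes(n)) with slab_for(buf_item_bytes(n) + buf_iovec_bytes(n)) for a range of n shows where, under these assumptions, the iovec array does or does not change the slab cache used. Real numbers from a running system would still be needed to settle the question.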

Hence we should almost always get the space for the iovec array for
free from the slab allocator, and separating it out doesn't actually
reduce slab cache memory usage. If anything, it increases it,
because now we are allocating the iovec array out of small slabs and
so instead of it being "free" the memory usage is now accounted to
smaller slabs...

-----

Hence before we go any further with this patch set, I'd like to see
numbers that quantify how much extra memory the embedded iovec
array is actually costing us. And from that, an explanation of why
the above "iovec array space should be cost-free" logic isn't
working as intended....

-Dave.

Christoph Hellwig July 1, 2024, 4:49 a.m. UTC | #2
On Mon, Jul 01, 2024 at 10:51:13AM +1000, Dave Chinner wrote:
> Here's the logic - the iovec array is largely "free" with the larger
> data allocation.

What the patch does is free the data allocation, that is the shadow
buffer, earlier.  Which would save quite a bit of memory indeed ... if
we didn't expect the shadow buffer to be needed again a little later
anyway, which AFAIK is the assumption under which the CIL code operates.

So as asked previously and by you again here I'd love to see numbers
for workloads where this actually is a benefit.

Dave Chinner July 2, 2024, 11:44 p.m. UTC | #3
On Sun, Jun 30, 2024 at 09:49:03PM -0700, Christoph Hellwig wrote:
> On Mon, Jul 01, 2024 at 10:51:13AM +1000, Dave Chinner wrote:
> > Here's the logic - the iovec array is largely "free" with the larger
> > data allocation.
> 
> What the patch does is free the data allocation, that is the shadow
> buffer, earlier.  Which would save quite a bit of memory indeed ... if
> we didn't expect the shadow buffer to be needed again a little later
> anyway, which AFAIK is the assumption under which the CIL code operates.

Ah, ok, my bad. I missed that because the xfs_log_iovec is not the
data buffer - it is specifically just the iovec array that indexes
the data buffer. Everything in the commit message references the
xfs_log_iovec, and makes no mention of the actual logged metadata
that is being stored, and I didn't catch that the submitter was
using xfs_log_iovec to mean something different to what I understand
it to be from looking at the code. That's why I take the time to
explain my reasoning - so that people aren't in any doubt about how
I interpreted the changes and can easily point out where I've gone
wrong. :)
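
To make the terminology concrete, here is a userspace sketch of the pre-patch single-allocation layout: header first, then the iovec index array, then the data buffer. The demo_* structures and functions are hypothetical stand-ins rather than the kernel definitions; only the layout arithmetic mirrors the removed lines of xlog_cil_iovec_space() and xlog_cil_alloc_shadow_bufs() in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an iovec: indexes one region of the data. */
struct demo_iovec {
	void	*i_addr;	/* start of a region inside the data buffer */
	int	i_len;		/* length of that region */
	int	i_type;		/* region type tag */
};

/* Hypothetical stand-in for the log vector header. */
struct demo_log_vec {
	struct demo_iovec *lv_iovecp;	/* index array, follows the header */
	char		*lv_buf;	/* data buffer, follows the array */
	int		lv_niovecs;	/* number of entries in the array */
	size_t		lv_size;	/* size of the whole allocation */
};

/* Header plus index array, rounded up to 8 bytes as in the old helper. */
static size_t demo_iovec_space(int niovecs)
{
	size_t n = sizeof(struct demo_log_vec) +
		   (size_t)niovecs * sizeof(struct demo_iovec);

	return (n + sizeof(uint64_t) - 1) & ~(sizeof(uint64_t) - 1);
}

/* One allocation holding header, index array and @nbytes of data. */
static struct demo_log_vec *demo_alloc_combined(int niovecs, size_t nbytes)
{
	size_t size = demo_iovec_space(niovecs) + nbytes;
	struct demo_log_vec *lv = calloc(1, size);

	if (!lv)
		return NULL;
	lv->lv_iovecp = (struct demo_iovec *)&lv[1];
	lv->lv_buf = (char *)lv + demo_iovec_space(niovecs);
	lv->lv_niovecs = niovecs;
	lv->lv_size = size;
	return lv;
}
```

In this layout the index array is small compared with the data region behind lv_buf, which is the distinction between "the iovec array" and "the data buffer" drawn above.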

> So as asked previously and by you again here I'd love to see numbers
> for workloads where this actually is a benefit.

Yup, it doesn't change the basic premise that no allocations in the
fast path is faster than doing even one allocation in the fast
path. I made the explicit design choice to consume that
memory as a necessary cost of going fast, and the memory is already
being consumed while the objects are sitting and being relogged in
the CIL before the CIL is formatted and checkpointed.

Hence I'm not sure that freeing it before the checkpoint IO is
submitted actually reduces the memory footprint significantly at
all. Numbers and workloads are definitely needed.

Cheers,

Dave.

Christoph Hellwig July 3, 2024, 5:14 a.m. UTC | #4
On Wed, Jul 03, 2024 at 09:44:36AM +1000, Dave Chinner wrote:
> Ah, ok, my bad. I missed that because the xfs_log_iovec is not the
> data buffer - it is specifically just the iovec array that indexes
> the data buffer. Everything in the commit message references the
> xfs_log_iovec, and makes no mention of the actual logged metadata
> that is being stored, and I didn't catch that the submitter was
> using xfs_log_iovec to mean something different to what I understand
> it to be from looking at the code. That's why I take the time to
> explain my reasoning - so that people aren't in any doubt about how
> I interpreted the changes and can easily point out where I've gone
> wrong. :) 

And throw in the xfs_log_vec vs xfs_log_iovec naming that keeps
confusing me after all these years..
Patch

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 49e676061f2f..84a01ce61c96 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2527,6 +2527,8 @@  xlog_write(
 			xlog_write_full(lv, ticket, iclog, &log_offset,
 					 &len, &record_cnt, &data_cnt);
 		}
+		if (lv->lv_flags & XFS_LOG_VEC_DYNAMIC)
+			kvfree(lv->lv_iovecp);
 	}
 	ASSERT(len == 0);
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 9cc10acf7bcd..7d0ae93e9e79 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -6,6 +6,8 @@ 
 #ifndef	__XFS_LOG_H__
 #define __XFS_LOG_H__
 
+#define XFS_LOG_VEC_DYNAMIC	(1 << 0)
+
 struct xfs_cil_ctx;
 
 struct xfs_log_vec {
@@ -17,7 +19,8 @@  struct xfs_log_vec {
 	char			*lv_buf;	/* formatted buffer */
 	int			lv_bytes;	/* accounted space in buffer */
 	int			lv_buf_len;	/* aligned size of buffer */
-	int			lv_size;	/* size of allocated lv */
+	int			lv_size;	/* size of allocated iovecp + buf */
+	int			lv_flags;	/* lv flags */
 };
 
 extern struct kmem_cache *xfs_log_vec_cache;
@@ -71,7 +74,8 @@  xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec,
 	vec->i_len = len;
 
 	/* Catch buffer overruns */
-	ASSERT((void *)lv->lv_buf + lv->lv_bytes <= (void *)lv + lv->lv_size);
+	ASSERT((void *)lv->lv_buf + lv->lv_bytes <=
+	       (void *)lv->lv_iovecp + lv->lv_size);
 }
 
 /*
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index f51cbc6405c1..0175bd68590a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -219,8 +219,7 @@  static inline int
 xlog_cil_iovec_space(
 	uint	niovecs)
 {
-	return round_up((sizeof(struct xfs_log_vec) +
-					niovecs * sizeof(struct xfs_log_iovec)),
+	return round_up(niovecs * sizeof(struct xfs_log_iovec),
 			sizeof(uint64_t));
 }
 
@@ -279,6 +278,7 @@  xlog_cil_alloc_shadow_bufs(
 
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 		struct xfs_log_vec *lv;
+		struct xfs_log_iovec *lvec;
 		int	niovecs = 0;
 		int	nbytes = 0;
 		int	buf_size;
@@ -330,8 +330,8 @@  xlog_cil_alloc_shadow_bufs(
 		 * if we have no shadow buffer, or it is too small, we need to
 		 * reallocate it.
 		 */
-		if (!lip->li_lv_shadow ||
-		    buf_size > lip->li_lv_shadow->lv_size) {
+		lv = lip->li_lv_shadow;
+		if (!lv || buf_size > lv->lv_size) {
 			/*
 			 * We free and allocate here as a realloc would copy
 			 * unnecessary data. We don't use kvzalloc() for the
@@ -339,22 +339,27 @@  xlog_cil_alloc_shadow_bufs(
 			 * the buffer, only the log vector header and the iovec
 			 * storage.
 			 */
-			kvfree(lip->li_lv_shadow);
-			lv = xlog_kvmalloc(buf_size);
+			if (lv)
+				kvfree(lv->lv_iovecp);
+			else
+				lv = kmem_cache_alloc(xfs_log_vec_cache,
+						GFP_KERNEL | __GFP_NOFAIL);
 
-			memset(lv, 0, xlog_cil_iovec_space(niovecs));
+			memset(lv, 0, sizeof(struct xfs_log_vec));
+			lvec = xlog_kvmalloc(buf_size);
+			memset(lvec, 0, xlog_cil_iovec_space(niovecs));
 
+			lv->lv_flags |= XFS_LOG_VEC_DYNAMIC;
 			INIT_LIST_HEAD(&lv->lv_list);
 			lv->lv_item = lip;
 			lv->lv_size = buf_size;
 			if (ordered)
 				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
 			else
-				lv->lv_iovecp = (struct xfs_log_iovec *)&lv[1];
+				lv->lv_iovecp = lvec;
 			lip->li_lv_shadow = lv;
 		} else {
 			/* same or smaller, optimise common overwrite case */
-			lv = lip->li_lv_shadow;
 			if (ordered)
 				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
 			else
@@ -366,9 +371,9 @@  xlog_cil_alloc_shadow_bufs(
 		lv->lv_niovecs = niovecs;
 
 		/* The allocated data region lies beyond the iovec region */
-		lv->lv_buf = (char *)lv + xlog_cil_iovec_space(niovecs);
+		lv->lv_buf = (char *)lv->lv_iovecp +
+				xlog_cil_iovec_space(niovecs);
 	}
-
 }
 
 /*
@@ -502,7 +507,7 @@  xlog_cil_insert_format_items(
 			/* reset the lv buffer information for new formatting */
 			lv->lv_buf_len = 0;
 			lv->lv_bytes = 0;
-			lv->lv_buf = (char *)lv +
+			lv->lv_buf = (char *)lv->lv_iovecp +
 					xlog_cil_iovec_space(lv->lv_niovecs);
 		} else {
 			/* switch to shadow buffer! */
@@ -703,7 +708,7 @@  xlog_cil_free_logvec(
 	while (!list_empty(lv_chain)) {
 		lv = list_first_entry(lv_chain, struct xfs_log_vec, lv_list);
 		list_del_init(&lv->lv_list);
-		kvfree(lv);
+		kmem_cache_free(xfs_log_vec_cache, lv);
 	}
 }
 
@@ -1544,7 +1549,8 @@  xlog_cil_process_intents(
 		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
 		trace_xfs_cil_whiteout_mark(ilip);
 		len += ilip->li_lv->lv_bytes;
-		kvfree(ilip->li_lv);
+		kvfree(ilip->li_lv->lv_iovecp);
+		kmem_cache_free(xfs_log_vec_cache, ilip->li_lv);
 		ilip->li_lv = NULL;
 
 		xfs_trans_del_item(lip);