Message ID | 20240626044909.15060-3-alexjlzheng@tencent.com (mailing list archive) |
---|---|
State | Accepted, archived |
Series | Separate xfs_log_vec/iovec to save memory |
On Wed, Jun 26, 2024 at 12:49:09PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
>
> When the contents of the xfs_log_vec/xfs_log_iovec combination are
> written to iclog, xfs_log_iovec loses its meaning in continuing to exist
> in memory, because iclog already has a copy of its contents. We only
> need to keep xfs_log_vec that takes up very little memory to find the
> xfs_log_item that needs to be added to AIL after we flush the iclog into
> the disk log space.
>
> Because xfs_log_iovec dominates most of the memory in the
> xfs_log_vec/xfs_log_iovec combination, retaining xfs_log_iovec until
> iclog is flushed into the disk log space and releasing together with
> xfs_log_vec is a significant waste of memory.

Have you measured this? Please provide numbers and the workload that
generates them, because when I did this combined structure the
numbers and performance measured came out decisively on the side of
"almost no difference in memory usage, major performance cost to
doing a second allocation"...

Here's the logic - the iovec array is largely "free" with the larger
data allocation.

------

Look at how the heap is structured - it is in power of 2 slab sizes:

$ grep kmalloc /proc/slabinfo |tail -13
kmalloc-8k         949    976   8192    4    8 : tunables    0    0    0 : slabdata    244    244      0
kmalloc-4k        1706   1768   4096    8    8 : tunables    0    0    0 : slabdata    221    221      0
kmalloc-2k        3252   3312   2048   16    8 : tunables    0    0    0 : slabdata    207    207      0
kmalloc-1k       76110  96192   1024   32    8 : tunables    0    0    0 : slabdata   3006   3006      0
kmalloc-512      71753  98656    512   32    4 : tunables    0    0    0 : slabdata   3083   3083      0
kmalloc-256      71006  71520    256   32    2 : tunables    0    0    0 : slabdata   2235   2235      0
kmalloc-192      10304  10458    192   42    2 : tunables    0    0    0 : slabdata    249    249      0
kmalloc-128       8889   9280    128   32    1 : tunables    0    0    0 : slabdata    290    290      0
kmalloc-96       13583  13902     96   42    1 : tunables    0    0    0 : slabdata    331    331      0
kmalloc-64       63116  64640     64   64    1 : tunables    0    0    0 : slabdata   1010   1010      0
kmalloc-32      552726 582272     32  128    1 : tunables    0    0    0 : slabdata   4549   4549      0
kmalloc-16      444768 445440     16  256    1 : tunables    0    0    0 : slabdata   1740   1740      0
kmalloc-8        18178  18432      8  512    1 : tunables    0    0    0 : slabdata     36     36      0

IOWs, if we do a 260 byte allocation, we get the same sized memory
chunk as a 512 byte allocation as they come from the same slab
cache.

If we now look at structure sizes - the common ones are buffers and
inodes so we'll look at them.

For an inode, we typically log something like this for an extent
allocation (or free) on a mostly contiguous inode (say less than 10
extents)

vec 1:	inode log format item
vec 2:	inode core
vec 3:	inode data fork

Each of these vectors has a 12 byte log op header built into them,
and some padding to round them out to 8 byte alignment.

vec 1:	inode log format item:	12 + 56 + 4 (pad)
vec 2:	inode core:		12 + 176 + 4 (pad)
vec 3:	inode data fork:	12 + 16 (minimum) + 4 (pad)
				12 + 336 (maximum for 512 byte inode)

If we are just logging the inode core, we are allocating
12 + 56 + 4 + 12 + 176 + 4 = 264 bytes.

It should be obvious now that this must be allocated from the 512
byte slab, and that means we have another 248 bytes of unused space
in that allocated region we can actually use -for free-.

IOWs, the fact that we add 32 bytes for the 2 iovecs to index this
inode log item doesn't matter at all - it's free space on the heap.

Indeed, it's not until the inode data fork gets to a couple of
hundred bytes in length that we overflow the 512 byte slab and have
to use the 1kB slab. Again, we get the iovec array space for free.

If we are logging the entire inode with the data fork, then the
size of the data being logged is 264 + 12 + 336 + 4 = 616 bytes.
This is well over the 512 byte slab, so we are always going to be
allocating from the 1kB slab. We get the iovec array for free the
moment we go over the 512 byte threshold again.

IOWs, all the separation of the iovec array does is slightly change
the data/attr fork size thresholds where we go from using the 512
byte slab to the 1kB slab.

A similar pattern holds out for the buffer log items. The minimum it
will be is:

vec 1:	buf log format item
vec 2:	single 128 byte chunk

This requires 12 + 40B + 4 + 12 + 128B + 4 = 200 bytes. For two
vectors, we need 32 bytes for the iovec array, so a total of 232
bytes is needed, and this will fit in a 256 byte slab with or
without the iovec array attached.

The same situation occurs as we increase the number of logged
regions or the size of the logged regions - in almost all cases we
get the iovec array for free because we log 128 byte regions out of
buffers and they will put us into the next largest size slab
regardless of the memory used by the iovec array.

Hence we should almost always get the space for the iovec array for
free from the slab allocator, and separating it out doesn't actually
reduce slab cache memory usage.
If anything, it increases it, because now we are allocating the
iovec array out of small slabs and so instead of it being "free"
the memory usage is now accounted to smaller slabs...

-----

Hence before we go any further with this patch set, I'd like to see
numbers that quantify how much extra memory the embedded iovec array
is actually costing us. And from that, an explanation of why the
above "iovec array space should be cost-free" logic isn't working as
intended....

-Dave.
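
The slab arithmetic in the message above can be checked with a small userspace sketch. It assumes the size classes shown in the slabinfo listing, a 12 byte op header per vector rounded up to 8 byte alignment, and 16 bytes per iovec (the message's "32 bytes for the 2 iovecs"). It ignores the struct xfs_log_vec header itself, as the message does. The helper names slab_bucket(), vec_bytes() and iovec_bytes() are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* kmalloc size classes as listed in the /proc/slabinfo output above. */
static const size_t slab_sizes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Hypothetical helper: smallest listed slab that fits the request, 0 if none. */
static size_t slab_bucket(size_t bytes)
{
	size_t i;

	for (i = 0; i < sizeof(slab_sizes) / sizeof(slab_sizes[0]); i++)
		if (bytes <= slab_sizes[i])
			return slab_sizes[i];
	return 0;
}

/* Hypothetical helper: 12 byte op header plus payload, rounded up to 8 bytes. */
static size_t vec_bytes(size_t payload)
{
	return (12 + payload + 7) & ~(size_t)7;
}

/* Assumed iovec index cost: 16 bytes per vector (32 bytes for 2, per the text). */
static size_t iovec_bytes(size_t niovecs)
{
	assert(niovecs > 0);
	return niovecs * 16;
}
```

With these helpers, the inode core case is vec_bytes(56) + vec_bytes(176) = 264 bytes, and slab_bucket() returns 512 for both 264 and 264 + iovec_bytes(2). The minimum buffer case is vec_bytes(40) + vec_bytes(128) = 200 bytes, which lands in the 256 byte slab with or without the 32 byte iovec array.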
On Mon, Jul 01, 2024 at 10:51:13AM +1000, Dave Chinner wrote:
> Here's the logic - the iovec array is largely "free" with the larger
> data allocation.

What the patch does is to free the data allocation, that is the shadow
buffer, earlier. Which would save quite a bit of memory indeed ... if
we didn't expect the shadow buffer to be needed again a little later
anyway, which AFAIK is the assumption under which the CIL code operates.

So as asked previously and by you again here I'd love to see numbers
for workloads where this actually is a benefit.
On Sun, Jun 30, 2024 at 09:49:03PM -0700, Christoph Hellwig wrote:
> On Mon, Jul 01, 2024 at 10:51:13AM +1000, Dave Chinner wrote:
> > Here's the logic - the iovec array is largely "free" with the larger
> > data allocation.
>
> What the patch does is to free the data allocation, that is the shadow
> buffer, earlier. Which would save quite a bit of memory indeed ... if
> we didn't expect the shadow buffer to be needed again a little later
> anyway, which AFAIK is the assumption under which the CIL code operates.

Ah, ok, my bad. I missed that because the xfs_log_iovec is not the
data buffer - it is specifically just the iovec array that indexes
the data buffer. Everything in the commit message references the
xfs_log_iovec, and makes no mention of the actual logged metadata
that is being stored, and I didn't catch that the submitter was
using xfs_log_iovec to mean something different to what I understand
it to be from looking at the code. That's why I take the time to
explain my reasoning - so that people aren't in any doubt about how
I interpreted the changes and can easily point out where I've gone
wrong. :)

> So as asked previously and by you again here I'd love to see numbers
> for workloads where this actually is a benefit.

Yup, it doesn't change the basic premise that no allocations in the
fast path is faster than doing even one allocation in the fast path.
I made the explicit design choice to consume that memory as a
necessary cost of going fast, and the memory is already being
consumed while the objects are sitting and being relogged in the CIL
before the CIL is formatted and checkpointed. Hence I'm not sure
that freeing it before the checkpoint IO is submitted actually
reduces the memory footprint significantly at all.

Numbers and workloads are definitely needed.

Cheers,

Dave.
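
The trade-off discussed in the two messages above (keep the buffer and avoid fast-path allocations, or free it early and allocate again on the next relog) can be illustrated with a deliberately simplified userspace model. This is only a sketch under stated assumptions: struct shadow, shadow_get() and shadow_drop() are hypothetical names, malloc()/free() stand in for the kernel allocators, and the real CIL code's swapping between the active and shadow buffers is not modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a log item's retained buffer. */
struct shadow {
	void		*buf;
	size_t		size;
	unsigned long	allocs;	/* number of times we had to allocate */
};

/* Reuse the buffer when it is big enough; otherwise free and allocate. */
static void *shadow_get(struct shadow *s, size_t need)
{
	assert(need > 0);
	if (!s->buf || need > s->size) {
		free(s->buf);
		s->buf = malloc(need);
		s->size = s->buf ? need : 0;
		s->allocs++;
	}
	return s->buf;
}

/* Models freeing the data early, so the next relog must allocate again. */
static void shadow_drop(struct shadow *s)
{
	free(s->buf);
	s->buf = NULL;
	s->size = 0;
}
```

In this model, an item that is relogged 100 times at the same size allocates once if the buffer is kept, and 100 times if shadow_drop() is called after every use. Whether the memory saved between those points outweighs the extra allocations is exactly what the requested numbers would have to show.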
On Wed, Jul 03, 2024 at 09:44:36AM +1000, Dave Chinner wrote:
> Ah, ok, my bad. I missed that because the xfs_log_iovec is not the
> data buffer - it is specifically just the iovec array that indexes
> the data buffer. Everything in the commit message references the
> xfs_log_iovec, and makes no mention of the actual logged metadata
> that is being stored, and I didn't catch that the submitter was
> using xfs_log_iovec to mean something different to what I understand
> it to be from looking at the code. That's why I take the time to
> explain my reasoning - so that people aren't in any doubt about how
> I interpreted the changes and can easily point out where I've gone
> wrong. :)

And throw in the xfs_log_vec vs xfs_log_iovec naming that keeps
confusing me after all these years..
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 49e676061f2f..84a01ce61c96 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2527,6 +2527,8 @@ xlog_write(
 			xlog_write_full(lv, ticket, iclog, &log_offset,
 					 &len, &record_cnt, &data_cnt);
 		}
+		if (lv->lv_flags & XFS_LOG_VEC_DYNAMIC)
+			kvfree(lv->lv_iovecp);
 	}
 	ASSERT(len == 0);
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 9cc10acf7bcd..7d0ae93e9e79 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -6,6 +6,8 @@
 #ifndef	__XFS_LOG_H__
 #define __XFS_LOG_H__
 
+#define XFS_LOG_VEC_DYNAMIC	(1 << 0)
+
 struct xfs_cil_ctx;
 
 struct xfs_log_vec {
@@ -17,7 +19,8 @@ struct xfs_log_vec {
 	char			*lv_buf;	/* formatted buffer */
 	int			lv_bytes;	/* accounted space in buffer */
 	int			lv_buf_len;	/* aligned size of buffer */
-	int			lv_size;	/* size of allocated lv */
+	int			lv_size;	/* size of allocated iovecp + buf */
+	int			lv_flags;	/* lv flags */
 };
 
 extern struct kmem_cache *xfs_log_vec_cache;
@@ -71,7 +74,8 @@ xlog_finish_iovec(struct xfs_log_vec *lv, struct xfs_log_iovec *vec,
 	vec->i_len = len;
 
 	/* Catch buffer overruns */
-	ASSERT((void *)lv->lv_buf + lv->lv_bytes <= (void *)lv + lv->lv_size);
+	ASSERT((void *)lv->lv_buf + lv->lv_bytes <=
+				(void *)lv->lv_iovecp + lv->lv_size);
 }
 
 /*
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index f51cbc6405c1..0175bd68590a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -219,8 +219,7 @@ static inline int
 xlog_cil_iovec_space(
 	uint	niovecs)
 {
-	return round_up((sizeof(struct xfs_log_vec) +
-					niovecs * sizeof(struct xfs_log_iovec)),
+	return round_up(niovecs * sizeof(struct xfs_log_iovec),
 			sizeof(uint64_t));
 }
 
@@ -279,6 +278,7 @@ xlog_cil_alloc_shadow_bufs(
 
 	list_for_each_entry(lip, &tp->t_items, li_trans) {
 		struct xfs_log_vec *lv;
+		struct xfs_log_iovec *lvec;
 		int	niovecs = 0;
 		int	nbytes = 0;
 		int	buf_size;
@@ -330,8 +330,8 @@ xlog_cil_alloc_shadow_bufs(
 		 * if we have no shadow buffer, or it is too small, we need to
 		 * reallocate it.
 		 */
-		if (!lip->li_lv_shadow ||
-		    buf_size > lip->li_lv_shadow->lv_size) {
+		lv = lip->li_lv_shadow;
+		if (!lv || buf_size > lv->lv_size) {
 			/*
 			 * We free and allocate here as a realloc would copy
 			 * unnecessary data. We don't use kvzalloc() for the
@@ -339,22 +339,27 @@ xlog_cil_alloc_shadow_bufs(
 			 * the buffer, only the log vector header and the iovec
 			 * storage.
 			 */
-			kvfree(lip->li_lv_shadow);
-			lv = xlog_kvmalloc(buf_size);
+			if (lv)
+				kvfree(lv->lv_iovecp);
+			else
+				lv = kmem_cache_alloc(xfs_log_vec_cache,
+						GFP_KERNEL | __GFP_NOFAIL);
 
-			memset(lv, 0, xlog_cil_iovec_space(niovecs));
+			memset(lv, 0, sizeof(struct xfs_log_vec));
+			lvec = xlog_kvmalloc(buf_size);
+			memset(lvec, 0, xlog_cil_iovec_space(niovecs));
 
+			lv->lv_flags |= XFS_LOG_VEC_DYNAMIC;
 			INIT_LIST_HEAD(&lv->lv_list);
 			lv->lv_item = lip;
 			lv->lv_size = buf_size;
 			if (ordered)
 				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
 			else
-				lv->lv_iovecp = (struct xfs_log_iovec *)&lv[1];
+				lv->lv_iovecp = lvec;
 			lip->li_lv_shadow = lv;
 		} else {
 			/* same or smaller, optimise common overwrite case */
-			lv = lip->li_lv_shadow;
 			if (ordered)
 				lv->lv_buf_len = XFS_LOG_VEC_ORDERED;
 			else
@@ -366,9 +371,9 @@ xlog_cil_alloc_shadow_bufs(
 		lv->lv_niovecs = niovecs;
 
 		/* The allocated data region lies beyond the iovec region */
-		lv->lv_buf = (char *)lv + xlog_cil_iovec_space(niovecs);
+		lv->lv_buf = (char *)lv->lv_iovecp +
+				xlog_cil_iovec_space(niovecs);
 	}
-
 }
 
 /*
@@ -502,7 +507,7 @@ xlog_cil_insert_format_items(
 			/* reset the lv buffer information for new formatting */
 			lv->lv_buf_len = 0;
 			lv->lv_bytes = 0;
-			lv->lv_buf = (char *)lv +
+			lv->lv_buf = (char *)lv->lv_iovecp +
 					xlog_cil_iovec_space(lv->lv_niovecs);
 		} else {
 			/* switch to shadow buffer! */
@@ -703,7 +708,7 @@ xlog_cil_free_logvec(
 	while (!list_empty(lv_chain)) {
 		lv = list_first_entry(lv_chain, struct xfs_log_vec, lv_list);
 		list_del_init(&lv->lv_list);
-		kvfree(lv);
+		kmem_cache_free(xfs_log_vec_cache, lv);
 	}
 }
 
@@ -1544,7 +1549,8 @@ xlog_cil_process_intents(
 		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
 		trace_xfs_cil_whiteout_mark(ilip);
 		len += ilip->li_lv->lv_bytes;
-		kvfree(ilip->li_lv);
+		kvfree(ilip->li_lv->lv_iovecp);
+		kmem_cache_free(xfs_log_vec_cache, ilip->li_lv);
 		ilip->li_lv = NULL;
 
 		xfs_trans_del_item(lip);
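
The layout change made by the diff above can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: struct log_iovec and struct log_vec are cut-down stand-ins that reuse the field names from the diff, calloc()/malloc()/free() replace kmem_cache_alloc(), xlog_kvmalloc() and kvfree(), and the helpers log_vec_alloc(), log_vec_release_data() and log_vec_free() are hypothetical. Clearing the pointers after the early free is a choice made for this sketch; the patch itself does not do so.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-ins; field names follow the diff, layout is illustrative. */
struct log_iovec {
	void		*i_addr;
	int		i_len;
	unsigned int	i_type;
};

struct log_vec {
	struct log_iovec *lv_iovecp;	/* iovec array, followed by the data */
	char		*lv_buf;	/* formatted buffer */
	int		lv_niovecs;
	int		lv_size;	/* size of allocated iovecp + buf */
};

/* Mirrors the patched xlog_cil_iovec_space(): iovec array rounded up to 8. */
static size_t iovec_space(unsigned int niovecs)
{
	size_t bytes = niovecs * sizeof(struct log_iovec);

	return (bytes + sizeof(uint64_t) - 1) & ~(sizeof(uint64_t) - 1);
}

/* Split layout: a small header plus one block holding iovecs, then data. */
static struct log_vec *log_vec_alloc(unsigned int niovecs, size_t nbytes)
{
	size_t size = iovec_space(niovecs) + nbytes;
	struct log_vec *lv;

	assert(niovecs > 0);
	lv = calloc(1, sizeof(*lv));
	if (!lv)
		return NULL;
	lv->lv_iovecp = malloc(size);
	if (!lv->lv_iovecp) {
		free(lv);
		return NULL;
	}
	/* Only the iovec index is zeroed, as in the patch. */
	memset(lv->lv_iovecp, 0, iovec_space(niovecs));
	lv->lv_niovecs = (int)niovecs;
	lv->lv_size = (int)size;
	/* The data region lies beyond the iovec region. */
	lv->lv_buf = (char *)lv->lv_iovecp + iovec_space(niovecs);
	return lv;
}

/* Drop the iovec + data block early; the header remains usable. */
static void log_vec_release_data(struct log_vec *lv)
{
	free(lv->lv_iovecp);
	lv->lv_iovecp = NULL;
	lv->lv_buf = NULL;
}

static void log_vec_free(struct log_vec *lv)
{
	if (!lv)
		return;
	free(lv->lv_iovecp);
	free(lv);
}
```

The point of the split is visible in log_vec_release_data(): the larger block can be freed while the small header stays valid. The cost, as raised in the review, is the second allocation in log_vec_alloc().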