Message ID | 74441433f31eeb7f0c9fb49a04173ac2417d349a.1541187228.git.osandov@fb.com (mailing list archive) |
---|---|
State | Accepted, archived |
Series | xfs: cache minimum realtime summary level |
On Fri, Nov 02, 2018 at 12:38:00PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> The realtime summary is a two-dimensional array on disk, effectively:
>
> u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
>
> rsum[log][bbno] is the number of extents of size 2**log which start in
> bitmap block bbno.
>
> xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
> rsum[log][bbno] != 0 for any log level. However, the summary array is
> stored in row-major order (i.e., like an array in C), so all of these
> entries are not adjacent, but rather spread across the entire summary
> file. In the worst case (a full bitmap block), xfs_rtany_summary() has
> to check every level.
>
> This means that on a moderately-used realtime device, an allocation will
> waste a lot of time finding, reading, and releasing buffers for the
> realtime summary. In particular, one of our storage services (which runs
> on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
> spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
> xfs_trans_brelse() called from xfs_rtany_summary().

Yup, the RT allocator is showing that it was never intended for
general purpose data storage workloads... :P

So how much memory would it require to keep an in-memory copy of the
summary information? i.e. do an in-memory copy search, then once the
block is found, pull in the buffer that needs modifying and log it?
That gets rid of the buffer management overhead, and potentially
allows for more efficient search algorithms to be used.

> One solution would be to swap the dimensions of the summary array so
> that different sizes on the same bitmap block are adjacent. However,
> this would require a disk format change to a very old component of XFS,
> and it would slow down xfs_rtallocate_extent_size().

Or maybe we should put a second summary on disk, ordered the other
way similar to how the btree allocator freespace indexes are
organised.

> Instead, we can cache the minimum size which contains any extents. We do
> so lazily; rather than guaranteeing that the cache contains the precise
> minimum, it always contains a loose lower bound which we tighten when we
> read or update a summary block. This only uses a few kilobytes of memory
> and is already serialized via the realtime bitmap and summary inode
> locks, so the cost is minimal. With this change, the same workload only
> spends 0.2% of its CPU cycles in the realtime allocator.

Good win.

> @@ -1187,8 +1195,8 @@ xfs_rtmount_init(
> }
>
> /*
> - * Get the bitmap and summary inodes into the mount structure
> - * at mount time.
> + * Get the bitmap and summary inodes and the summary cache into the mount
> + * structure at mount time.
> */
> int /* error */
> xfs_rtmount_inodes(
> @@ -1211,6 +1219,16 @@ xfs_rtmount_inodes(
> return error;
> }
> ASSERT(mp->m_rsumip != NULL);
> + /*
> + * The rsum cache is initialized to all zeroes, which trivially
> + * satisfies the invariant.

Satisfies what invariant? Please explain why this is OK and how it
interacts with the allocation code so we're not left guessing what
this means in the future.

Remember: comments are not for now - they are for 10 years time when
no-one remembers what this code does or is actually used for.

> + */
> + mp->m_rsum_cache = kvzalloc(sbp->sb_rbmblocks, GFP_KERNEL);

kmem_zalloc_large().

Which makes me ask, just how large is this allocation if you're using
vmalloc() straight out of the box?

> + if (!mp->m_rsum_cache) {
> + xfs_irele(mp->m_rbmip);
> + xfs_irele(mp->m_rsumip);
> + return -ENOMEM;
> + }
> return 0;
> }
>
> @@ -1218,6 +1236,7 @@ void
> xfs_rtunmount_inodes(
> struct xfs_mount *mp)
> {
> + kvfree(mp->m_rsum_cache);

kmem_free().

Cheers,

Dave.
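To make the access pattern behind the buffer churn concrete: the summary
entry for (log, bbno) sits at linear offset log * rbmblocks + bbno in the
summary file, so consecutive levels for the same bitmap block are a whole
row apart. Here is a minimal userspace sketch of that indexing (values and
names are illustrative only, not kernel code; it assumes the 8 TB /
256 kB-extent example from the commit message):

/*
 * Userspace sketch, not kernel code: model of the row-major on-disk
 * summary indexing described in the commit message.
 */
#include <stdio.h>

int main(void)
{
	unsigned int rbmblocks = 931;	/* blocks in the realtime bitmap */
	unsigned int per_block = 4096 / sizeof(unsigned int); /* u32s per 4 KiB block */
	unsigned int bbno = 0;		/* bitmap block being searched */
	unsigned int log;

	for (log = 0; log < 25; log++) {
		unsigned int so = log * rbmblocks + bbno;
		/* each level probed is rbmblocks entries (~one block) away */
		printf("log %2u -> entry %5u, summary block %2u\n",
		       log, so, so / per_block);
	}
	return 0;
}

With ~1024 u32 entries per 4 KiB block and ~931 entries per row, nearly
every level probed lands in a different summary block, hence a different
buffer lookup and release in xfs_rtbuf_get()/xfs_trans_brelse().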
On Tue, Nov 06, 2018 at 10:49:48AM +1100, Dave Chinner wrote:
> On Fri, Nov 02, 2018 at 12:38:00PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval <osandov@fb.com>
> >
> > The realtime summary is a two-dimensional array on disk, effectively:
> >
> > u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
> >
> > rsum[log][bbno] is the number of extents of size 2**log which start in
> > bitmap block bbno.
> >
> > xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
> > rsum[log][bbno] != 0 for any log level. However, the summary array is
> > stored in row-major order (i.e., like an array in C), so all of these
> > entries are not adjacent, but rather spread across the entire summary
> > file. In the worst case (a full bitmap block), xfs_rtany_summary() has
> > to check every level.
> >
> > This means that on a moderately-used realtime device, an allocation will
> > waste a lot of time finding, reading, and releasing buffers for the
> > realtime summary. In particular, one of our storage services (which runs
> > on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
> > spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
> > xfs_trans_brelse() called from xfs_rtany_summary().
>
> Yup, the RT allocator is showing that it was never intended for
> general purpose data storage workloads... :P

Indeed. What they really want is the AG allocator with metadata on a
separate device, but that sounds like a much bigger project.

> So how much memory would it require to keep an in-memory copy of
> the summary information? i.e. do an in-memory copy search, then once
> the block is found, pull in the buffer that needs modifying and log
> it? That gets rid of the buffer management overhead, and potentially
> allows for more efficient search algorithms to be used.

Yeah, I considered that. We use 256kB realtime extents for the 8 TB
filesystem, so it's 100kB. If we were using 4kB realtime extents, it'd
be about 7MB. So, it's doable but not the best.

> > One solution would be to swap the dimensions of the summary array so
> > that different sizes on the same bitmap block are adjacent. However,
> > this would require a disk format change to a very old component of XFS,
> > and it would slow down xfs_rtallocate_extent_size().
>
> Or maybe we should put a second summary on disk, ordered the other
> way similar to how the btree allocator freespace indexes are
> organised.

That's another thing I considered, but I wanted to avoid a disk format
change if possible, and this small cache is pretty much just as good.

> > Instead, we can cache the minimum size which contains any extents. We do
> > so lazily; rather than guaranteeing that the cache contains the precise
> > minimum, it always contains a loose lower bound which we tighten when we
> > read or update a summary block. This only uses a few kilobytes of memory
> > and is already serialized via the realtime bitmap and summary inode
> > locks, so the cost is minimal. With this change, the same workload only
> > spends 0.2% of its CPU cycles in the realtime allocator.
>
> Good win.
>
> > @@ -1187,8 +1195,8 @@ xfs_rtmount_init(
> > }
> >
> > /*
> > - * Get the bitmap and summary inodes into the mount structure
> > - * at mount time.
> > + * Get the bitmap and summary inodes and the summary cache into the mount
> > + * structure at mount time.
> > */
> > int /* error */
> > xfs_rtmount_inodes(
> > @@ -1211,6 +1219,16 @@ xfs_rtmount_inodes(
> > return error;
> > }
> > ASSERT(mp->m_rsumip != NULL);
> > + /*
> > + * The rsum cache is initialized to all zeroes, which trivially
> > + * satisfies the invariant.
>
> Satisfies what invariant? Please explain why this is OK and how it
> interacts with the allocation code so we're not left guessing what
> this means in the future.
>
> Remember: comments are not for now - they are for 10 years time when
> no-one remembers what this code does or is actually used for.

It's the invariant I specified in the definition in xfs_mount_t, but
I'll clarify this.

> > + */
> > + mp->m_rsum_cache = kvzalloc(sbp->sb_rbmblocks, GFP_KERNEL);
>
> kmem_zalloc_large().

Will change.

> Which makes me ask, just how large is this allocation if you're
> using vmalloc() straight out of the box?

For our 256kB extent size, it's only ~1000 bytes. However, for the same
size filesystem with 4kB extents, it'd be ~60kB.

> > + if (!mp->m_rsum_cache) {
> > + xfs_irele(mp->m_rbmip);
> > + xfs_irele(mp->m_rsumip);
> > + return -ENOMEM;
> > + }
> > return 0;
> > }
> >
> > @@ -1218,6 +1236,7 @@ void
> > xfs_rtunmount_inodes(
> > struct xfs_mount *mp)
> > {
> > + kvfree(mp->m_rsum_cache);
>
> kmem_free().

Will change. Thanks for the review!
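The sizes quoted above follow directly from the summary geometry, and a
quick back-of-the-envelope sketch reproduces them (assuming 4 KiB
filesystem blocks and a decimal 8 TB device; these assumptions are mine,
chosen to match the thread's round numbers):

/* Sanity check of the full-summary-copy sizes quoted above. */
#include <stdio.h>

int main(void)
{
	unsigned long long fs_bytes = 8000ULL * 1000 * 1000 * 1000;
	unsigned int extsizes[] = { 256 << 10, 4 << 10 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long long rextents = fs_bytes / extsizes[i];
		/* one bit per extent, packed into 4 KiB bitmap blocks */
		unsigned long long rbmblocks = rextents / 8 / 4096;
		unsigned int levels = 1;	/* log2(rextents) + 1 */

		while (rextents >> levels)
			levels++;
		/* one u32 summary counter per (level, bitmap block) pair */
		printf("%3u kB extents: %u levels x %llu bitmap blocks x 4 = %llu bytes\n",
		       extsizes[i] >> 10, levels, rbmblocks,
		       4ULL * levels * rbmblocks);
	}
	return 0;
}

This prints roughly 93 kB for 256 kB extents and roughly 7.4 MB for 4 kB
extents, in line with the "100kB" and "about 7MB" figures above.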
On Tue, Nov 06, 2018 at 09:20:37AM -0800, Omar Sandoval wrote:
> On Tue, Nov 06, 2018 at 10:49:48AM +1100, Dave Chinner wrote:
> > On Fri, Nov 02, 2018 at 12:38:00PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@fb.com>
> > >
> > > The realtime summary is a two-dimensional array on disk, effectively:
> > >
> > > u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
> > >
> > > rsum[log][bbno] is the number of extents of size 2**log which start in
> > > bitmap block bbno.
> > >
> > > xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
> > > rsum[log][bbno] != 0 for any log level. However, the summary array is
> > > stored in row-major order (i.e., like an array in C), so all of these
> > > entries are not adjacent, but rather spread across the entire summary
> > > file. In the worst case (a full bitmap block), xfs_rtany_summary() has
> > > to check every level.
> > >
> > > This means that on a moderately-used realtime device, an allocation will
> > > waste a lot of time finding, reading, and releasing buffers for the
> > > realtime summary. In particular, one of our storage services (which runs
> > > on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
> > > spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
> > > xfs_trans_brelse() called from xfs_rtany_summary().
> >
> > Yup, the RT allocator is showing that it was never intended for
> > general purpose data storage workloads... :P
>
> Indeed. What they really want is the AG allocator with metadata on a
> separate device, but that sounds like a much bigger project.
>
> > So how much memory would it require to keep an in-memory copy of
> > the summary information? i.e. do an in-memory copy search, then once
> > the block is found, pull in the buffer that needs modifying and log
> > it? That gets rid of the buffer management overhead, and potentially
> > allows for more efficient search algorithms to be used.
>
> Yeah, I considered that. We use 256kB realtime extents for the 8 TB
> filesystem, so it's 100kB. If we were using 4kB realtime extents, it'd
> be about 7MB. So, it's doable but not the best.

Quite frankly, that's a small amount compared to the amount of
metadata we typically cache on an active filesystem. If we are really
talking about 1MB of RAM per TB of disk space at the worst case, then
that's an acceptable amount to spend on speeding up the allocator.
This patch is a good stepping stone - would you be able to look into
implementing a full in-memory summary cache?

[ Consider that we're reserving something like 16GB of disk space per
TB for rmap btrees - the disk space and memory overhead there is
substantial once the rmap tree grows - and that's why I think 1MB per
TB is almost inconsequential. ]

> > > One solution would be to swap the dimensions of the summary array so
> > > that different sizes on the same bitmap block are adjacent. However,
> > > this would require a disk format change to a very old component of XFS,
> > > and it would slow down xfs_rtallocate_extent_size().
> >
> > Or maybe we should put a second summary on disk, ordered the other
> > way similar to how the btree allocator freespace indexes are
> > organised.
>
> That's another thing I considered, but I wanted to avoid a disk format
> change if possible, and this small cache is pretty much just as good.

*nod*

> > > + mp->m_rsum_cache = kvzalloc(sbp->sb_rbmblocks, GFP_KERNEL);
> >
> > kmem_zalloc_large().
>
> Will change.
>
> > Which makes me ask, just how large is this allocation if you're
> > using vmalloc() straight out of the box?
>
> For our 256kB extent size, it's only ~1000 bytes. However, for the same
> size filesystem with 4kB extents, it'd be ~60kB.

Yup, I had a rough thought of around 64k for 8TB, but I wasn't sure I
got it all right in my head. Thanks for confirming.

Cheers,

Dave.
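The cache sizing is even simpler arithmetic: one byte per bitmap block.
Under the same assumptions as the sketch above (4 KiB blocks, decimal
8 TB device), it reproduces both figures:

/* The rsum cache costs one byte per realtime bitmap block. */
#include <stdio.h>

int main(void)
{
	unsigned long long fs_bytes = 8000ULL * 1000 * 1000 * 1000;
	unsigned int extsizes[] = { 256 << 10, 4 << 10 };
	int i;

	for (i = 0; i < 2; i++) {
		unsigned long long rextents = fs_bytes / extsizes[i];
		unsigned long long rbmblocks = rextents / 8 / 4096;

		printf("%3u kB extents: %llu-byte rsum cache\n",
		       extsizes[i] >> 10, rbmblocks);
	}
	return 0;
}

This prints 931 bytes for 256 kB extents and 59604 bytes for 4 kB
extents, matching the "~1000 bytes" and "~60kB" (and Dave's "around 64k")
estimates.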
On Wed, Nov 07, 2018 at 07:58:51AM +1100, Dave Chinner wrote:
> On Tue, Nov 06, 2018 at 09:20:37AM -0800, Omar Sandoval wrote:
> > On Tue, Nov 06, 2018 at 10:49:48AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 02, 2018 at 12:38:00PM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval <osandov@fb.com>
> > > >
> > > > The realtime summary is a two-dimensional array on disk, effectively:
> > > >
> > > > u32 rsum[log2(number of realtime extents) + 1][number of blocks in the bitmap]
> > > >
> > > > rsum[log][bbno] is the number of extents of size 2**log which start in
> > > > bitmap block bbno.
> > > >
> > > > xfs_rtallocate_extent_near() uses xfs_rtany_summary() to check whether
> > > > rsum[log][bbno] != 0 for any log level. However, the summary array is
> > > > stored in row-major order (i.e., like an array in C), so all of these
> > > > entries are not adjacent, but rather spread across the entire summary
> > > > file. In the worst case (a full bitmap block), xfs_rtany_summary() has
> > > > to check every level.
> > > >
> > > > This means that on a moderately-used realtime device, an allocation will
> > > > waste a lot of time finding, reading, and releasing buffers for the
> > > > realtime summary. In particular, one of our storage services (which runs
> > > > on servers with 8 very slow CPUs and 15 8 TB XFS realtime filesystems)
> > > > spends almost 5% of its CPU cycles in xfs_rtbuf_get() and
> > > > xfs_trans_brelse() called from xfs_rtany_summary().
> > >
> > > Yup, the RT allocator is showing that it was never intended for
> > > general purpose data storage workloads... :P
> >
> > Indeed. What they really want is the AG allocator with metadata on a
> > separate device, but that sounds like a much bigger project.
> >
> > > So how much memory would it require to keep an in-memory copy of
> > > the summary information? i.e. do an in-memory copy search, then once
> > > the block is found, pull in the buffer that needs modifying and log
> > > it? That gets rid of the buffer management overhead, and potentially
> > > allows for more efficient search algorithms to be used.
> >
> > Yeah, I considered that. We use 256kB realtime extents for the 8 TB
> > filesystem, so it's 100kB. If we were using 4kB realtime extents, it'd
> > be about 7MB. So, it's doable but not the best.
>
> Quite frankly, that's a small amount compared to the amount of
> metadata we typically cache on an active filesystem. If we are really
> talking about 1MB of RAM per TB of disk space at the worst case, then
> that's an acceptable amount to spend on speeding up the allocator.
> This patch is a good stepping stone - would you be able to look into
> implementing a full in-memory summary cache?

We noticed some other issues with the realtime allocator (namely,
using the "near" allocation strategy for new files leads to bad
fragmentation), so it is on my todo list to make it more intelligent
-- it's good to know that it'd be acceptable to cache much more.

v2 of this patch incoming.
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index b228c821bae6..6d4990717cee 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -505,6 +505,10 @@ xfs_rtmodify_summary_int(
 		uint first = (uint)((char *)sp - (char *)bp->b_addr);
 
 		*sp += delta;
+		if (*sp == 0 && log == mp->m_rsum_cache[bbno])
+			mp->m_rsum_cache[bbno]++;
+		if (*sp != 0 && log < mp->m_rsum_cache[bbno])
+			mp->m_rsum_cache[bbno] = log;
 		xfs_trans_log_buf(tp, bp, first, first + sizeof(*sp) - 1);
 	}
 	if (sum)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7964513c3128..2b626a4b4824 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -89,6 +89,12 @@ typedef struct xfs_mount {
 	int			m_logbsize;	/* size of each log buffer */
 	uint			m_rsumlevels;	/* rt summary levels */
 	uint			m_rsumsize;	/* size of rt summary, bytes */
+	/*
+	 * Cache of rt summary level per bitmap block with the invariant that
+	 * m_rsum_cache[bbno] <= the minimum i for which rsum[i][bbno] != 0.
+	 * Reads and writes are serialized by the rsumip inode lock.
+	 */
+	uint8_t			*m_rsum_cache;
 	struct xfs_inode	*m_rbmip;	/* pointer to bitmap inode */
 	struct xfs_inode	*m_rsumip;	/* pointer to summary inode */
 	struct xfs_inode	*m_rootip;	/* pointer to root directory */
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 926ed314ffba..351bdc6a84cb 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -64,8 +64,12 @@ xfs_rtany_summary(
 	int		log;	/* loop counter, log2 of ext. size */
 	xfs_suminfo_t	sum;	/* summary data */
 
+	/* There are no extents at levels < m_rsum_cache[bbno]. */
+	if (low < mp->m_rsum_cache[bbno])
+		low = mp->m_rsum_cache[bbno];
+
 	/*
-	 * Loop over logs of extent sizes.  Order is irrelevant.
+	 * Loop over logs of extent sizes.
 	 */
 	for (log = low; log <= high; log++) {
 		/*
@@ -80,13 +84,17 @@ xfs_rtany_summary(
 		 */
 		if (sum) {
 			*stat = 1;
-			return 0;
+			goto out;
 		}
 	}
 	/*
 	 * Found nothing, return failure.
 	 */
 	*stat = 0;
+out:
+	/* There were no extents at levels < log. */
+	if (log > mp->m_rsum_cache[bbno])
+		mp->m_rsum_cache[bbno] = log;
 	return 0;
 }
 
@@ -1187,8 +1195,8 @@ xfs_rtmount_init(
 }
 
 /*
- * Get the bitmap and summary inodes into the mount structure
- * at mount time.
+ * Get the bitmap and summary inodes and the summary cache into the mount
+ * structure at mount time.
  */
 int					/* error */
 xfs_rtmount_inodes(
@@ -1211,6 +1219,16 @@ xfs_rtmount_inodes(
 		return error;
 	}
 	ASSERT(mp->m_rsumip != NULL);
+	/*
+	 * The rsum cache is initialized to all zeroes, which trivially
+	 * satisfies the invariant.
+	 */
+	mp->m_rsum_cache = kvzalloc(sbp->sb_rbmblocks, GFP_KERNEL);
+	if (!mp->m_rsum_cache) {
+		xfs_irele(mp->m_rbmip);
+		xfs_irele(mp->m_rsumip);
+		return -ENOMEM;
+	}
 	return 0;
 }
 
@@ -1218,6 +1236,7 @@ void
 xfs_rtunmount_inodes(
 	struct xfs_mount	*mp)
 {
+	kvfree(mp->m_rsum_cache);
 	if (mp->m_rbmip)
 		xfs_irele(mp->m_rbmip);
 	if (mp->m_rsumip)
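For readers following the invariant discussion above, here is a
self-contained userspace model of the patch's lazy lower-bound logic (an
approximation with made-up names, not the kernel code itself): the cache
may be lower than the true minimum non-empty level, but never higher, so
lookups can always skip levels below it, and a zero-filled cache at mount
time trivially satisfies that bound.

/*
 * Userspace model (illustrative, not kernel code) of the lazy cache
 * invariant: rsum_cache[bbno] <= min { log : rsum[log][bbno] != 0 }.
 */
#include <assert.h>
#include <stdint.h>

#define LEVELS	26
#define BBLOCKS	1024

static uint32_t rsum[LEVELS][BBLOCKS];	/* modeled summary counts */
static uint8_t rsum_cache[BBLOCKS];	/* zero-filled: trivially a lower bound */

/* Mirrors the xfs_rtmodify_summary_int() hunk: keep the bound valid on update. */
static void modify_summary(int log, int bbno, int delta)
{
	rsum[log][bbno] += delta;
	if (rsum[log][bbno] == 0 && log == rsum_cache[bbno])
		rsum_cache[bbno]++;	/* lowest level emptied; bound may rise */
	if (rsum[log][bbno] != 0 && log < rsum_cache[bbno])
		rsum_cache[bbno] = log;	/* new extent below the bound; lower it */
}

/* Mirrors xfs_rtany_summary(): skip levels below the cached bound. */
static int any_summary(int low, int high, int bbno)
{
	int log;

	if (low < rsum_cache[bbno])
		low = rsum_cache[bbno];
	for (log = low; log <= high; log++)
		if (rsum[log][bbno])
			break;
	if (log > rsum_cache[bbno])
		rsum_cache[bbno] = log;	/* nothing found below log; tighten */
	return log <= high;
}

int main(void)
{
	modify_summary(3, 5, 1);	/* a size-2^3 extent starts in bblock 5 */
	assert(any_summary(0, 10, 5));	/* found; cache[5] tightens 0 -> 3 */
	assert(rsum_cache[5] == 3);
	modify_summary(3, 5, -1);	/* extent removed; cache[5] bumps to 4 */
	assert(!any_summary(0, 10, 5));	/* nothing left at any level */
	return 0;
}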