Message ID | 20191009032124.10541-5-david@fromorbit.com (mailing list archive)
---|---
State | New, archived
Series | mm, xfs: non-blocking inode reclaim
On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The buffer cache shrinker frees more than just the xfs_buf slab > objects - it also frees the pages attached to the buffers. Make sure > the memory reclaim code accounts for this memory being freed > correctly, similar to how the inode shrinker accounts for pages > freed from the page cache due to mapping invalidation. > > We also need to make sure that the mm subsystem knows these are > reclaimable objects. We provide the memory reclaim subsystem with a > a shrinker to reclaim xfs_bufs, so we should really mark the slab > that way. > > We also have a lot of xfs_bufs in a busy system, spread them around > like we do inodes. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- Seems reasonable, but for inodes we also spread the ili zone. Should we not be consistent with bli's as well? Brian > fs/xfs/xfs_buf.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index e484f6bead53..45b470f55ad7 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -324,6 +324,9 @@ xfs_buf_free( > > __free_page(page); > } > + if (current->reclaim_state) > + current->reclaim_state->reclaimed_slab += > + bp->b_page_count; > } else if (bp->b_flags & _XBF_KMEM) > kmem_free(bp->b_addr); > _xfs_buf_free_pages(bp); > @@ -2064,7 +2067,8 @@ int __init > xfs_buf_init(void) > { > xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", > - KM_ZONE_HWALIGN, NULL); > + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM, > + NULL); > if (!xfs_buf_zone) > goto out; > > -- > 2.23.0.rc1 >
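For context, the accounting the patch adds in xfs_buf_free() is consumed on the memory-reclaim side roughly as follows. This is an approximate sketch of the mm/vmscan.c shrink_node() pattern from kernels of that era, wrapped in a hypothetical helper name for illustration; it is not a quote of any specific release:

```c
/*
 * Approximate sketch of how the accounting added in xfs_buf_free() is
 * consumed on the reclaim side (paraphrased from the mm/vmscan.c
 * shrink_node() pattern of that era; helper name is illustrative).
 */
static void account_shrinker_reclaim(struct scan_control *sc)
{
	struct reclaim_state *reclaim_state = current->reclaim_state;

	/*
	 * Pages released by shrinkers - such as the backing pages that
	 * xfs_buf_free() hands back - are reported via
	 * reclaim_state->reclaimed_slab and folded into the reclaim
	 * progress counter, so direct reclaim knows it freed memory even
	 * though no LRU pages moved.
	 */
	if (reclaim_state) {
		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
		reclaim_state->reclaimed_slab = 0;
	}
}
```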
On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote: > Seems reasonable, but for inodes we also spread the ili zone. Should we > not be consistent with bli's as well? Btw, can we please kill off the stupid KM_ZONE_* flags? We only use each of them in less than a handful of places, and they make reading the code much harder.
On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote: > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > objects - it also frees the pages attached to the buffers. Make sure > > the memory reclaim code accounts for this memory being freed > > correctly, similar to how the inode shrinker accounts for pages > > freed from the page cache due to mapping invalidation. > > > > We also need to make sure that the mm subsystem knows these are > > reclaimable objects. We provide the memory reclaim subsystem with a > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > that way. > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > like we do inodes. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > Seems reasonable, but for inodes we also spread the ili zone. Should we > not be consistent with bli's as well? bli's are reclaimed when the buffer is cleaned. ili's live for the live of the inode in cache. Hence bli's are short term allocations (much shorter than xfs_bufs they attach to) and are reclaimed much faster than inodes and their ilis. There's also a lot less blis than ili's, so the spread of their footprint across memory nodes doesn't matter that much. Local access for the memcpy during formatting is probably more important than spreading the memory usage of them these days, anyway. Cheers, Dave.
On Fri, Oct 11, 2019 at 05:57:35AM -0700, Christoph Hellwig wrote: > On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote: > > Seems reasonable, but for inodes we also spread the ili zone. Should we > > not be consistent with bli's as well? > > Btw, can we please kill off the stupid KM_ZONE_* flags? We only use > each of them in less than a handful of places, and they make reading the > code much harder. Separate patchset - it's not relevant to this one, which is already complex enough... Cheers, Dave.
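For illustration, if the wrappers were dropped as Christoph suggests, the xfs_buf_init() hunk in this patch would reduce to a direct kmem_cache_create() call along these lines. This is a hypothetical sketch assuming the KM_ZONE_* flags are thin aliases for the SLAB_* flags shown; it is not the cleanup patch that was eventually written:

```c
/*
 * Hypothetical open-coded form of the xfs_buf_init() hunk with the
 * KM_ZONE_* wrappers removed (assumes the flags alias SLAB_HWCACHE_ALIGN,
 * SLAB_MEM_SPREAD and SLAB_RECLAIM_ACCOUNT; a sketch, not a merged patch).
 */
xfs_buf_zone = kmem_cache_create("xfs_buf", sizeof(struct xfs_buf), 0,
				 SLAB_HWCACHE_ALIGN | SLAB_MEM_SPREAD |
				 SLAB_RECLAIM_ACCOUNT, NULL);
if (!xfs_buf_zone)
	goto out;
```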
On Sat, Oct 12, 2019 at 10:13:23AM +1100, Dave Chinner wrote: > On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote: > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > > objects - it also frees the pages attached to the buffers. Make sure > > > the memory reclaim code accounts for this memory being freed > > > correctly, similar to how the inode shrinker accounts for pages > > > freed from the page cache due to mapping invalidation. > > > > > > We also need to make sure that the mm subsystem knows these are > > > reclaimable objects. We provide the memory reclaim subsystem with a > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > > that way. > > > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > > like we do inodes. > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > > Seems reasonable, but for inodes we also spread the ili zone. Should we > > not be consistent with bli's as well? > > bli's are reclaimed when the buffer is cleaned. ili's live for the > live of the inode in cache. Hence bli's are short term allocations > (much shorter than xfs_bufs they attach to) and are reclaimed much > faster than inodes and their ilis. There's also a lot less blis than > ili's, so the spread of their footprint across memory nodes doesn't > matter that much. Local access for the memcpy during formatting is > probably more important than spreading the memory usage of them > these days, anyway. > Yes, the buffer/inode lifecycle difference is why why I presume bli zones are not ZONE_RECLAIM like ili zones. This doesn't tell me anything about why buffers should be spread around as such and buffer log items not, though.. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Sat, Oct 12, 2019 at 08:05:58AM -0400, Brian Foster wrote: > On Sat, Oct 12, 2019 at 10:13:23AM +1100, Dave Chinner wrote: > > On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote: > > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > > > objects - it also frees the pages attached to the buffers. Make sure > > > > the memory reclaim code accounts for this memory being freed > > > > correctly, similar to how the inode shrinker accounts for pages > > > > freed from the page cache due to mapping invalidation. > > > > > > > > We also need to make sure that the mm subsystem knows these are > > > > reclaimable objects. We provide the memory reclaim subsystem with a > > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > > > that way. > > > > > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > > > like we do inodes. > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > --- > > > > > > Seems reasonable, but for inodes we also spread the ili zone. Should we > > > not be consistent with bli's as well? > > > > bli's are reclaimed when the buffer is cleaned. ili's live for the > > live of the inode in cache. Hence bli's are short term allocations > > (much shorter than xfs_bufs they attach to) and are reclaimed much > > faster than inodes and their ilis. There's also a lot less blis than > > ili's, so the spread of their footprint across memory nodes doesn't > > matter that much. Local access for the memcpy during formatting is > > probably more important than spreading the memory usage of them > > these days, anyway. > > > > Yes, the buffer/inode lifecycle difference is why why I presume bli > zones are not ZONE_RECLAIM like ili zones. No, that is not the case. IO completion cleaning the buffer is what frees the bli. The ili can only be freed by reclaiming the inode, so it's memory that can only be returned to the free pool by running a shrinker. Hence ilis are ZONE_RECLAIM to account them as memory that can be reclaimed through shrinker invocation, while BLIs are not because memory reclaim can't directly cause them to be freed. > This doesn't tell me anything about why buffers should be spread > around as such and buffer log items not, though.. xfs_bufs are long lived, are global structures, and can accumulate in the millions if the workload requires it. IOWs, we should spread xfs_bufs for exactly the same reasons inodes are spread. As for BLIs, they are short term structures - a single xfs_buf might have thousands of different blis attached to it over it's life in the cache because the BLI is freed when the buffer is cleaned. We don't need to spread small short term structures around NUMA memory nodes because they don't present a long term memory imbalance vector. In general it is better to have them allocated local to the process that is using them where the memory access latency is lowest, knowing that they will be freed shortly and not contribute to long term memory usage. Cheers, Dave.
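The lifecycle split Dave describes shows up directly in how the two zones were set up. The following is a rough contrast paraphrased from the fs/xfs/xfs_super.c initialisation of that era; the exact flag set should be treated as approximate rather than authoritative:

```c
/*
 * Rough contrast of the two zone initialisations being discussed
 * (paraphrased from fs/xfs/xfs_super.c around this time; approximate).
 */

/*
 * Inode log items live for the life of the cached inode and can only be
 * freed via inode reclaim, so they are spread and accounted reclaimable.
 */
xfs_ili_zone = kmem_zone_init_flags(sizeof(struct xfs_inode_log_item),
				    "xfs_ili",
				    KM_ZONE_SPREAD | KM_ZONE_RECLAIM, NULL);

/*
 * Buffer log items are freed as soon as the buffer is cleaned at IO
 * completion, so they stay node-local with no spread/reclaim flags.
 */
xfs_buf_item_zone = kmem_zone_init(sizeof(struct xfs_buf_log_item),
				   "xfs_buf_item");
```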
On Sun, Oct 13, 2019 at 02:14:50PM +1100, Dave Chinner wrote: > On Sat, Oct 12, 2019 at 08:05:58AM -0400, Brian Foster wrote: > > On Sat, Oct 12, 2019 at 10:13:23AM +1100, Dave Chinner wrote: > > > On Fri, Oct 11, 2019 at 08:39:39AM -0400, Brian Foster wrote: > > > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > > > > objects - it also frees the pages attached to the buffers. Make sure > > > > > the memory reclaim code accounts for this memory being freed > > > > > correctly, similar to how the inode shrinker accounts for pages > > > > > freed from the page cache due to mapping invalidation. > > > > > > > > > > We also need to make sure that the mm subsystem knows these are > > > > > reclaimable objects. We provide the memory reclaim subsystem with a > > > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > > > > that way. > > > > > > > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > > > > like we do inodes. > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > --- > > > > > > > > Seems reasonable, but for inodes we also spread the ili zone. Should we > > > > not be consistent with bli's as well? > > > > > > bli's are reclaimed when the buffer is cleaned. ili's live for the > > > live of the inode in cache. Hence bli's are short term allocations > > > (much shorter than xfs_bufs they attach to) and are reclaimed much > > > faster than inodes and their ilis. There's also a lot less blis than > > > ili's, so the spread of their footprint across memory nodes doesn't > > > matter that much. Local access for the memcpy during formatting is > > > probably more important than spreading the memory usage of them > > > these days, anyway. > > > > > > > Yes, the buffer/inode lifecycle difference is why why I presume bli > > zones are not ZONE_RECLAIM like ili zones. > > No, that is not the case. IO completion cleaning the buffer is what > frees the bli. The ili can only be freed by reclaiming the inode, so > it's memory that can only be returned to the free pool by running a > shrinker. Hence ilis are ZONE_RECLAIM to account them as memory that > can be reclaimed through shrinker invocation, while BLIs are not > because memory reclaim can't directly cause them to be freed. > That is pretty much what I said. I think we're in agreement. > > This doesn't tell me anything about why buffers should be spread > > around as such and buffer log items not, though.. > > xfs_bufs are long lived, are global structures, and can accumulate > in the millions if the workload requires it. IOWs, we should spread > xfs_bufs for exactly the same reasons inodes are spread. > > As for BLIs, they are short term structures - a single xfs_buf might > have thousands of different blis attached to it over it's life in > the cache because the BLI is freed when the buffer is cleaned. > Short term relative to the ILI perhaps, but these are still memory allocations that outlive the allocating task returning to userspace, are reused (across tasks) commonly enough and have an I/O bound life cycle. That's also not considering page/slab buildup in the kmem cache beyond the lifetime of individual allocations.. > We don't need to spread small short term structures around NUMA > memory nodes because they don't present a long term memory imbalance > vector. 
In general it is better to have them allocated local to the > process that is using them where the memory access latency is > lowest, knowing that they will be freed shortly and not contribute > to long term memory usage. > Hmm.. doesn't this all depend on enablement of a cgroup knob in the first place? It looks to me that this behavior is tied to a per-task state (not a per-mount or zone setting, which just allows such behavior on the zone) where the controller has explicitly requested us to not perform sustained allocations in the local node if possible. Instead, spread slab allocations around at the cost of bypassing this local allocation heuristic, presumably because $application wants prioritized access to that memory. What am I missing? BTW, it also looks like this is only relevant for slab. I don't see any references in slub (or slob), but I haven't dug too deeply.. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
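The per-task gate Brian refers to looks roughly like this. It is a simplified paraphrase of the SLAB allocator / cpuset logic of that era, condensed into one illustrative helper rather than the literal upstream functions:

```c
/*
 * Paraphrased sketch of the per-task check Brian describes (condensed from
 * the SLAB/cpuset code of that era; simplified, not literal upstream code).
 * SLAB_MEM_SPREAD only changes the allocation node when the allocating
 * task's cpuset has memory_spread_slab enabled.
 */
static int pick_slab_alloc_node(struct kmem_cache *cachep)
{
	/* cpuset_do_slab_mem_spread() tests the task's spread-slab flag */
	if (cpuset_do_slab_mem_spread() && (cachep->flags & SLAB_MEM_SPREAD))
		return cpuset_slab_spread_node();  /* rotate over cpuset nodes */

	return numa_mem_id();	/* default: node-local allocation */
}
```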
On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The buffer cache shrinker frees more than just the xfs_buf slab > objects - it also frees the pages attached to the buffers. Make sure > the memory reclaim code accounts for this memory being freed > correctly, similar to how the inode shrinker accounts for pages > freed from the page cache due to mapping invalidation. > > We also need to make sure that the mm subsystem knows these are > reclaimable objects. We provide the memory reclaim subsystem with a > a shrinker to reclaim xfs_bufs, so we should really mark the slab > that way. > > We also have a lot of xfs_bufs in a busy system, spread them around > like we do inodes. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_buf.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index e484f6bead53..45b470f55ad7 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c > @@ -324,6 +324,9 @@ xfs_buf_free( > > __free_page(page); > } > + if (current->reclaim_state) > + current->reclaim_state->reclaimed_slab += > + bp->b_page_count; Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together. > } else if (bp->b_flags & _XBF_KMEM) > kmem_free(bp->b_addr); > _xfs_buf_free_pages(bp); > @@ -2064,7 +2067,8 @@ int __init > xfs_buf_init(void) > { > xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", > - KM_ZONE_HWALIGN, NULL); > + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM, I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply to a particular "use another node" memory policy when slab is in use. Was that your intent? --D > + NULL); > if (!xfs_buf_zone) > goto out; > > -- > 2.23.0.rc1 >
On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote: > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > objects - it also frees the pages attached to the buffers. Make sure > > the memory reclaim code accounts for this memory being freed > > correctly, similar to how the inode shrinker accounts for pages > > freed from the page cache due to mapping invalidation. > > > > We also need to make sure that the mm subsystem knows these are > > reclaimable objects. We provide the memory reclaim subsystem with a > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > that way. > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > like we do inodes. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_buf.c | 6 +++++- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index e484f6bead53..45b470f55ad7 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -324,6 +324,9 @@ xfs_buf_free( > > > > __free_page(page); > > } > > + if (current->reclaim_state) > > + current->reclaim_state->reclaimed_slab += > > + bp->b_page_count; > > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together. > > > } else if (bp->b_flags & _XBF_KMEM) > > kmem_free(bp->b_addr); > > _xfs_buf_free_pages(bp); > > @@ -2064,7 +2067,8 @@ int __init > > xfs_buf_init(void) > > { > > xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", > > - KM_ZONE_HWALIGN, NULL); > > + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM, > > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply > to a particular "use another node" memory policy when slab is in use. > Was that your intent? It's more documentation than anything - that we shouldn't be piling these structures all on to one node because that can have severe issues with NUMA memory reclaim algorithms. i.e. the xfs-buf shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single node can reclaim all the xfs-bufs on that node without touching any other node. That means, for example, if we instantiate all the AG header buffers on a single node (e.g. like we do at mount time) then memory pressure on that one node will generate IO stalls across the entire filesystem as other nodes doing work have to repopulate the buffer cache for any allocation for freeing of space/inodes.. IOWs, for large NUMA systems using cpusets this cache should be spread around all of memory, especially as it has NUMA aware reclaim. For everyone else, it's just documentation that improper cgroup or NUMA memory policy could cause you all sorts of problems with this cache. It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in filesystems for inode caches largely because, at the time (~2006), the only reclaimable cache that could grow to any size large enough to cause problems was the inode cache. It's been cargo-culted ever since, whether it is needed or not (e.g. ceph). In the case of the xfs_bufs, I've been running workloads recently that cache several million xfs_bufs and only a handful of inodes rather than the other way around. If we spread inodes because caching millions on a single node can cause problems on large NUMA machines, then we also need to spread xfs_bufs... Cheers, Dave.
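The NUMA-aware reclaim Dave mentions is wired up roughly as below. This is condensed from fs/xfs/xfs_buf.c of that era with some details elided; with SHRINKER_NUMA_AWARE set, sc->nid identifies the node under pressure and the per-node list_lru restricts counting and scanning to buffers resident on that node:

```c
/*
 * Condensed sketch of the NUMA-aware buffer cache shrinker (from
 * fs/xfs/xfs_buf.c of that era; some details elided).
 */
static unsigned long
xfs_buftarg_shrink_count(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	struct xfs_buftarg	*btp = container_of(shrink,
					struct xfs_buftarg, bt_shrinker);

	/* Reports only objects resident on the node named by sc->nid. */
	return list_lru_shrink_count(&btp->bt_lru, sc);
}

/* ...and at buftarg setup time: */
btp->bt_shrinker.count_objects	= xfs_buftarg_shrink_count;
btp->bt_shrinker.scan_objects	= xfs_buftarg_shrink_scan;
btp->bt_shrinker.seeks		= DEFAULT_SEEKS;
btp->bt_shrinker.flags		= SHRINKER_NUMA_AWARE;
```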
On Thu, Oct 31, 2019 at 08:43:35AM +1100, Dave Chinner wrote: > On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote: > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > > objects - it also frees the pages attached to the buffers. Make sure > > > the memory reclaim code accounts for this memory being freed > > > correctly, similar to how the inode shrinker accounts for pages > > > freed from the page cache due to mapping invalidation. > > > > > > We also need to make sure that the mm subsystem knows these are > > > reclaimable objects. We provide the memory reclaim subsystem with a > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > > that way. > > > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > > like we do inodes. > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_buf.c | 6 +++++- > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > index e484f6bead53..45b470f55ad7 100644 > > > --- a/fs/xfs/xfs_buf.c > > > +++ b/fs/xfs/xfs_buf.c > > > @@ -324,6 +324,9 @@ xfs_buf_free( > > > > > > __free_page(page); > > > } > > > + if (current->reclaim_state) > > > + current->reclaim_state->reclaimed_slab += > > > + bp->b_page_count; > > > > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together. > > > > > } else if (bp->b_flags & _XBF_KMEM) > > > kmem_free(bp->b_addr); > > > _xfs_buf_free_pages(bp); > > > @@ -2064,7 +2067,8 @@ int __init > > > xfs_buf_init(void) > > > { > > > xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", > > > - KM_ZONE_HWALIGN, NULL); > > > + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM, > > > > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply > > to a particular "use another node" memory policy when slab is in use. > > Was that your intent? > > It's more documentation than anything - that we shouldn't be piling > these structures all on to one node because that can have severe > issues with NUMA memory reclaim algorithms. i.e. the xfs-buf > shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single > node can reclaim all the xfs-bufs on that node without touching any > other node. > > That means, for example, if we instantiate all the AG header buffers > on a single node (e.g. like we do at mount time) then memory > pressure on that one node will generate IO stalls across the entire > filesystem as other nodes doing work have to repopulate the buffer > cache for any allocation for freeing of space/inodes.. > > IOWs, for large NUMA systems using cpusets this cache should be > spread around all of memory, especially as it has NUMA aware > reclaim. For everyone else, it's just documentation that improper > cgroup or NUMA memory policy could cause you all sorts of problems > with this cache. > > It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in > filesystems for inode caches largely because, at the time (~2006), > the only reclaimable cache that could grow to any size large enough > to cause problems was the inode cache. It's been cargo-culted ever > since, whether it is needed or not (e.g. ceph). > > In the case of the xfs_bufs, I've been running workloads recently > that cache several million xfs_bufs and only a handful of inodes > rather than the other way around. 
If we spread inodes because > caching millions on a single node can cause problems on large NUMA > machines, then we also need to spread xfs_bufs... Hmm, could we capture this as a comment somewhere? --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote: > On Thu, Oct 31, 2019 at 08:43:35AM +1100, Dave Chinner wrote: > > On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote: > > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > > > objects - it also frees the pages attached to the buffers. Make sure > > > > the memory reclaim code accounts for this memory being freed > > > > correctly, similar to how the inode shrinker accounts for pages > > > > freed from the page cache due to mapping invalidation. > > > > > > > > We also need to make sure that the mm subsystem knows these are > > > > reclaimable objects. We provide the memory reclaim subsystem with a > > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > > > that way. > > > > > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > > > like we do inodes. > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > --- > > > > fs/xfs/xfs_buf.c | 6 +++++- > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > > index e484f6bead53..45b470f55ad7 100644 > > > > --- a/fs/xfs/xfs_buf.c > > > > +++ b/fs/xfs/xfs_buf.c > > > > @@ -324,6 +324,9 @@ xfs_buf_free( > > > > > > > > __free_page(page); > > > > } > > > > + if (current->reclaim_state) > > > > + current->reclaim_state->reclaimed_slab += > > > > + bp->b_page_count; > > > > > > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together. > > > > > > > } else if (bp->b_flags & _XBF_KMEM) > > > > kmem_free(bp->b_addr); > > > > _xfs_buf_free_pages(bp); > > > > @@ -2064,7 +2067,8 @@ int __init > > > > xfs_buf_init(void) > > > > { > > > > xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", > > > > - KM_ZONE_HWALIGN, NULL); > > > > + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM, > > > > > > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply > > > to a particular "use another node" memory policy when slab is in use. > > > Was that your intent? > > > > It's more documentation than anything - that we shouldn't be piling > > these structures all on to one node because that can have severe > > issues with NUMA memory reclaim algorithms. i.e. the xfs-buf > > shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single > > node can reclaim all the xfs-bufs on that node without touching any > > other node. > > > > That means, for example, if we instantiate all the AG header buffers > > on a single node (e.g. like we do at mount time) then memory > > pressure on that one node will generate IO stalls across the entire > > filesystem as other nodes doing work have to repopulate the buffer > > cache for any allocation for freeing of space/inodes.. > > > > IOWs, for large NUMA systems using cpusets this cache should be > > spread around all of memory, especially as it has NUMA aware > > reclaim. For everyone else, it's just documentation that improper > > cgroup or NUMA memory policy could cause you all sorts of problems > > with this cache. > > > > It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in > > filesystems for inode caches largely because, at the time (~2006), > > the only reclaimable cache that could grow to any size large enough > > to cause problems was the inode cache. 
It's been cargo-culted ever > > since, whether it is needed or not (e.g. ceph). > > > > In the case of the xfs_bufs, I've been running workloads recently > > that cache several million xfs_bufs and only a handful of inodes > > rather than the other way around. If we spread inodes because > > caching millions on a single node can cause problems on large NUMA > > machines, then we also need to spread xfs_bufs... > > Hmm, could we capture this as a comment somewhere? Sure, but where? We're planning on getting rid of the KM_ZONE flags in the near future, and most of this is specific to the impacts on XFS. I could put it in xfs-super.c above where we initialise all the slabs, I guess. Probably a separate patch, though.... Cheers, Dave.
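One possible shape for the xfs_super.c comment being discussed, with wording drawn from this thread only; this is hypothetical and not what was eventually merged:

```c
/*
 * Hypothetical sketch of the comment above the zone initialisation:
 *
 * Long-lived, reclaimable metadata caches (inodes, ilis, xfs_bufs) are
 * marked KM_ZONE_SPREAD | KM_ZONE_RECLAIM so large caches do not
 * concentrate on a single NUMA node, where node-local memory pressure
 * and NUMA-aware shrinker reclaim could otherwise cause filesystem-wide
 * IO stalls. Short-lived log items (e.g. blis) stay node-local because
 * they are freed when the object is cleaned and do not build up a
 * long-term memory footprint.
 */
```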
On Fri, Nov 01, 2019 at 07:50:49AM +1100, Dave Chinner wrote: > On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote: > > On Thu, Oct 31, 2019 at 08:43:35AM +1100, Dave Chinner wrote: > > > On Wed, Oct 30, 2019 at 10:25:17AM -0700, Darrick J. Wong wrote: > > > > On Wed, Oct 09, 2019 at 02:21:02PM +1100, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > The buffer cache shrinker frees more than just the xfs_buf slab > > > > > objects - it also frees the pages attached to the buffers. Make sure > > > > > the memory reclaim code accounts for this memory being freed > > > > > correctly, similar to how the inode shrinker accounts for pages > > > > > freed from the page cache due to mapping invalidation. > > > > > > > > > > We also need to make sure that the mm subsystem knows these are > > > > > reclaimable objects. We provide the memory reclaim subsystem with a > > > > > a shrinker to reclaim xfs_bufs, so we should really mark the slab > > > > > that way. > > > > > > > > > > We also have a lot of xfs_bufs in a busy system, spread them around > > > > > like we do inodes. > > > > > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > --- > > > > > fs/xfs/xfs_buf.c | 6 +++++- > > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > > > > index e484f6bead53..45b470f55ad7 100644 > > > > > --- a/fs/xfs/xfs_buf.c > > > > > +++ b/fs/xfs/xfs_buf.c > > > > > @@ -324,6 +324,9 @@ xfs_buf_free( > > > > > > > > > > __free_page(page); > > > > > } > > > > > + if (current->reclaim_state) > > > > > + current->reclaim_state->reclaimed_slab += > > > > > + bp->b_page_count; > > > > > > > > Hmm, ok, I see how ZONE_RECLAIM and reclaimed_slab fit together. > > > > > > > > > } else if (bp->b_flags & _XBF_KMEM) > > > > > kmem_free(bp->b_addr); > > > > > _xfs_buf_free_pages(bp); > > > > > @@ -2064,7 +2067,8 @@ int __init > > > > > xfs_buf_init(void) > > > > > { > > > > > xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", > > > > > - KM_ZONE_HWALIGN, NULL); > > > > > + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM, > > > > > > > > I guess I'm fine with ZONE_SPREAD too, insofar as it only seems to apply > > > > to a particular "use another node" memory policy when slab is in use. > > > > Was that your intent? > > > > > > It's more documentation than anything - that we shouldn't be piling > > > these structures all on to one node because that can have severe > > > issues with NUMA memory reclaim algorithms. i.e. the xfs-buf > > > shrinker sets SHRINKER_NUMA_AWARE, so memory pressure on a single > > > node can reclaim all the xfs-bufs on that node without touching any > > > other node. > > > > > > That means, for example, if we instantiate all the AG header buffers > > > on a single node (e.g. like we do at mount time) then memory > > > pressure on that one node will generate IO stalls across the entire > > > filesystem as other nodes doing work have to repopulate the buffer > > > cache for any allocation for freeing of space/inodes.. > > > > > > IOWs, for large NUMA systems using cpusets this cache should be > > > spread around all of memory, especially as it has NUMA aware > > > reclaim. For everyone else, it's just documentation that improper > > > cgroup or NUMA memory policy could cause you all sorts of problems > > > with this cache. 
> > > > > > It's worth noting that SLAB_MEM_SPREAD is used almost exclusively in > > > filesystems for inode caches largely because, at the time (~2006), > > > the only reclaimable cache that could grow to any size large enough > > > to cause problems was the inode cache. It's been cargo-culted ever > > > since, whether it is needed or not (e.g. ceph). > > > > > > In the case of the xfs_bufs, I've been running workloads recently > > > that cache several million xfs_bufs and only a handful of inodes > > > rather than the other way around. If we spread inodes because > > > caching millions on a single node can cause problems on large NUMA > > > machines, then we also need to spread xfs_bufs... > > > > Hmm, could we capture this as a comment somewhere? > > Sure, but where? We're planning on getting rid of the KM_ZONE flags > in the near future, and most of this is specific to the impacts on > XFS. I could put it in xfs-super.c above where we initialise all the > slabs, I guess. Probably a separate patch, though.... Sounds like a reasonable place (to me) to record the fact that we want inodes and metadata buffers not to end up concentrating on a single node. At least until we start having NUMA systems with a separate "IO node" in which to confine all the IO threads and whatnot <shudder>. :P --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Thu, Oct 31, 2019 at 02:05:51PM -0700, Darrick J. Wong wrote: > Sounds like a reasonable place (to me) to record the fact that we want > inodes and metadata buffers not to end up concentrating on a single node. > > At least until we start having NUMA systems with a separate "IO node" in > which to confine all the IO threads and whatnot <shudder>. :P Wait until someone does a Linux port to the Cray T3E ;-)
On Thu, Oct 31, 2019 at 02:05:51PM -0700, Darrick J. Wong wrote: > On Fri, Nov 01, 2019 at 07:50:49AM +1100, Dave Chinner wrote: > > On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote: > > > > In the case of the xfs_bufs, I've been running workloads recently > > > > that cache several million xfs_bufs and only a handful of inodes > > > > rather than the other way around. If we spread inodes because > > > > caching millions on a single node can cause problems on large NUMA > > > > machines, then we also need to spread xfs_bufs... > > > > > > Hmm, could we capture this as a comment somewhere? > > > > Sure, but where? We're planning on getting rid of the KM_ZONE flags > > in the near future, and most of this is specific to the impacts on > > XFS. I could put it in xfs-super.c above where we initialise all the > > slabs, I guess. Probably a separate patch, though.... > > Sounds like a reasonable place (to me) to record the fact that we want > inodes and metadata buffers not to end up concentrating on a single node. Ok. I'll add yet another patch to the preliminary part of the series. Any plans to take any of these first few patches in this cycle? > At least until we start having NUMA systems with a separate "IO node" in > which to confine all the IO threads and whatnot <shudder>. :P Been there, done that, got the t-shirt and wore it out years ago. IO-only nodes (either via software configuration, or real cpu/memory-less IO nodes) are one of the reasons we don't want node-local allocation behaviour for large NUMA configs... Cheers, Dave.
On Mon, Nov 04, 2019 at 08:26:50AM +1100, Dave Chinner wrote: > On Thu, Oct 31, 2019 at 02:05:51PM -0700, Darrick J. Wong wrote: > > On Fri, Nov 01, 2019 at 07:50:49AM +1100, Dave Chinner wrote: > > > On Wed, Oct 30, 2019 at 08:06:58PM -0700, Darrick J. Wong wrote: > > > > > In the case of the xfs_bufs, I've been running workloads recently > > > > > that cache several million xfs_bufs and only a handful of inodes > > > > > rather than the other way around. If we spread inodes because > > > > > caching millions on a single node can cause problems on large NUMA > > > > > machines, then we also need to spread xfs_bufs... > > > > > > > > Hmm, could we capture this as a comment somewhere? > > > > > > Sure, but where? We're planning on getting rid of the KM_ZONE flags > > > in the near future, and most of this is specific to the impacts on > > > XFS. I could put it in xfs-super.c above where we initialise all the > > > slabs, I guess. Probably a separate patch, though.... > > > > Sounds like a reasonable place (to me) to record the fact that we want > > inodes and metadata buffers not to end up concentrating on a single node. > > Ok. I'll add yet another patch to the preliminary part of the > series. Any plans to take any of these first few patches in this > cycle? I think I have time to review patches this week. :) (As it occurs to me that the most recent submission of this series predates this reply, and that's why he couldn't find the patch with a longer description...) > > At least until we start having NUMA systems with a separate "IO node" in > > which to confine all the IO threads and whatnot <shudder>. :P > > Been there, done that, got the t-shirt and wore it out years ago. > > IO-only nodes (either via software configuration, or real > cpu/memory-less IO nodes) are one of the reasons we don't want > node-local allocation behaviour for large NUMA configs... Same, though purely by accident -- CPU mounting bracket cracks, CPU falls out of machine, but the IO complex is still running, so when the machine restarts it has a weird NUMA node containing IO but no CPU... --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
```diff
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e484f6bead53..45b470f55ad7 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -324,6 +324,9 @@ xfs_buf_free(
 
 			__free_page(page);
 		}
+		if (current->reclaim_state)
+			current->reclaim_state->reclaimed_slab +=
+							bp->b_page_count;
 	} else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
@@ -2064,7 +2067,8 @@ int __init
 xfs_buf_init(void)
 {
 	xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
-						KM_ZONE_HWALIGN, NULL);
+			KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
+			NULL);
 	if (!xfs_buf_zone)
 		goto out;
```
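For reference, the KM_ZONE_* names used in the hunk above were thin wrappers around SLAB flags. The following is an approximate sketch of the fs/xfs/kmem.h definitions as they looked around this time, paraphrased and illustrative rather than exact:

```c
/*
 * Approximate sketch of the fs/xfs/kmem.h wrappers used above (paraphrased
 * from that era's tree). The KM_ZONE_* names alias SLAB_* flags that are
 * passed straight through to kmem_cache_create(), which is why open-coding
 * the call is a plausible cleanup.
 */
#define KM_ZONE_HWALIGN	SLAB_HWCACHE_ALIGN	/* cacheline-align objects */
#define KM_ZONE_RECLAIM	SLAB_RECLAIM_ACCOUNT	/* account as reclaimable slab */
#define KM_ZONE_SPREAD	SLAB_MEM_SPREAD		/* allow cpuset slab spreading */
#define KM_ZONE_ACCOUNT	SLAB_ACCOUNT		/* charge allocations to memcg */

static inline kmem_zone_t *
kmem_zone_init_flags(int size, char *zone_name, slab_flags_t flags,
		     void (*construct)(void *))
{
	return kmem_cache_create(zone_name, size, 0, flags, construct);
}
```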