Message ID | 20240118222216.4131379-1-david@fromorbit.com
---|---
Series | xfs: use large folios for buffers
On Fri, Jan 19, 2024 at 09:19:38AM +1100, Dave Chinner wrote:
> The XFS buffer cache supports metadata buffers up to 64kB, and it does so
> by aggregating multiple pages into a single contiguous memory region using
> vmapping. This is expensive (both the setup and the runtime TLB mapping
> cost), and would be unnecessary if we could allocate large contiguous
> memory regions for the buffers in the first place.
>
> Enter multi-page folios.

LOL, hch and I just wrapped up making the xfbtree buffer cache work with
large folios coming from tmpfs. Though the use case there is simpler because
we require blocksize==PAGE_SIZE, forbid the use of highmem, and don't need
discontig buffers. Hence we sidestep vm_map_ram. :)

> This patchset converts the buffer cache to use the folio API, then enhances
> it to optimistically use large folios where possible. It retains the old
> "vmap an array of single page folios" functionality as a fallback when
> large folio allocation fails. This means that, like page cache support for
> large folios, we aren't dependent on large folio allocation succeeding all
> the time.
>
> This relegates the single page array allocation mechanism to a slow path
> whose performance we don't have to care so much about anymore. That might
> allow us to simplify it a bit in future.
>
> One of the issues with the folio conversion is that we use a couple of APIs
> that take struct page ** (i.e. pointers to page pointer arrays) and there
> aren't folio counterparts. These are the bulk page allocator and
> vm_map_ram(). In the cases where they are used, we cast &bp->b_folios[] to
> (struct page **) knowing that this array will only contain single page
> folios and that single page folios and struct page are the same structure
> and so have the same address. This is a bit of a hack (hence the RFC) but
> I'm not sure that it's worth adding folio versions of these interfaces
> right now. We don't need to use the bulk page allocator so much any more,
> because that's now a slow path and we could probably just call
> folio_alloc() in a loop like we used to. What to do about vm_map_ram() is
> a little less clear....

Yeah, that's what I suspected.

> The other issue I tripped over in doing this conversion is that the
> discontiguous buffer straddling code in the buf log item dirty region
> tracking is broken. We don't actually exercise that code on existing
> configurations, and I tripped over it when tracking down a bug in the
> folio conversion. I fixed it and short-circuited the check for contiguous
> buffers, but that didn't fix the failure I was seeing (which was not
> handling bp->b_offset and large folios properly when building bios).

Yikes.

> Apart from those issues, the conversion and enhancement is relatively
> straightforward. It passes fstests on both 512 and 4096 byte sector size
> storage (512 byte sectors exercise the XBF_KMEM path which has non-zero
> bp->b_offset values) and doesn't appear to cause any problems with large
> directory buffers, though I haven't done any real testing on those yet.
> Large folio allocations are definitely being exercised, though, as all the
> inode cluster buffers are 16kB on a 512 byte inode V5 filesystem.
>
> Thoughts, comments, etc?

Not yet.

> Note: this patchset is on top of the NOFS removal patchset I sent a
> few days ago. That can be pulled from this git branch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-kmem-cleanup

Oooh a branch link, thank you. It's so much easier if I can pull a branch
while picking through commits over gitweb.

--D
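
The allocation strategy described in the cover letter boils down to: try one
physically contiguous large folio first, failing fast, and only fall back to
vmapping an array of order-0 folios when that doesn't work. A rough sketch of
the idea follows; the xbuf structure and xbuf_alloc_mem() helper are
hypothetical names for illustration, not the code in the patchset.

	/* Hypothetical sketch of the fast/slow path split, not the patchset code. */
	#include <linux/mm.h>
	#include <linux/gfp.h>
	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	struct xbuf {
		void		*b_addr;	/* mapped buffer memory */
		struct folio	**b_folios;	/* folios backing the buffer */
		unsigned int	b_folio_count;
	};

	static int xbuf_alloc_mem(struct xbuf *bp, size_t size, gfp_t gfp)
	{
		unsigned int nr = DIV_ROUND_UP(size, PAGE_SIZE);
		struct folio *folio;
		unsigned int i;

		/*
		 * Fast path: optimistically ask for one physically contiguous
		 * large folio, but fail fast rather than letting the allocator
		 * grind through reclaim and compaction - the fallback below is
		 * always available.
		 */
		folio = folio_alloc(gfp | __GFP_NORETRY | __GFP_NOWARN,
				    get_order(size));
		if (folio) {
			bp->b_folios = kmalloc(sizeof(*bp->b_folios), gfp);
			if (!bp->b_folios) {
				folio_put(folio);
				return -ENOMEM;
			}
			bp->b_folios[0] = folio;
			bp->b_folio_count = 1;
			bp->b_addr = folio_address(folio);
			return 0;
		}

		/*
		 * Slow path: allocate order-0 folios and vmap them into a
		 * virtually contiguous range. An order-0 folio overlays a
		 * single struct page, so the folio array can be cast to
		 * struct page ** for vm_map_ram(), which has no folio-aware
		 * counterpart.
		 */
		bp->b_folios = kcalloc(nr, sizeof(*bp->b_folios), gfp);
		if (!bp->b_folios)
			return -ENOMEM;
		for (i = 0; i < nr; i++) {
			bp->b_folios[i] = folio_alloc(gfp, 0);
			if (!bp->b_folios[i])
				goto out_free;
		}
		bp->b_folio_count = nr;
		bp->b_addr = vm_map_ram((struct page **)bp->b_folios, nr,
					NUMA_NO_NODE);
		if (!bp->b_addr)
			goto out_free;
		return 0;

	out_free:
		while (i-- > 0)
			folio_put(bp->b_folios[i]);
		kfree(bp->b_folios);
		return -ENOMEM;
	}

The (struct page **) cast in the slow path is the hack the cover letter
refers to: an order-0 folio and its struct page share an address, so the
existing page-based vm_map_ram() interface can be reused until a folio-aware
variant exists.
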
On Mon, Jan 22, 2024 at 02:13:23AM -0800, Andi Kleen wrote:
> Dave Chinner <david@fromorbit.com> writes:
>
> > Thoughts, comments, etc?
>
> The interesting part is if it will cause additional tail latencies
> allocating under fragmentation with direct reclaim, compaction
> etc. being triggered before it falls back to the base page path.

It's not like I don't know these problems exist with memory allocation. Go
have a look at xlog_kvmalloc(), which is an open coded kvmalloc() that
allows the high order kmalloc allocations to fail fast without triggering
all the expensive and unnecessary direct reclaim overhead (e.g. compaction!)
because we can fall back to vmalloc without huge concerns. When high order
allocations start to fail, we fall back to vmalloc and then hit the long
standing vmalloc scalability problems before anything else in XFS or the IO
path becomes a bottleneck.

IOWs, we already know that fail-fast high-order allocation is a more
efficient and effective fast path than using vmalloc/vm_map_ram() all the
time.

As this is an RFC, I haven't implemented stuff like this yet - I haven't
seen anything in the profiles indicating that high order folio allocation
is failing and causing lots of reclaim overhead, so I simply haven't added
fail-fast behaviour yet...

> In fact it is highly likely it will, the question is just how bad it is.
>
> Unfortunately benchmarking for that isn't that easy, it needs artificial
> memory fragmentation and then some high stress workload, and then
> instrumenting the transactions for individual latencies.

I stress test and measure XFS metadata performance under sustained memory
pressure all the time. This change has not caused any obvious regressions
in the short time I've been testing it.

I still need to do perf testing on large directory block sizes. That is
where high-order allocations will get stressed - that's where
xlog_kvmalloc() starts dominating the profiles as it trips over vmalloc
scalability issues...

> I would in any case add a tunable for it in case people run into this.

No tunables. It either works or it doesn't. If we can't make it work
reliably by default, we throw it in the dumpster, light it on fire and
walk away.

> Tail latencies are a common concern on many IO workloads.

Yes, for user data operations it's a common concern. For metadata, not so
much - there are so many far worse long tail latencies in metadata
operations (like waiting for journal space) that memory allocation
latencies in the metadata IO path are largely noise....

-Dave.
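
The fail-fast pattern Dave points at looks roughly like the sketch below.
The real xlog_kvmalloc() lives in the XFS log code; failfast_kvmalloc() is
just an illustrative name.

	#include <linux/slab.h>
	#include <linux/vmalloc.h>

	/*
	 * Rough approximation of the fail-fast allocation pattern described
	 * above: let the high order kmalloc() give up quickly instead of
	 * entering direct reclaim and compaction, and fall back to vmalloc()
	 * when it fails.
	 */
	static void *failfast_kvmalloc(size_t size)
	{
		void *p;

		/*
		 * __GFP_NORETRY stops the page allocator from retrying hard
		 * (and compacting aggressively) on failure; __GFP_NOWARN
		 * silences the expected allocation failure warnings.
		 */
		p = kmalloc(size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
		if (!p)
			p = vmalloc(size);
		return p;
	}

The same trick carries over to the buffer cache folio conversion: pass the
fail-fast GFP flags on the large folio attempt and fall through to the
single-page vmap path instead of vmalloc().
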
[fixed the subject, not sure what happened there]

FWIW I'm not sure fail-fast is always the right strategy here; in many
cases, even with some reclaim, compaction may win. Just not if you're on a
tight budget for the latencies.

> I stress test and measure XFS metadata performance under sustained
> memory pressure all the time. This change has not caused any
> obvious regressions in the short time I've been testing it.

Did you test for tail latencies?

There are some relatively simple ways to trigger memory fragmentation; the
standard way is to allocate a very large THP backed file and then punch a
lot of holes.

> I still need to do perf testing on large directory block sizes. That is
> where high-order allocations will get stressed - that's where
> xlog_kvmalloc() starts dominating the profiles as it trips over vmalloc
> scalability issues...

Yes, that's true. vmalloc has many issues, although with the recent patches
to split the rbtrees with separate locks it may now look quite different
than before.

> > I would in any case add a tunable for it in case people run into this.
>
> No tunables. It either works or it doesn't. If we can't make it work
> reliably by default, we throw it in the dumpster, light it on fire and
> walk away.

I'm not sure there is a single definition of "reliably" here -- for many
workloads tail latencies don't matter, so it's always reliable, as long as
you have good aggregate throughput.

Others have very high expectations for them.

Forcing the high expectations on everyone is probably not a good general
strategy though, as there are general trade offs.

I could see that having lots of small tunables for every use might not be a
good idea. Perhaps there would be a case for a single general tunable that
controls higher order folios for everyone.

> > Tail latencies are a common concern on many IO workloads.
>
> Yes, for user data operations it's a common concern. For metadata, not so
> much - there are so many far worse long tail latencies in metadata
> operations (like waiting for journal space) that memory allocation
> latencies in the metadata IO path are largely noise....

I've seen pretty long stalls in the past.

The difference to the journal is also that it is local to the file system,
while the memory is normally shared with everyone on the node or system. So
the scope of noisy neighbour impact can be quite different, especially on a
large machine.

-Andi
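
One way to set up the fragmentation Andi describes is sketched below, under
a few assumptions: a tmpfs mount with huge=always at a hypothetical
/mnt/thp-tmpfs, 2MiB THPs, and a file sized to consume most of RAM. The
idea is to fault in huge pages and then punch holes that leave one base
page pinned in every former 2MiB block, so plenty of memory is free but
none of it is contiguous at high order.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HUGE_SZ	(2UL << 20)	/* assume 2MiB THPs */
	#define FILE_SZ	(64UL << 30)	/* pick something close to RAM size */

	int main(void)
	{
		/* tmpfs mounted with huge=always; path is illustrative */
		int fd = open("/mnt/thp-tmpfs/frag", O_RDWR | O_CREAT, 0600);
		unsigned long off;
		char *p;

		if (fd < 0 || ftruncate(fd, FILE_SZ))
			return 1;
		p = mmap(NULL, FILE_SZ, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);
		if (p == MAP_FAILED)
			return 1;
		madvise(p, FILE_SZ, MADV_HUGEPAGE);

		/* fault in every huge page */
		for (off = 0; off < FILE_SZ; off += HUGE_SZ)
			p[off] = 1;

		/*
		 * Punch out all but the first 4KiB of each huge page: most
		 * memory is freed, but every former 2MiB block keeps one
		 * pinned page, so no high-order contiguous regions remain.
		 */
		for (off = 0; off < FILE_SZ; off += HUGE_SZ)
			fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				  off + 4096, HUGE_SZ - 4096);

		pause();	/* hold the pinned pages while running the test */
		return 0;
	}
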
On Mon, Jan 22, 2024 at 05:34:12AM -0800, Andi Kleen wrote:
> [fixed the subject, not sure what happened there]
>
> FWIW I'm not sure fail-fast is always the right strategy here; in many
> cases, even with some reclaim, compaction may win. Just not if you're on
> a tight budget for the latencies.
>
> > I stress test and measure XFS metadata performance under sustained
> > memory pressure all the time. This change has not caused any
> > obvious regressions in the short time I've been testing it.
>
> Did you test for tail latencies?

No, it's an RFC, and I mostly don't care about memory allocation tail
latencies because it is highly unlikely to be a *new issue* we need to care
about.

The fact is that we already do so much memory allocation and high order
memory allocation (e.g. through slub, xlog_kvmalloc(), user data IO through
the page cache, etc) that if there's a long tail latency problem with high
order memory allocation then it will already be noticeably affecting the
XFS data and metadata IO latencies. Nobody is reporting problems about
excessive long tail latencies when using XFS, so my care factor about long
tail latencies in this specific memory allocation case is close to zero.

> There are some relatively simple ways to trigger memory fragmentation;
> the standard way is to allocate a very large THP backed file and then
> punch a lot of holes.

Or just run a filesystem with lots of variable sized high order allocations
and varying cached object life times under sustained memory pressure for a
significant period of time....

> > > I would in any case add a tunable for it in case people run into this.
> >
> > No tunables. It either works or it doesn't. If we can't make it work
> > reliably by default, we throw it in the dumpster, light it on fire and
> > walk away.
>
> I'm not sure there is a single definition of "reliably" here -- for many
> workloads tail latencies don't matter, so it's always reliable, as long
> as you have good aggregate throughput.
>
> Others have very high expectations for them.
>
> Forcing the high expectations on everyone is probably not a good general
> strategy though, as there are general trade offs.

Yup, and we have to make those tradeoffs because filesystems need to be
good at "general purpose" workloads as their primary focus. Minimising long
tail latencies is really just "best effort only" because we are intimately
aware of the fact that there are global resource limitations that cause
long tail latencies in filesystem implementations that simply cannot be
worked around.

> I could see that having lots of small tunables for every use might not be
> a good idea. Perhaps there would be a case for a single general tunable
> that controls higher order folios for everyone.

You're not listening: no tunables. Code that has tunables because you think
it *may not work* is code that is not ready to be merged.

> > > Tail latencies are a common concern on many IO workloads.
> >
> > Yes, for user data operations it's a common concern. For metadata, not
> > so much - there are so many far worse long tail latencies in metadata
> > operations (like waiting for journal space) that memory allocation
> > latencies in the metadata IO path are largely noise....
>
> I've seen pretty long stalls in the past.
>
> The difference to the journal is also that it is local to the file
> system, while the memory is normally shared with everyone on the node or
> system. So the scope of noisy neighbour impact can be quite different,
> especially on a large machine.

Most systems run everything on a single filesystem. That makes the
filesystem just as global a resource as memory allocation. If the journal
bottlenecks, everyone on the system suffers the performance degradation,
not just the user who caused it.

-Dave.