
[v4,0/9] Create large folios in iomap buffered write path

Message ID 20230710130253.3484695-1-willy@infradead.org


Matthew Wilcox (Oracle) July 10, 2023, 1:02 p.m. UTC
Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
to improve worst-case latency.  Unfortunately, this had the effect of
limiting the performance of:

fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
        -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
        -numjobs=4 -directory=/mnt/test

The problem ends up being lock contention on the i_pages spinlock as we
clear the writeback bit on each folio (and propagate that up through
the tree).  By using larger folios, we decrease the number of folios
to be processed by a factor of 256 for this benchmark, eliminating the
lock contention.

It's also the right thing to do.  This is a project that has been on
the back burner for years, it just hasn't been important enough to do
before now.

I think it's probably best if this goes through the iomap tree since
the changes outside iomap are either to the page cache or they're
trivial.

v4:
 - Rebase on v6.5-rc1
 - Add documentation for fgf_set_order()
 - Improve documentation for iomap_get_page()
 - Fix crash caused by shadow kaddr name in copy_page_from_iter_atomic()
 - Add copy_folio_from_iter_atomic()
 - Improve comment in iomap_release_folio()
 - Use __GFP_NORETRY | __GFP_NOWARN to allocate large folios
 - Change an 'unsigned long bytes' to 'size_t bytes' for consistency
 - Restructure iomap_write_iter() loop slightly

v3:
 - Fix the handling of compound highmem pages in copy_page_from_iter_atomic()
 - Rename fgp_t to fgf_t
 - Clarify some wording in the documentation

v2:
 - Fix misplaced semicolon
 - Rename fgp_order to fgp_set_order
 - Rename FGP_ORDER to FGP_GET_ORDER
 - Add fgp_t
 - Update the documentation for ->release_folio
 - Fix iomap_invalidate_folio()
 - Update iomap_release_folio()

Matthew Wilcox (Oracle) (9):
  iov_iter: Handle compound highmem pages in
    copy_page_from_iter_atomic()
  iov_iter: Add copy_folio_from_iter_atomic()
  iomap: Remove large folio handling in iomap_invalidate_folio()
  doc: Correct the description of ->release_folio
  iomap: Remove unnecessary test from iomap_release_folio()
  filemap: Add fgf_t typedef
  filemap: Allow __filemap_get_folio to allocate large folios
  iomap: Create large folios in the buffered write path
  iomap: Copy larger chunks from userspace

 Documentation/filesystems/locking.rst | 15 +++--
 fs/btrfs/file.c                       |  6 +-
 fs/f2fs/compress.c                    |  2 +-
 fs/f2fs/f2fs.h                        |  2 +-
 fs/gfs2/bmap.c                        |  2 +-
 fs/iomap/buffered-io.c                | 54 +++++++++---------
 include/linux/iomap.h                 |  2 +-
 include/linux/pagemap.h               | 82 +++++++++++++++++++++++----
 include/linux/uio.h                   |  9 ++-
 lib/iov_iter.c                        | 43 +++++++++-----
 mm/filemap.c                          | 65 +++++++++++----------
 mm/folio-compat.c                     |  2 +-
 mm/readahead.c                        | 13 -----
 13 files changed, 187 insertions(+), 110 deletions(-)

Comments

Luis Chamberlain July 10, 2023, 10:55 p.m. UTC | #1
On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> to improve worst-case latency.  Unfortunately, this had the effect of
> limiting the performance of:
> 
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
>         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
>         -numjobs=4 -directory=/mnt/test

When you say performance, do you mean overall throughput / IOPS /
latency or all?

And who noticed it / reported it? The above incantation seems pretty
specific so I'm curious who runs that test and what sort of workflow
it is trying to replicate.

> The problem ends up being lock contention on the i_pages spinlock as we
> clear the writeback bit on each folio (and propagate that up through
> the tree).  By using larger folios, we decrease the number of folios
> to be processed by a factor of 256 for this benchmark, eliminating the
> lock contention.

The implication here seems to be that the associated cost of searching
for a larger folio is pretty negligible compared to the gains of finding
one. That seems nice, but it gets me wondering whether there are other
benchmarks that see penalties instead.

Ie, is the above a microbenchmark where this yields good results?

> It's also the right thing to do.  This is a project that has been on
> the back burner for years, it just hasn't been important enough to do
> before now.

Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths
in writeback") dates back just one year, so it gets me wondering how a
project that has been on the back burner for years now finds motivation
in a one-year-old regression.

What was the original motivation of the older project dating this
effort back to its inception?

  Luis
Matthew Wilcox (Oracle) July 10, 2023, 11:53 p.m. UTC | #2
On Mon, Jul 10, 2023 at 03:55:17PM -0700, Luis Chamberlain wrote:
> On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> > to improve worst-case latency.  Unfortunately, this had the effect of
> > limiting the performance of:
> > 
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
> >         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
> >         -numjobs=4 -directory=/mnt/test
> 
> When you say performance, do you mean overall throughput / IOPS /
> latency or all?

This is buffered I/O, so when we run out of RAM, we block until the
dirty pages are written back.  I suppose that makes it throughput, but
it's throughput from the bottom of the page cache to storage, not the
usual app-to-page-cache bottleneck.

> And who noticed it / reported it? The above incantation seems pretty
> specific so I'm curious who runs that test and what sort of workflow
> it is trying to replicate.

Wang Yugui, who is on the cc, reported it.
https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@e16-tech.com/

> > The problem ends up being lock contention on the i_pages spinlock as we
> > clear the writeback bit on each folio (and propagate that up through
> > the tree).  By using larger folios, we decrease the number of folios
> > to be processed by a factor of 256 for this benchmark, eliminating the
> > lock contention.
> 
> The implication here seems to be that the associated cost of searching
> for a larger folio is pretty negligible compared to the gains of finding
> one. That seems nice, but it gets me wondering whether there are other
> benchmarks that see penalties instead.
> 
> Ie, is the above a microbenchmark where this yields good results?

It happens to be constructed in such a way that it yields the optimum
outcome in this case, but that clearly wasn't deliberate.  And the
solution is the one I had in mind from before the bug report came in.

I don't think you'll be able to find a benchmark that regresses as
a result of this, but I welcome your attempt!

> > It's also the right thing to do.  This is a project that has been on
> > the back burner for years, it just hasn't been important enough to do
> > before now.
> 
> Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths
> in writeback") dates back just one year, so it gets me wondering how a
> project that has been on the back burner for years now finds motivation
> in a one-year-old regression.
> 
> What was the original motivation of the older project dating this
> effort back to its inception?

We should create larger folios on write.  It just wasn't important
enough to do before now.  But now there's an actual user who cares.
Dave Chinner July 11, 2023, 12:01 a.m. UTC | #3
On Mon, Jul 10, 2023 at 03:55:17PM -0700, Luis Chamberlain wrote:
> On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> > to improve worst-case latency.  Unfortunately, this had the effect of
> > limiting the performance of:
> > 
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
> >         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
> >         -numjobs=4 -directory=/mnt/test
> 
> When you say performance, do you mean overall throughput / IOPS /
> latency or all?
> 
> And who noticed it / reported it?

https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@e16-tech.com/

> The above incantation seems pretty
> specific so I'm curious who runs that test and what sort of workflow
> it is trying to replicate.

Not specific at all. It's just a basic concurrent sequential
buffered write performance test. It needs multiple jobs to max out
typical cheap PCIe 4.0 NVMe SSD storage (i.e. 6-8GB/s) because
sequential non-async buffered writes are CPU-bound at (typically)
2-3GB/s per file write.

> > The problem ends up being lock contention on the i_pages spinlock as we
> > clear the writeback bit on each folio (and propagate that up through
> > the tree).  By using larger folios, we decrease the number of folios
> > to be processed by a factor of 256 for this benchmark, eliminating the
> > lock contention.
> 
> The implication here seems to be that the associated cost of searching
> for a larger folio is pretty negligible compared to the gains of finding
> one. That seems nice, but it gets me wondering whether there are other
> benchmarks that see penalties instead.
> 
> Ie, is the above a microbenchmark where this yields good results?

No, the workload gains are general - they avoid the lock contention
problems involved with cycling, accounting and state changes for
millions of objects (order-0 folios) a second through a single
exclusive lock (mapping tree lock) by reducing the mapping tree lock
cycling by a couple of orders of magnitude.

> > It's also the right thing to do.  This is a project that has been on
> > the back burner for years, it just hasn't been important enough to do
> > before now.
> 
> Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths
> in writeback") dates back just one year, so it gets me wondering how a
> project that has been on the back burner for years now finds motivation
> in a one-year-old regression.

The change in ebb7fb1557b1 is just the straw that broke the camel's
back. It got rid of the massive IO batch processing we used to
minimise the inherent cross-process mapping tree contention in
buffered writes. i.e. the process doing write() calls, multiple
kswapds doing memory reclaim, writeback submission and writeback
completion all contend at the same time for the mapping tree lock.

We largely removed the IO submission and completion from the picture
with huge batch processing, but that then started causing latency
problems with IO completion processing. So we went back to smaller
chunks of IO submission and completion, and that means we went from
3 threads contending on the mapping tree lock to 5 threads. And that
drove the system into catastrophic lock breakdown on the mapping
tree lock.

And so -everything- then went really slow because each write() task
burns down at least 5 CPUs on the mapping tree lock....

This is not an XFS issue to solve - this is a general page cache
problem and we've always wanted to fix it in the page cache, either
by large folio support or by large block size support that required
aligned high-order pages in the page cache. Same solution - fewer
pages to iterate - but different methods...

> What was the original motivation of the older project dating this
> effort back to its inception?

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

That was run on a 64kB page size machine (itanic), but there were
signs that the mapping tree lock would be an issue in future.
Indeed, when these large SSI supercomputers moved to x86-64 (down to
4kB page size) a couple of years later, the mapping tree lock popped
up to the top of the list of buffered write throughput limiting
factors.

i.e. the larger the NUMA distances between the workload doing the
write and the node the mapping tree lock is located on, the slower
buffered writes go and the more CPU we burnt on the mapping tree
lock. We carefully worked around that with cpusets and locality
control, and the same was then done in HPC environments on these
types of machines and hence it wasn't an immediate limiting
factor...

But we're talking about multi-million dollar supercomputers here,
and in most cases people just rewrote the apps to use direct IO and
so the problem just went away and the HPC apps could easily use all
the performance the storage provided....

IOWs, we've known about this problem for over 15 years, but the
difference now is that consumer hardware is capable of >10GB/s write
speeds (current PCIe 5.0 NVMe SSDs) for just a couple of hundred
dollars, rather than the multi-million dollar machines in carefully
optimised environments on which we first saw it back in 2006...

Cheers,

Dave.