Message ID: 161340385320.1303470.2392622971006879777.stgit@warthog.procyon.org.uk (mailing list archive)
Series: Network fs helper library & fscache kiocb API [ver #3]
On Mon, 2021-02-15 at 15:44 +0000, David Howells wrote:
> Here's a set of patches to do two things:
>
> (1) Add a helper library to handle the new VM readahead interface. This
>     is intended to be used unconditionally by the filesystem (whether or
>     not caching is enabled) and provides a common framework for doing
>     caching, transparent huge pages and, in the future, possibly fscrypt
>     and read bandwidth maximisation. It also allows the netfs and the
>     cache to align, expand and slice up a read request from the VM in
>     various ways; the netfs need only provide a function to read a
>     stretch of data to the pagecache and the helper takes care of the
>     rest.
>
> (2) Add an alternative fscache/cachefiles I/O API that uses the kiocb
>     facility to do async DIO to transfer data to/from the netfs's pages,
>     rather than using readpage with wait queue snooping on one side and
>     vfs_write() on the other. It also uses less memory, since it doesn't
>     do buffered I/O on the backing file.
>
>     Note that this uses SEEK_HOLE/SEEK_DATA to locate the data available
>     to be read from the cache. Whilst this is an improvement over the
>     bmap interface, it still has a problem with regard to a modern
>     extent-based filesystem inserting or removing bridging blocks of
>     zeros. Fixing that requires a much greater overhaul.
>
> This is a step towards overhauling the fscache API. The change is opt-in
> on the part of the network filesystem. A netfs should not try to mix the
> old and the new API because of conflicting ways of handling pages and
> the PG_fscache page flag, and because it would be mixing DIO with
> buffered I/O. Further, the helper library can't be used with the old
> API.
>
> This does not change any of the fscache cookie handling APIs or the way
> invalidation is done.
>
> In the near term, I intend to deprecate and remove the old I/O API
> (fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(),
> fscache_write_page() and fscache_uncache_page()) and eventually replace
> most of fscache/cachefiles with something simpler and easier to follow.
>
> The patchset contains five parts:
>
> (1) Some helper patches, including provision of an ITER_XARRAY iov
>     iterator and a function to do readahead expansion.
>
> (2) Patches to add the netfs helper library.
>
> (3) A patch to add the fscache/cachefiles kiocb API.
>
> (4) Patches to add support in AFS for this.
>
> (5) Patches from Jeff Layton to add support in Ceph for this.
>
> Dave Wysochanski also has patches for NFS for this, though they're not
> included on this branch as there's an issue with pNFS.
>
> With this, AFS without a cache passes all expected xfstests; with a
> cache, there's an extra failure, but that's also there before these
> patches. Fixing that probably requires a greater overhaul. Ceph and NFS
> also pass the expected tests.
>
> These patches can also be found on:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-netfs-lib
>
> For diffing reference, the tag for the 9th Feb pull request is
> fscache-ioapi-20210203 and can be found in the same repository.
>
>
> Changes
> =======
>
> (v3) Rolled in the bug fixes.
>
>      Adjusted the functions that unlock and wait for PG_fscache
>      according to Linus's suggestion.
>
>      Hold a ref on a page when PG_fscache is set, as per Linus's
>      suggestion.
>
>      Dropped NFS support and added Ceph support.
>
> (v2) Fixed some bugs and added NFS support.
>
>
> References
> ==========
>
> These patches have been published for review before, firstly as part of
> a larger set:
>
> Link: https://lore.kernel.org/linux-fsdevel/158861203563.340223.7585359869938129395.stgit@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/159465766378.1376105.11619976251039287525.stgit@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/159465784033.1376674.18106463693989811037.stgit@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/159465821598.1377938.2046362270225008168.stgit@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/160588455242.3465195.3214733858273019178.stgit@warthog.procyon.org.uk/
>
> Then as a cut-down set:
>
> Link: https://lore.kernel.org/linux-fsdevel/161118128472.1232039.11746799833066425131.stgit@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/161161025063.2537118.2009249444682241405.stgit@warthog.procyon.org.uk/
>
> Proposals/information about the design have been published here:
>
> Link: https://lore.kernel.org/lkml/24942.1573667720@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/2758811.1610621106@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/1441311.1598547738@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/160655.1611012999@warthog.procyon.org.uk/
>
> And requests for information:
>
> Link: https://lore.kernel.org/linux-fsdevel/3326.1579019665@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/4467.1579020509@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/3577430.1579705075@warthog.procyon.org.uk/
>
> The NFS parts, though not included here, have been tested by someone
> who's using fscache in production:
>
> Link: https://listman.redhat.com/archives/linux-cachefs/2020-December/msg00000.html
>
> I've posted partial patches to try and help 9p and cifs along:
>
> Link: https://lore.kernel.org/linux-fsdevel/1514086.1605697347@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-cifs/1794123.1605713481@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-fsdevel/241017.1612263863@warthog.procyon.org.uk/
> Link: https://lore.kernel.org/linux-cifs/270998.1612265397@warthog.procyon.org.uk/
>
> David
> ---
> David Howells (27):
>       iov_iter: Add ITER_XARRAY
>       mm: Add an unlock function for PG_private_2/PG_fscache
>       mm: Implement readahead_control pageset expansion
>       vfs: Export rw_verify_area() for use by cachefiles
>       netfs: Make a netfs helper module
>       netfs, mm: Move PG_fscache helper funcs to linux/netfs.h
>       netfs, mm: Add unlock_page_fscache() and wait_on_page_fscache()
>       netfs: Provide readahead and readpage netfs helpers
>       netfs: Add tracepoints
>       netfs: Gather stats
>       netfs: Add write_begin helper
>       netfs: Define an interface to talk to a cache
>       netfs: Hold a ref on a page when PG_private_2 is set
>       fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
>       afs: Disable use of the fscache I/O routines
>       afs: Pass page into dirty region helpers to provide THP size
>       afs: Print the operation debug_id when logging an unexpected data version
>       afs: Move key to afs_read struct
>       afs: Don't truncate iter during data fetch
>       afs: Log remote unmarshalling errors
>       afs: Set up the iov_iter before calling afs_extract_data()
>       afs: Use ITER_XARRAY for writing
>       afs: Wait on PG_fscache before modifying/releasing a page
>       afs: Extract writeback extension into its own function
>       afs: Prepare for use of THPs
>       afs: Use the fs operation ops to handle FetchData completion
>       afs: Use new fscache read helper API
>
> Jeff Layton (6):
>       ceph: disable old fscache readpage handling
>       ceph: rework PageFsCache handling
>       ceph: fix fscache invalidation
>       ceph: convert readpage to fscache read helper
>       ceph: plug write_begin into read helper
>       ceph: convert ceph_readpages to ceph_readahead
>
>  fs/Kconfig                    |    1 +
>  fs/Makefile                   |    1 +
>  fs/afs/Kconfig                |    1 +
>  fs/afs/dir.c                  |  225 ++++---
>  fs/afs/file.c                 |  470 ++++---------
>  fs/afs/fs_operation.c         |    4 +-
>  fs/afs/fsclient.c             |  108 +--
>  fs/afs/inode.c                |    7 +-
>  fs/afs/internal.h             |   58 +-
>  fs/afs/rxrpc.c                |  150 ++---
>  fs/afs/write.c                |  610 +++++++++--------
>  fs/afs/yfsclient.c            |   82 +--
>  fs/cachefiles/Makefile        |    1 +
>  fs/cachefiles/interface.c     |    5 +-
>  fs/cachefiles/internal.h      |    9 +
>  fs/cachefiles/rdwr2.c         |  412 ++++++++++++
>  fs/ceph/Kconfig               |    1 +
>  fs/ceph/addr.c                |  535 ++++++---------
>  fs/ceph/cache.c               |  125 ----
>  fs/ceph/cache.h               |  101 +--
>  fs/ceph/caps.c                |   10 +-
>  fs/ceph/inode.c               |    1 +
>  fs/ceph/super.h               |    1 +
>  fs/fscache/Kconfig            |    1 +
>  fs/fscache/Makefile           |    3 +-
>  fs/fscache/internal.h         |    3 +
>  fs/fscache/page.c             |    2 +-
>  fs/fscache/page2.c            |  117 ++++
>  fs/fscache/stats.c            |    1 +
>  fs/internal.h                 |    5 -
>  fs/netfs/Kconfig              |   23 +
>  fs/netfs/Makefile             |    5 +
>  fs/netfs/internal.h           |   97 +++
>  fs/netfs/read_helper.c        | 1169 +++++++++++++++++++++++++++++++++
>  fs/netfs/stats.c              |   59 ++
>  fs/read_write.c               |    1 +
>  include/linux/fs.h            |    1 +
>  include/linux/fscache-cache.h |    4 +
>  include/linux/fscache.h       |   40 +-
>  include/linux/netfs.h         |  195 ++++++
>  include/linux/pagemap.h       |    3 +
>  include/net/af_rxrpc.h        |    2 +-
>  include/trace/events/afs.h    |   74 +--
>  include/trace/events/netfs.h  |  201 ++++++
>  mm/filemap.c                  |   20 +
>  mm/readahead.c                |   70 ++
>  net/rxrpc/recvmsg.c           |    9 +-
>  47 files changed, 3473 insertions(+), 1550 deletions(-)
>  create mode 100644 fs/cachefiles/rdwr2.c
>  create mode 100644 fs/fscache/page2.c
>  create mode 100644 fs/netfs/Kconfig
>  create mode 100644 fs/netfs/Makefile
>  create mode 100644 fs/netfs/internal.h
>  create mode 100644 fs/netfs/read_helper.c
>  create mode 100644 fs/netfs/stats.c
>  create mode 100644 include/linux/netfs.h
>  create mode 100644 include/trace/events/netfs.h

Thanks David,

I did an xfstests run on ceph with a kernel based on this and it seemed
to do fine.
I'll plan to pull this into the ceph-client/testing branch and run it through the ceph kclient test harness. There are only a few differences from the last run we did, so I'm not expecting big changes, but I'll keep you posted.
Jeff,

What are the performance differences you are seeing (positive or negative) with ceph and netfs, especially with simple examples like file copy or grep of large files?

It could be good if netfs simplifies the problem experienced by network filesystems on Linux with readahead on large sequential reads - where we don't get as much parallelism due to only having one readahead request at a time (thus in many cases there is 'dead time' on either the network or the file server while waiting for the next readpages request to be issued). This can be a significant performance problem for current readpages when network latency is long (or e.g. in cases when network encryption is enabled and hardware offload is not available, so it is time-consuming on the server or client to encrypt the packet).

Do you see netfs much faster than current readpages for ceph?

Have you been able to get much benefit from throttling readahead with ceph from the current netfs approach for clamping i/o?

On Mon, Feb 15, 2021 at 12:08 PM Jeff Layton <jlayton@redhat.com> wrote:
> [full cover letter and xfstests report quoted above -- snipped]
>
> --
> Jeff Layton <jlayton@redhat.com>
On Mon, Feb 15, 2021 at 06:40:27PM -0600, Steve French wrote:
> It could be good if netfs simplifies the problem experienced by
> network filesystems on Linux with readahead on large sequential reads
> - where we don't get as much parallelism due to only having one
> readahead request at a time (thus in many cases there is 'dead time'
> on either the network or the file server while waiting for the next
> readpages request to be issued). [...]
>
> Do you see netfs much faster than current readpages for ceph?
>
> Have you been able to get much benefit from throttling readahead with
> ceph from the current netfs approach for clamping i/o?

The switch from readpages to readahead does help in a couple of corner cases. For example, if you have two processes reading the same file at the same time, one will now block on the other (due to the page lock) rather than submitting a mess of overlapping and partial reads.

We're not there yet on having multiple outstanding reads. Bill and I had a chat recently about how to make the readahead code detect that it is in a "long fat pipe" situation (as opposed to just dealing with a slow device), and submit extra readahead requests to make best use of the bandwidth and minimise blocking of the application.

That's not something for the netfs code to do, though; we can get into that situation with highly parallel SSDs.
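[Editorial note: the 'dead time' Steve describes and the "long fat pipe" case Willy mentions are both bandwidth-delay-product effects. A rough back-of-envelope model, with purely illustrative numbers rather than measurements from this thread, might look like:]

```python
def in_flight_bytes_needed(link_bytes_per_sec: float, rtt_sec: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to keep the link busy."""
    return link_bytes_per_sec * rtt_sec

def link_utilization(window_bytes: float, link_bytes_per_sec: float,
                     rtt_sec: float) -> float:
    """Fraction of link capacity used when only `window_bytes` of reads are
    outstanding at a time (one readahead window submitted, then wait)."""
    bdp = in_flight_bytes_needed(link_bytes_per_sec, rtt_sec)
    return min(1.0, window_bytes / bdp)

# Example: 1 Gbps (~125 MB/s) with 20 ms round-trip time needs ~2.5 MB in flight.
bdp = in_flight_bytes_needed(125e6, 0.020)
print(f"BDP: {bdp / 1e6:.1f} MB")

# A single 1 MB readahead window keeps such a link only ~40% busy --
# the remaining 60% is the 'dead time' between readpages requests.
print(f"utilization with one 1 MB window: {link_utilization(1e6, 125e6, 0.020):.0%}")
```

With those numbers, submitting several windows concurrently (or one window larger than the BDP) is what closes the gap; a local SSD with microsecond latencies rarely hits this.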
On Mon, Feb 15, 2021 at 8:10 PM Matthew Wilcox <willy@infradead.org> wrote:
> [quoted in full above -- snipped]

This (readahead behavior improvements in Linux, on single large file sequential read workloads like cp or grep) gets particularly interesting with SMB3 as multichannel becomes more common.
With one channel, having one readahead request pending on the network is suboptimal - but it is not as bad as when multichannel is negotiated. Interestingly, in most cases two network connections to the same server (different TCP sockets, but the same mount, even in cases where only one network adapter is present) can achieve better performance - but it still significantly lags Windows (and probably other clients), since on Linux we don't keep multiple I/Os in flight at one time (unless different files are being read at the same time by different threads).

As network adapters are added and removed from the server (other clients typically poll to detect interface changes, and SMB3 also leverages the "witness protocol" to get notification of adapter additions or removals), it would be helpful to change the maximum number of readahead requests in flight. In addition, as the server throttles back (reducing the number of 'credits' granted to the client), it will be important to give hints to the readahead logic about reducing the number of readahead requests in flight.

Keeping multiple readahead requests in flight is easier to imagine when multiple processes are copying or reading files, but there are many scenarios where we could do better at parallelizing a single process doing a copy by ensuring that there is no 'dead time' on the network.
On Mon, Feb 15, 2021 at 8:10 PM Matthew Wilcox <willy@infradead.org> wrote:
> The switch from readpages to readahead does help in a couple of corner
> cases. For example, if you have two processes reading the same file at
> the same time, one will now block on the other (due to the page lock)
> rather than submitting a mess of overlapping and partial reads.

Do you have a simple repro example of this we could try (fio, dbench, iozone etc.) to get some objective perf data?

My biggest worry is making sure that the switch to netfs doesn't degrade performance (which might be a low bar now, since current network file copy perf seems to significantly lag at least Windows), and in some easy-to-understand scenarios I want to make sure it actually helps perf.
On Mon, 2021-02-15 at 18:40 -0600, Steve French wrote:
> What are the performance differences you are seeing (positive or
> negative) with ceph and netfs, especially with simple examples like
> file copy or grep of large files?
> [...]
> Do you see netfs much faster than current readpages for ceph?
>
> Have you been able to get much benefit from throttling readahead with
> ceph from the current netfs approach for clamping i/o?

I haven't seen big performance differences at all with this set. It's pretty much a wash, and it doesn't seem to change how the I/Os are ultimately driven on the wire. For instance, the clamp_length op basically just mirrors what ceph does today -- it ensures that the length of the I/O can't go past the end of the current object.

The main benefits are that we get a large swath of readpage, readpages and write_begin code out of ceph altogether. All of the netfses need to gather and vet pages for I/O, etc. Most of that doesn't have anything to do with the filesystem itself. By offloading that into the netfs lib, most of it is taken care of for us and we don't need to bother with doing it ourselves.
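[Editorial note: the object-boundary clamping Jeff describes can be modelled outside the kernel. This is a hypothetical userspace sketch - the function name and the 4 MB object size (Ceph's default stripe object size) are illustrative, not the actual netfs/ceph kernel API:]

```python
OBJECT_SIZE = 4 * 1024 * 1024  # Ceph's default RADOS object size (4 MB)

def clamp_length(start: int, length: int, object_size: int = OBJECT_SIZE) -> int:
    """Clamp a read so it never crosses the end of the object containing
    `start`, mirroring what the clamp_length op is described as doing."""
    offset_in_object = start % object_size
    return min(length, object_size - offset_in_object)

# A 6 MB read starting 1 MB into an object is clamped to the 3 MB remaining
# in that object; the helper library issues further subrequests for the rest.
print(clamp_length(1 * 1024 * 1024, 6 * 1024 * 1024))  # 3145728 (3 MB)
```

The point of hooking this into the helper library is that the splitting loop, page gathering and completion handling live in one place, and each filesystem only supplies the boundary rule.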
On Mon, Feb 15, 2021 at 11:22:20PM -0600, Steve French wrote:
> Do you have a simple repro example of this we could try (fio, dbench,
> iozone etc.) to get some objective perf data?

I don't. The problem was noted by the f2fs people, so maybe they have a reproducer.

> My biggest worry is making sure that the switch to netfs doesn't degrade
> performance [...] and in some easy-to-understand scenarios I want to
> make sure it actually helps perf.

I had a question about that ... you've mentioned having 4x4MB reads outstanding as being the way to get optimum performance. Is there a significant performance difference between 4x4MB, 16x1MB and 64x256kB?

I'm concerned about having "too large" an I/O on the wire at a given time. For example, a 1Gbps link carries at most about 125MB/s. That's a minimum latency of about 32us for a 4kB page, but about 32ms for a 4MB read.

"For very simple tasks, people can perceive latencies down to 2 ms or less" (https://danluu.com/input-lag/), so going all the way to 4MB I/Os takes us well into the perceptible latency range, whereas a 256kB I/O is only around 2ms.

So could you do some experiments with fio doing direct I/O to see if it takes significantly longer to do, say, 1TB of I/O in 4MB chunks vs 256kB chunks? Obviously use threads to keep lots of I/Os outstanding.
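[Editorial note: the serialization-delay arithmetic here is easy to check. A 1Gbps link carries at most about 125MB/s, so the per-chunk wire times work out as below; this is an illustrative sketch that ignores protocol overhead and pipelining:]

```python
LINK_BYTES_PER_SEC = 125e6  # 1 Gbps ~= 125 MB/s of payload, at best

# Time to serialize one chunk onto the wire, for the chunk sizes discussed.
for size in (4 * 1024, 256 * 1024, 1024 * 1024, 4 * 1024 * 1024):
    ms = size / LINK_BYTES_PER_SEC * 1000
    print(f"{size // 1024:>5} kB chunk: {ms:8.3f} ms on the wire")
```

A 256kB chunk sits right at the ~2ms perception threshold cited above, while a 4MB chunk is an order of magnitude past it, which is the basis of the "too large an I/O" concern.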
On Tue, Feb 23, 2021 at 2:28 PM Matthew Wilcox <willy@infradead.org> wrote:
> [snip]
>
> So could you do some experiments with fio doing direct I/O to see if
> it takes significantly longer to do, say, 1TB of I/O in 4MB chunks vs
> 256kB chunks? Obviously use threads to keep lots of I/Os outstanding.
That is a good question and it has been months since I have done
experiments with something similar.  Obviously this will vary depending
on RDMA or not and multichannel or not - but assuming the 'normal' low
end network configuration - ie a 1Gbps link and no RDMA or multichannel -
I could do some more recent experiments.

In the past what I had noticed was that server performance for simple
workloads like cp or grep increased with network I/O size up to a point:
smaller than a 256K packet size was bad.  Performance improved
significantly from 256K to 512K to 1MB, but only very slightly from 1MB
to 2MB to 4MB, and sometimes degraded at 8MB (IIRC 8MB is the max
commonly supported by SMB3 servers) - but this is with only one adapter
(no multichannel) and 1Gb adapters.  In those examples there wasn't a lot
of concurrency on the wire.

I did some experiments with increasing the readahead size (which causes
more than one async read to be issued by cifs.ko, but presumably does
still result in some 'dead time'), which seemed to help perf of some
sequential read examples (e.g. grep or cp) to some servers, but I didn't
try enough variety of server targets to feel confident about that change,
especially if netfs is coming.  E.g. a change I experimented with was:

	sb->s_bdi->ra_pages = cifs_sb->ctx->rsize / PAGE_SIZE

to:

	sb->s_bdi->ra_pages = 2 * cifs_sb->ctx->rsize / PAGE_SIZE

and it did seem to help a little.

I would expect that 8x1MB (ie trying to keep eight 1MB reads in flight,
which should keep the network mostly busy and not lead to too much dead
time on server, client or network) is 'good enough' in many readahead use
cases (at least for non-RDMA, non-multichannel on a slower network) to
keep the pipe full, and I would expect the performance to be similar to
the equivalent using 2MB reads (e.g. 4x2MB) and perhaps better than
2x4MB.  Below a 1MB I/O size on the wire I would expect to see
degradation due to packet processing and task switching overhead.
Would definitely be worth doing more experimentation here.
Steve French <smfrench@gmail.com> wrote:

> This (readahead behavior improvements in Linux, on single large file
> sequential read workloads like cp or grep) gets particularly interesting
> with SMB3 as multichannel becomes more common.  With one channel, having
> one readahead request pending on the network is suboptimal - but not as
> bad as when multichannel is negotiated.  Interestingly in most cases two
> network connections to the same server (different TCP sockets, but the
> same mount, even in cases with only one network adapter) can achieve
> better performance - but still significantly lags Windows (and probably
> other clients) as in Linux we don't keep multiple I/Os in flight at one
> time (unless different files are being read at the same time by
> different threads).

I think it should be relatively straightforward to make the
netfs_readahead() function generate multiple read requests.  If I wasn't
handed sufficient pages by the VM upfront to do two or more read
requests, I would need to do extra expansion.  There are a couple of ways
this could be done:

 (1) I could expand the readahead_control after fully starting a read
     request and then create another independent read request, and
     another for however many we want.

 (2) I could expand the readahead_control first to cover however many
     requests I'm going to generate, then chop it up into individual read
     requests.

However, generating larger requests means we're more likely to run into a
problem for the cache: if we can't allocate enough pages to fill out a
cache block, we don't have enough data to write to the cache.  Further,
if the pages are just unlocked and abandoned, readpage will be called to
read them individually - which means they likely won't get cached unless
the cache granularity is PAGE_SIZE.  But that's probably okay if ENOMEM
occurred.

There are some other considerations too:

 (*) I would need to query the filesystem to find out if I should create
     another request.
     The fs would have to keep track of how many I/O reqs are in flight
     and what the limit is.

 (*) How and where should the readahead triggers be emplaced?  I'm
     guessing that each block would need a trigger and that this should
     cause more requests to be generated until we hit the limit.

 (*) I would probably need to shuffle the request generation for the
     second and subsequent blocks in a single netfs_readahead() call off
     to a worker thread, because it'll probably be in a userspace
     kernel-side context, blocking the application from proceeding and
     consuming the pages already committed.

David
On Wed, Feb 24, 2021 at 01:32:02PM +0000, David Howells wrote:
> I think it should be relatively straightforward to make the
> netfs_readahead() function generate multiple read requests.  If I
> wasn't handed sufficient pages by the VM upfront to do two or more read
> requests, I would need to do extra expansion.  There are a couple of
> ways this could be done:

I don't think this is a job for netfs_readahead().  We can get into a
similar situation with SSDs or RAID arrays where ideally we would have
several outstanding readahead requests.

If your drive is connected through a 1Gbps link (eg PCIe gen 1 x1) and
has a latency of 10ms seek time, with one outstanding read, each read
needs to be 12.5MB in size in order to saturate the bus.  If the device
supports 128 outstanding commands, each read need only be 100kB.  We need
the core readahead code to handle this situation.

My suggestion for doing this is to send off an extra readahead request
every time we hit a !Uptodate page.  It looks something like this
(assuming the app is processing the data fast and always hits the
!Uptodate case) ...

 1.
    hit 0, set readahead size to 64kB, mark 32kB as Readahead,
    send read for 0-64kB
    wait for 0-64kB to complete
 2. hit 32kB (Readahead), no reads outstanding
    inc readahead size to 128kB, mark 128kB as Readahead,
    send read for 64-192kB
 3. hit 64kB (!Uptodate), one read outstanding
    mark 256kB as Readahead, send read for 192-320kB
    mark 384kB as Readahead, send read for 320-448kB
    wait for 64-192kB to complete
 4. hit 128kB (Readahead), two reads outstanding
    inc readahead size to 256kB, mark 576kB as Readahead,
    send read for 448-704kB
 5. hit 192kB (!Uptodate), three reads outstanding
    mark 832kB as Readahead, send read for 704-960kB
    mark 1088kB as Readahead, send read for 960-1216kB
    wait for 192-320kB to complete
 6. hit 256kB (Readahead), four reads outstanding
    mark 1344kB as Readahead, send read for 1216-1472kB
 7. hit 320kB (!Uptodate), five reads outstanding
    mark 1600kB as Readahead, send read for 1472-1728kB
    mark 1856kB as Readahead, send read for 1728-1984kB
    wait for 320-448kB to complete
 8. hit 384kB (Readahead), five reads outstanding
    mark 2112kB as Readahead, send read for 1984-2240kB
 9. hit 448kB (!Uptodate), six reads outstanding
    mark 2368kB as Readahead, send read for 2240-2496kB
    mark 2624kB as Readahead, send read for 2496-2752kB
    wait for 448-704kB to complete
10. hit 576kB (Readahead), seven reads outstanding
    mark 2880kB as Readahead, send read for 2752-3008kB
...

Once we stop hitting !Uptodate pages, we'll maintain the number of pages
marked as Readahead, and thus keep the number of readahead requests at
the level it determined was necessary to keep the link saturated.

I think we may need to put a parallelism cap in the bdi so that a device
which is just slow, instead of at the end of a long fat pipe, doesn't get
overwhelmed with requests.