
smb3: add rasize mount parameter to improve performance of readahead

Message ID CAH2r5mvfMfgGimkmC9nQxvOMt=2E7S1=dA33MJaszy5NHE2zxQ@mail.gmail.com (mailing list archive)
State New, archived
Series smb3: add rasize mount parameter to improve performance of readahead

Commit Message

Steve French April 24, 2021, 7:27 p.m. UTC
In some cases readahead of more than the read size can help
(allowing parallel i/o of the read ahead, which can improve performance).

Using the buildbot test systems, changing to a 12MB rasize (read ahead
size) resulted in an average improvement of 14% to the Windows server
test target for the first 12 tests I tried (no multichannel).
Similarly, increasing the rasize to 12MB to Azure (this time with
multichannel, 4 channels) improved performance by 37%.

Note that Ceph had already introduced a mount parameter "rasize" to
allow controlling this.  Add mount parameter "rasize" to cifs.ko to
allow control of read ahead (rasize defaults to 4MB, which is typically
what readahead used to default to for the many servers whose rsize was
that size).

Signed-off-by: Steve French <stfrench@microsoft.com>
---
 fs/cifs/cifsfs.c     |  1 +
 fs/cifs/fs_context.c | 25 ++++++++++++++++++++++++-
 fs/cifs/fs_context.h |  2 ++
 3 files changed, 27 insertions(+), 1 deletion(-)
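
For reference, a mount using the new parameter would look something like
the following (the 12582912 value, i.e. 12MB, matches the experiments
described above; the server, share and username are placeholders):

      mount -t cifs //server/share /mnt/test -o username=testuser,rasize=12582912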

Comments

Steve French April 24, 2021, 8:08 p.m. UTC | #1
I need to rerun the perf numbers - I may have gotten some of them wrong.

Matthew Wilcox April 25, 2021, 2:09 a.m. UTC | #2
On Sat, Apr 24, 2021 at 02:27:11PM -0500, Steve French wrote:
> Using the buildbot test systems, this resulted in an average improvement
> of 14% to the Windows server test target for the first 12 tests I
> tried (no multichannel)
> changing to 12MB rasize (read ahead size).   Similarly increasing the
> rasize to 12MB to Azure (this time with multichannel, 4 channels)
> improved performance 37%
> 
> Note that Ceph had already introduced a mount parameter "rasize" to
> allow controlling this.  Add mount parameter "rasize" to cifs.ko to
> allow control of read ahead (rasize defaults to 4MB which is typically
> what it used to default to to the many servers whose rsize was that).

I think something was missing from this patch -- I see you parse it and
set it in the mount context, but I don't see where it then gets used to
actually affect readahead.
Steve French April 25, 2021, 2:36 a.m. UTC | #3
Yep - good catch.  It is missing part of my patch :(

Ugh

Will need to rerun and get real numbers
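
For context, a minimal sketch of what the missing wiring might look like
(assuming, as in some other filesystems, that per-mount readahead is
controlled through the superblock's backing_dev_info; the helper name and
placement below are illustrative, not the actual follow-up patch):

      /*
       * Illustrative only: apply the parsed rasize by sizing the
       * per-superblock readahead window.  ra_pages is counted in pages,
       * so the byte value from the mount option is converted here.
       */
      static void cifs_apply_rasize(struct super_block *sb,
                                    struct smb3_fs_context *ctx)
      {
              if (ctx->rasize)
                      sb->s_bdi->ra_pages = ctx->rasize / PAGE_SIZE;
      }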

Steve French April 25, 2021, 4:50 p.m. UTC | #4
Updated patch attached. It does seem to help - just tried an experiment

      dd if=/mnt/test/1GBfile of=/dev/null bs=1M count=1024

to the same server, same share and compared mounting with rasize=6MB
vs. default (1MB to Azure)

(rw,relatime,vers=3.1.1,cache=strict,username=linuxsmb3testsharesmc,uid=0,noforceuid,gid=0,noforcegid,addr=20.150.70.104,file_mode=0777,dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,mfsymlinks,nostrictsync,rsize=1048576,wsize=1048576,bsize=1048576,echo_interval=60,actimeo=1,multichannel,max_channels=2)

Got 391 MB/s with rasize=6MB, much faster than the 163 MB/s seen with
the default (which ends up as 1MB with the current code).

# dd if=/mnt/test/394.29520 of=/dev/null bs=1M count=1024 ; dd
if=/mnt/scratch/394.29520 of=/mnt/test/junk1 bs=1M count=1024 ;dd
if=/mnt/test/394.29520 of=/dev/null bs=1M count=1024 ; dd
if=/mnt/scratch/394.29520 of=/mnt/test/junk1 bs=1M count=1024 ;
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.06764 s, 264 MB/s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.5912 s, 85.3 MB/s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.0573 s, 351 MB/s
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.58283 s, 125 MB/s

Shyam Prasad N April 26, 2021, 4:52 a.m. UTC | #5
Agree with this. I was experimenting along similar lines on Friday.
It does show good improvements with sequential workloads.
For random read/write workloads, the user can use the default value.

Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>

Matthew Wilcox April 26, 2021, 11:54 a.m. UTC | #6
On Mon, Apr 26, 2021 at 10:22:27AM +0530, Shyam Prasad N wrote:
> Agree with this. Was experimenting on the similar lines on Friday.
> Does show good improvements with sequential workload.
> For random read/write workload, the user can use the default value.

For a random access workload, Linux's readahead shouldn't kick in.
Do you see a slowdown when using this patch with a random I/O workload?
Steve French April 27, 2021, 2:23 a.m. UTC | #7
On Mon, Apr 26, 2021 at 6:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 26, 2021 at 10:22:27AM +0530, Shyam Prasad N wrote:
> > Agree with this. Was experimenting on the similar lines on Friday.
> > Does show good improvements with sequential workload.
> > For random read/write workload, the user can use the default value.
>
> For a random access workload, Linux's readahead shouldn't kick in.
> Do you see a slowdown when using this patch with a random I/O workload?

I see few slowdowns in the 20 or so xfstests I have tried, but the best
value of rasize varies a lot by server type, network type and number of
channels.  I don't have enough data to set rasize well (I have
experimented with 4MB to 12MB but need more data).

In some experiments running 20+ typical xfstests, so far I see really
good improvement in multichannel to Azure when setting rasize to 4 times
the negotiated read size, but less than a 1% gain to a slower Windows
server running in a VM (without multichannel).  I also didn't see a gain
to localhost Samba in the simple examples I tried.  I saw about an 11%
gain to a typical Azure share without multichannel.  See some example
perf data below.  The numbers on the right are results with rasize set
to 4MB, i.e. 4 times the negotiated read size, vs. results using the
default of ra_pages = rsize = 1MB (on the left).

generic/001 113s ...  117s
generic/005 35s ...  34s
generic/006 567s ...  503s
generic/010 1s ...  1s
generic/011 620s ...  594s
generic/024 10s ...  10s
generic/028 5s ...  5s
generic/029 2s ...  2s
generic/030 3s ...  2s
generic/036 11s ...  11s
generic/069 7s ...  7s
generic/070 287s ...  270s
generic/080 3s ...  2s
generic/084 6s ...  6s
generic/086 1s ...  1s
generic/095 25s ...  23s
generic/098 1s ...  1s
generic/109 469s ...  328s
generic/117 219s ...  201s
generic/124 21s ...  20s
generic/125 63s ...  62s
generic/130 27s ...  24s
generic/132 24s ...  25s
Shyam Prasad N April 30, 2021, 10:49 a.m. UTC | #8
Hi Matthew,

Sorry for the late reply. Still catching up on my emails.
No. I did not see read-aheads ramping up with random reads, so I feel
we're okay there with or without this patch.

Although ideally, I feel that we (cifs.ko) should be able to read in
larger granular "chunks" even for small reads, in expectation that
surrounding offsets will be read soon.
This is especially useful if the read comes from something like a loop
device backed file.

Is there a way for a filesystem to indicate to the mm/readahead layer
to read in chunks of N bytes? Even for random workloads? Even if the
actual read is much smaller?
I did some code reading of mm/readahead.c and understand that if the
file is opened with fadvise flag of FADV_RANDOM, there's some logic to
read in chunks. But that seems to work only if the actual read size is
bigger.

Regards,
Shyam
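
As an aside, a userspace-side illustration of the fadvise-based hinting
mentioned above (this is an application-level sketch, not part of the cifs
patch; the helper name and chunk rounding are just for illustration):

      #define _XOPEN_SOURCE 600
      #include <fcntl.h>
      #include <unistd.h>

      /*
       * Prefetch a whole chunk around a small read, so that later small
       * reads to nearby offsets can be served from the page cache.
       */
      ssize_t read_with_prefetch(int fd, void *buf, size_t len, off_t off,
                                 off_t chunk)
      {
              posix_fadvise(fd, off - (off % chunk), chunk, POSIX_FADV_WILLNEED);
              return pread(fd, buf, len, off);
      }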

Matthew Wilcox April 30, 2021, 11:59 a.m. UTC | #9
On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote:
> Although ideally, I feel that we (cifs.ko) should be able to read in
> larger granular "chunks" even for small reads, in expectation that
> surrounding offsets will be read soon.

Why?  How is CIFS special and different from every other filesystem that
means you know what the access pattern of userspace is going to be better
than the generic VFS?

There are definitely shortcomings in the readahead code that should
be addressed, but in almost no circumstances is "read bigger chunks"
the right answer.
Shyam Prasad N April 30, 2021, 12:53 p.m. UTC | #10
I feel this is not just a problem with cifs. Any filesystem
(particularly a network filesystem involving higher latency I/O to
fetch data from the server) will have issues coping with a large
number of small random reads.
My point here is that if all reads to the server were of a minimum
"chunk" size (a contiguous range of pages), page cache could be
populated in chunks. Any future read to other pages in the chunk could
be satisfied from the page cache, thereby improving the overall
performance for similar workloads.

Regards,
Shyam

Steve French April 30, 2021, 7:22 p.m. UTC | #11
On Fri, Apr 30, 2021 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote:
> > Although ideally, I feel that we (cifs.ko) should be able to read in
> > larger granular "chunks" even for small reads, in expectation that
> > surrounding offsets will be read soon.
>
> Why?  How is CIFS special and different from every other filesystem that
> means you know what the access pattern of userspace is going to be better
> than the generic VFS?

In general small chunks are bad for network file systems since the
'cost' of sending a large read or write on the network (and in the call
stack on the client and server, with various task switches etc) is not
much more than that of a small one.  This can be different on a local
file system, with less latency between request and response and fewer
task switches involved on client and server.

There are tradeoffs between having multiple small chunks in flight
vs. fewer large chunks in flight, but the general idea is that, where
possible, it can be much faster to keep some requests in flight and
keep some activity going:
- on the network
- on the server side
- on the client side

to avoid "dead time" where nothing is happening on the network due to
latency, decryption on the client or server, etc.  For this reason it
makes sense that having four 1MB reads in flight (e.g. copying a file
with the new "rasize" mount parm set to e.g. 4MB for cifs.ko) can be
much faster than only having one 1MB read in flight at a time, and
much, much faster than using direct i/o, where some tools like "rsync"
use quite small i/o sizes (cp uses 1MB i/o for uncached i/o when
mounted to cifs or nfs, but rsync uses a much smaller size, which hurts
uncached performance greatly).
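
As a rough back-of-envelope for the in-flight argument (a bandwidth-delay
product; the 10ms round trip below is an assumed example value, not a
measured one):

      #include <stdio.h>

      int main(void)
      {
              const double rtt_s = 0.010;           /* assumed round trip time */
              const double mib = 1024.0 * 1024.0;

              /* throughput is capped by bytes-in-flight / round-trip time */
              printf("1MiB in flight: %.0f MiB/s cap\n", (1 * mib / rtt_s) / mib);
              printf("4MiB in flight: %.0f MiB/s cap\n", (4 * mib / rtt_s) / mib);
              return 0;
      }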
Matthew Wilcox May 1, 2021, 6:35 p.m. UTC | #12
On Fri, Apr 30, 2021 at 02:22:20PM -0500, Steve French wrote:
> On Fri, Apr 30, 2021 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote:
> > > Although ideally, I feel that we (cifs.ko) should be able to read in
> > > larger granular "chunks" even for small reads, in expectation that
> > > surrounding offsets will be read soon.
> >
> > Why?  How is CIFS special and different from every other filesystem that
> > means you know what the access pattern of userspace is going to be better
> > than the generic VFS?
> 
> In general small chunks are bad for network file systems since the 'cost' of
> sending a large read or write on the network (and in the call stack on
> the client
> and server, with various task switches etc) is not much more than a small one.
> This can be different on a local file system with less latency between request
> and response and fewer task switches involved on client and server.

Block-based filesystems are often, but not always, local.  For example,
we might be using nbd, iSCSI, FCoE or something similar to include
network latency between the filesystem and its storage.  Even without
those possibilities, a NAND SSD looks pretty similar.  Look at the
graphic titled "Idle Average Random Read Latency" on this page:

https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-technology/balancing-bandwidth-and-latency-article-brief.html

That seems to be showing 5us software latency for an SSD with 80us of
hardware latency.  That says to me we should have 16 outstanding reads
to a NAND SSD in order to keep the pipeline full.

Conversely, a network filesystem might be talking to localhost,
and seeing much lower latency compared to going across the data
center, between data centres or across the Pacific.

So, my point is that Linux's readahead is pretty poor.  Adding
hacks in for individual filesystems isn't a good route to fixing it,
and reading larger chunks has already passed the point of diminishing
returns for many workloads.

I laid it out in a bit more detail here:
https://lore.kernel.org/linux-fsdevel/20210224155121.GQ2858050@casper.infradead.org/
Steve French May 1, 2021, 6:47 p.m. UTC | #13
On Sat, May 1, 2021 at 1:35 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 30, 2021 at 02:22:20PM -0500, Steve French wrote:
> > On Fri, Apr 30, 2021 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote:
> > > > Although ideally, I feel that we (cifs.ko) should be able to read in
> > > > larger granular "chunks" even for small reads, in expectation that
> > > > surrounding offsets will be read soon.
> > >
> > > Why?  How is CIFS special and different from every other filesystem that
> > > means you know what the access pattern of userspace is going to be better
> > > than the generic VFS?
> >
> > In general small chunks are bad for network file systems since the 'cost' of
> > sending a large read or write on the network (and in the call stack on
> > the client
> > and server, with various task switches etc) is not much more than a small one.
> > This can be different on a local file system with less latency between request
> > and response and fewer task switches involved on client and server.
>
> Block-based filesystems are often, but not always local.  For example,
> we might be using nbd, iSCSI, FCoE or something similar to include
> network latency between the filesystem and its storage.  Even without
> those possibilities, a NAND SSD looks pretty similar.  Look at the
> graphic titled "Idle Average Random Read Latency" on this page:
>
> https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-technology/balancing-bandwidth-and-latency-article-brief.html
>
> That seems to be showing 5us software latency for an SSD with 80us of
> hardware latency.  That says to me we should have 16 outstanding reads
> to a NAND SSD in order to keep the pipeline full.
>
> Conversely, a network filesystem might be talking to localhost,
> and seeing much lower latency compared to going across the data
> center, between data centres or across the Pacific.
>
> So, my point is that Linux's readahead is pretty poor.  Adding
> hacks in for individual filesystems isn't a good route to fixing it,
> and reading larger chunks has already passed the point of dimnishing
> returns for many workloads.
>
> I laid it out in a bit more detail here:
> https://lore.kernel.org/linux-fsdevel/20210224155121.GQ2858050@casper.infradead.org/

Yes - those are good points.  Latencies vary the most for
network/cluster filesystems - by more than a factor of a million (from
localhost and RDMA (aka smbdirect), which can have very low latencies,
to some cloud workloads with longer latencies but high throughput, or
to servers where the files are 'offline' (archived or in the cloud),
where I have seen examples that could take minutes instead) - so in the
long run it is especially important for this to be better tunable.  In
the short term, at least having some tunables on the file system mount
(like Ceph's "rasize") makes sense.

Seems like there are three problems to solve:
- the things your note mentions about how to get the core readahead
code to ramp up a 'reasonable' number of I/Os, which are very important;
but also
- how to let a filesystem signal the readahead code to slow down or
allow partially fulfilling read ahead requests (in the SMB3 case this
can be done when 'credits' on the connection are starting to get lower;
one 'credit' is needed for each 64K of I/O - see the small illustration
after this list)
- how to let a filesystem signal the readahead code to temporarily
stop readahead (or cap readahead at one i/o of size = readsize).  This
could happen e.g. when the filesystem gets an "out of resources" error
message from the server, or when reconnect is triggered
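
To make the credit pressure in the second point concrete (using the
1-credit-per-64K figure above; the window sizes are just examples):

      /* credits consumed if an entire readahead window is issued at once */
      static unsigned int credits_for_window(unsigned int rasize_bytes)
      {
              return (rasize_bytes + 65535) / 65536;
      }
      /* e.g. a 4MB window needs 64 credits, a 12MB window needs 192 */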
Steve French May 1, 2021, 6:50 p.m. UTC | #14
Forgot to mention another obvious point ... the number of 'channels'
is dynamic for some filesystems.  For example, clustered SMB3 servers
can notify clients asynchronously when more 'channels' are added (more
network connections - e.g. in cloud environments or clustered high
performance server environments you can temporarily add more high
performance ethernet or RDMA adapters to the host), so it is quite
possible for the server to indicate to the client that more network
throughput is now available.  (Windows takes advantage of this, and
even polls to check for new interfaces every 10 minutes; the Linux
client does not yet, but it is something we will likely add soon.)


Patch

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 39f4889a036b..a1217682da1f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -649,6 +649,7 @@  cifs_show_options(struct seq_file *s, struct dentry *root)
  seq_printf(s, ",rsize=%u", cifs_sb->ctx->rsize);
  seq_printf(s, ",wsize=%u", cifs_sb->ctx->wsize);
  seq_printf(s, ",bsize=%u", cifs_sb->ctx->bsize);
+ seq_printf(s, ",rasize=%u", cifs_sb->ctx->rasize);
  if (tcon->ses->server->min_offload)
  seq_printf(s, ",esize=%u", tcon->ses->server->min_offload);
  seq_printf(s, ",echo_interval=%lu",
diff --git a/fs/cifs/fs_context.c b/fs/cifs/fs_context.c
index 74758e954035..5ce1f7b854e7 100644
--- a/fs/cifs/fs_context.c
+++ b/fs/cifs/fs_context.c
@@ -137,6 +137,7 @@  const struct fs_parameter_spec smb3_fs_parameters[] = {
  fsparam_u32("min_enc_offload", Opt_min_enc_offload),
  fsparam_u32("esize", Opt_min_enc_offload),
  fsparam_u32("bsize", Opt_blocksize),
+ fsparam_u32("rasize", Opt_rasize),
  fsparam_u32("rsize", Opt_rsize),
  fsparam_u32("wsize", Opt_wsize),
  fsparam_u32("actimeo", Opt_actimeo),
@@ -941,6 +942,26 @@  static int smb3_fs_context_parse_param(struct
fs_context *fc,
  ctx->bsize = result.uint_32;
  ctx->got_bsize = true;
  break;
+ case Opt_rasize:
+ /*
+ * readahead size realistically should never need to be
+ * less than 1M (CIFS_DEFAULT_IOSIZE) or greater than 32M
+ * (perhaps an exception should be considered for the case
+ * of a large number of channels
+ * when multichannel is negotiated) since that would lead
+ * to plenty of parallel I/O in flight to the server.
+ * Note that smaller read ahead sizes would
+ * hurt performance of common tools like cp and scp
+ * which often trigger sequential i/o with read ahead
+ */
+ if ((result.uint_32 > (8 * SMB3_DEFAULT_IOSIZE)) ||
+     (result.uint_32 < CIFS_DEFAULT_IOSIZE)) {
+ cifs_errorf(fc, "%s: Invalid rasize %d vs. %d\n",
+ __func__, result.uint_32, SMB3_DEFAULT_IOSIZE);
+ goto cifs_parse_mount_err;
+ }
+ ctx->rasize = result.uint_32;
+ break;
  case Opt_rsize:
  ctx->rsize = result.uint_32;
  ctx->got_rsize = true;
@@ -1377,7 +1398,9 @@  int smb3_init_fs_context(struct fs_context *fc)
  ctx->cred_uid = current_uid();
  ctx->linux_uid = current_uid();
  ctx->linux_gid = current_gid();
- ctx->bsize = 1024 * 1024; /* can improve cp performance significantly */
+ /* By default 4MB read ahead size, 1MB block size */
+ ctx->bsize = CIFS_DEFAULT_IOSIZE; /* can improve cp performance
significantly */
+ ctx->rasize = SMB3_DEFAULT_IOSIZE; /* can improve sequential read ahead */

  /*
  * default to SFM style remapping of seven reserved characters
diff --git a/fs/cifs/fs_context.h b/fs/cifs/fs_context.h
index 56d7a75e2390..2a71c8e411ac 100644
--- a/fs/cifs/fs_context.h
+++ b/fs/cifs/fs_context.h
@@ -120,6 +120,7 @@  enum cifs_param {
  Opt_dirmode,
  Opt_min_enc_offload,
  Opt_blocksize,
+ Opt_rasize,
  Opt_rsize,
  Opt_wsize,
  Opt_actimeo,
@@ -235,6 +236,7 @@  struct smb3_fs_context {
  /* reuse existing guid for multichannel */
  u8 client_guid[SMB2_CLIENT_GUID_SIZE];
  unsigned int bsize;
+ unsigned int rasize;
  unsigned int rsize;
  unsigned int wsize;
  unsigned int min_offload;
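
With the cifs_show_options hunk above, a mount made with the new option
should echo it back in /proc/mounts alongside rsize, wsize and bsize,
along the lines of the following (server, path and value are illustrative):

      //server/share /mnt/test cifs rw,...,rsize=1048576,wsize=1048576,bsize=1048576,rasize=12582912,...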