Message ID | CAH2r5mvfMfgGimkmC9nQxvOMt=2E7S1=dA33MJaszy5NHE2zxQ@mail.gmail.com (mailing list archive)
---|---
State | New, archived
Series | smb3: add rasize mount parameter to improve performance of readahead
I need to rerun the perf numbers - I may have gotten some of them incorrectly. On Sat, Apr 24, 2021 at 2:27 PM Steve French <smfrench@gmail.com> wrote: > > In some cases readahead of more than the read size can help > (to allow parallel i/o of read ahead which can improve performance). > > Using the buildbot test systems, this resulted in an average improvement > of 14% to the Windows server test target for the first 12 tests I > tried (no multichannel) > changing to 12MB rasize (read ahead size). Similarly increasing the > rasize to 12MB to Azure (this time with multichannel, 4 channels) > improved performance 37% > > Note that Ceph had already introduced a mount parameter "rasize" to > allow controlling this. Add mount parameter "rasize" to cifs.ko to > allow control of read ahead (rasize defaults to 4MB which is typically > what it used to default to to the many servers whose rsize was that). > > Signed-off-by: Steve French <stfrench@microsoft.com> > --- > fs/cifs/cifsfs.c | 1 + > fs/cifs/fs_context.c | 25 ++++++++++++++++++++++++- > fs/cifs/fs_context.h | 2 ++ > 3 files changed, 27 insertions(+), 1 deletion(-) > > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c > index 39f4889a036b..a1217682da1f 100644 > --- a/fs/cifs/cifsfs.c > +++ b/fs/cifs/cifsfs.c > @@ -649,6 +649,7 @@ cifs_show_options(struct seq_file *s, struct dentry *root) > seq_printf(s, ",rsize=%u", cifs_sb->ctx->rsize); > seq_printf(s, ",wsize=%u", cifs_sb->ctx->wsize); > seq_printf(s, ",bsize=%u", cifs_sb->ctx->bsize); > + seq_printf(s, ",rasize=%u", cifs_sb->ctx->rasize); > if (tcon->ses->server->min_offload) > seq_printf(s, ",esize=%u", tcon->ses->server->min_offload); > seq_printf(s, ",echo_interval=%lu", > diff --git a/fs/cifs/fs_context.c b/fs/cifs/fs_context.c > index 74758e954035..5ce1f7b854e7 100644 > --- a/fs/cifs/fs_context.c > +++ b/fs/cifs/fs_context.c > @@ -137,6 +137,7 @@ const struct fs_parameter_spec smb3_fs_parameters[] = { > fsparam_u32("min_enc_offload", Opt_min_enc_offload), > fsparam_u32("esize", Opt_min_enc_offload), > fsparam_u32("bsize", Opt_blocksize), > + fsparam_u32("rasize", Opt_rasize), > fsparam_u32("rsize", Opt_rsize), > fsparam_u32("wsize", Opt_wsize), > fsparam_u32("actimeo", Opt_actimeo), > @@ -941,6 +942,26 @@ static int smb3_fs_context_parse_param(struct > fs_context *fc, > ctx->bsize = result.uint_32; > ctx->got_bsize = true; > break; > + case Opt_rasize: > + /* > + * readahead size realistically should never need to be > + * less than 1M (CIFS_DEFAULT_IOSIZE) or greater than 32M > + * (perhaps an exception should be considered in the > + * for the case of a large number of channels > + * when multichannel is negotiated) since that would lead > + * to plenty of parallel I/O in flight to the server. > + * Note that smaller read ahead sizes would > + * hurt performance of common tools like cp and scp > + * which often trigger sequential i/o with read ahead > + */ > + if ((result.uint_32 > (8 * SMB3_DEFAULT_IOSIZE)) || > + (result.uint_32 < CIFS_DEFAULT_IOSIZE)) { > + cifs_errorf(fc, "%s: Invalid rasize %d vs. 
%d\n", > + __func__, result.uint_32, SMB3_DEFAULT_IOSIZE); > + goto cifs_parse_mount_err; > + } > + ctx->rasize = result.uint_32; > + break; > case Opt_rsize: > ctx->rsize = result.uint_32; > ctx->got_rsize = true; > @@ -1377,7 +1398,9 @@ int smb3_init_fs_context(struct fs_context *fc) > ctx->cred_uid = current_uid(); > ctx->linux_uid = current_uid(); > ctx->linux_gid = current_gid(); > - ctx->bsize = 1024 * 1024; /* can improve cp performance significantly */ > + /* By default 4MB read ahead size, 1MB block size */ > + ctx->bsize = CIFS_DEFAULT_IOSIZE; /* can improve cp performance > significantly */ > + ctx->rasize = SMB3_DEFAULT_IOSIZE; /* can improve sequential read ahead */ > > /* > * default to SFM style remapping of seven reserved characters > diff --git a/fs/cifs/fs_context.h b/fs/cifs/fs_context.h > index 56d7a75e2390..2a71c8e411ac 100644 > --- a/fs/cifs/fs_context.h > +++ b/fs/cifs/fs_context.h > @@ -120,6 +120,7 @@ enum cifs_param { > Opt_dirmode, > Opt_min_enc_offload, > Opt_blocksize, > + Opt_rasize, > Opt_rsize, > Opt_wsize, > Opt_actimeo, > @@ -235,6 +236,7 @@ struct smb3_fs_context { > /* reuse existing guid for multichannel */ > u8 client_guid[SMB2_CLIENT_GUID_SIZE]; > unsigned int bsize; > + unsigned int rasize; > unsigned int rsize; > unsigned int wsize; > unsigned int min_offload; > > -- > Thanks, > > Steve
On Sat, Apr 24, 2021 at 02:27:11PM -0500, Steve French wrote: > Using the buildbot test systems, this resulted in an average improvement > of 14% to the Windows server test target for the first 12 tests I > tried (no multichannel) > changing to 12MB rasize (read ahead size). Similarly increasing the > rasize to 12MB to Azure (this time with multichannel, 4 channels) > improved performance 37% > > Note that Ceph had already introduced a mount parameter "rasize" to > allow controlling this. Add mount parameter "rasize" to cifs.ko to > allow control of read ahead (rasize defaults to 4MB which is typically > what it used to default to to the many servers whose rsize was that). I think something was missing from this patch -- I see you parse it and set it in the mount context, but I don't see where it then gets used to actually affect readahead.
Yep - good catch. It is missing part of my patch :( Ugh Will need to rerun and get real numbers On Sat, Apr 24, 2021 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Sat, Apr 24, 2021 at 02:27:11PM -0500, Steve French wrote: > > Using the buildbot test systems, this resulted in an average improvement > > of 14% to the Windows server test target for the first 12 tests I > > tried (no multichannel) > > changing to 12MB rasize (read ahead size). Similarly increasing the > > rasize to 12MB to Azure (this time with multichannel, 4 channels) > > improved performance 37% > > > > Note that Ceph had already introduced a mount parameter "rasize" to > > allow controlling this. Add mount parameter "rasize" to cifs.ko to > > allow control of read ahead (rasize defaults to 4MB which is typically > > what it used to default to to the many servers whose rsize was that). > > I think something was missing from this patch -- I see you parse it and > set it in the mount context, but I don't see where it then gets used to > actually affect readahead.
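(For context on the missing hunk: generic readahead is sized by the backing device's ra_pages, which file_ra_state_init() copies into each open file's f_ra, so the parsed rasize has to reach sb->s_bdi during superblock setup. Below is a minimal sketch of that wiring; the helper name cifs_apply_rasize() is hypothetical and this is not the attached updated patch.)

/*
 * Hypothetical sketch, not the attached patch: cifs_apply_rasize() is a
 * made-up helper assumed to run during superblock setup in fs/cifs,
 * where super_setup_bdi() is normally called.  It only shows where
 * rasize has to land for the generic readahead code to see it.
 */
static int cifs_apply_rasize(struct super_block *sb, struct smb3_fs_context *ctx)
{
	int rc;

	rc = super_setup_bdi(sb);	/* give the sb its own backing_dev_info */
	if (rc)
		return rc;

	/*
	 * rasize is in bytes, ra_pages is in PAGE_SIZE units;
	 * file_ra_state_init() copies ra_pages into each file's f_ra at
	 * open time, which is what sizes the readahead window.
	 */
	sb->s_bdi->ra_pages = ctx->rasize / PAGE_SIZE;
	return 0;
}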
Updated patch attached. It does seem to help - just tried an experiment dd if=/mnt/test/1GBfile of=/dev/null bs=1M count=1024 to the same server, same share and compared mounting with rasize=6MB vs. default (1MB to Azure) (rw,relatime,vers=3.1.1,cache=strict,username=linuxsmb3testsharesmc,uid=0,noforceuid,gid=0,noforcegid,addr=20.150.70.104,file_mode=0777,dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,mfsymlinks,nostrictsync,rsize=1048576,wsize=1048576,bsize=1048576,echo_interval=60,actimeo=1,multichannel,max_channels=2) Got 391 MB/s with rasize=6MB, much faster than default (which ends up as 1MB with current code) of 163MB/s # dd if=/mnt/test/394.29520 of=/dev/null bs=1M count=1024 ; dd if=/mnt/scratch/394.29520 of=/mnt/test/junk1 bs=1M count=1024 ;dd if=/mnt/test/394.29520 of=/dev/null bs=1M count=1024 ; dd if=/mnt/scratch/394.29520 of=/mnt/test/junk1 bs=1M count=1024 ; 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.06764 s, 264 MB/s 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.5912 s, 85.3 MB/s 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.0573 s, 351 MB/s 1024+0 records in 1024+0 records out 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.58283 s, 125 MB/s On Sat, Apr 24, 2021 at 9:36 PM Steve French <smfrench@gmail.com> wrote: > > Yep - good catch. It is missing part of my patch :( > > Ugh > > Will need to rerun and get real numbers > > On Sat, Apr 24, 2021 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Sat, Apr 24, 2021 at 02:27:11PM -0500, Steve French wrote: > > > Using the buildbot test systems, this resulted in an average improvement > > > of 14% to the Windows server test target for the first 12 tests I > > > tried (no multichannel) > > > changing to 12MB rasize (read ahead size). Similarly increasing the > > > rasize to 12MB to Azure (this time with multichannel, 4 channels) > > > improved performance 37% > > > > > > Note that Ceph had already introduced a mount parameter "rasize" to > > > allow controlling this. Add mount parameter "rasize" to cifs.ko to > > > allow control of read ahead (rasize defaults to 4MB which is typically > > > what it used to default to to the many servers whose rsize was that). > > > > I think something was missing from this patch -- I see you parse it and > > set it in the mount context, but I don't see where it then gets used to > > actually affect readahead. > > > > -- > Thanks, > > Steve
Agree with this. Was experimenting on the similar lines on Friday. Does show good improvements with sequential workload. For random read/write workload, the user can use the default value. Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> On Sun, Apr 25, 2021 at 10:20 PM Steve French <smfrench@gmail.com> wrote: > > Updated patch attached. It does seem to help - just tried an experiment > > dd if=/mnt/test/1GBfile of=/dev/null bs=1M count=1024 > > to the same server, same share and compared mounting with rasize=6MB > vs. default (1MB to Azure) > > (rw,relatime,vers=3.1.1,cache=strict,username=linuxsmb3testsharesmc,uid=0,noforceuid,gid=0,noforcegid,addr=20.150.70.104,file_mode=0777,dir_mode=0777,soft,persistenthandles,nounix,serverino,mapposix,mfsymlinks,nostrictsync,rsize=1048576,wsize=1048576,bsize=1048576,echo_interval=60,actimeo=1,multichannel,max_channels=2) > > Got 391 MB/s with rasize=6MB, much faster than default (which ends up > as 1MB with current code) of 163MB/s > > > > > > > # dd if=/mnt/test/394.29520 of=/dev/null bs=1M count=1024 ; dd > if=/mnt/scratch/394.29520 of=/mnt/test/junk1 bs=1M count=1024 ;dd > if=/mnt/test/394.29520 of=/dev/null bs=1M count=1024 ; dd > if=/mnt/scratch/394.29520 of=/mnt/test/junk1 bs=1M count=1024 ; > 1024+0 records in > 1024+0 records out > 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.06764 s, 264 MB/s > 1024+0 records in > 1024+0 records out > 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 12.5912 s, 85.3 MB/s > 1024+0 records in > 1024+0 records out > 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.0573 s, 351 MB/s > 1024+0 records in > 1024+0 records out > 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.58283 s, 125 MB/s > > On Sat, Apr 24, 2021 at 9:36 PM Steve French <smfrench@gmail.com> wrote: > > > > Yep - good catch. It is missing part of my patch :( > > > > Ugh > > > > Will need to rerun and get real numbers > > > > On Sat, Apr 24, 2021 at 9:10 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Sat, Apr 24, 2021 at 02:27:11PM -0500, Steve French wrote: > > > > Using the buildbot test systems, this resulted in an average improvement > > > > of 14% to the Windows server test target for the first 12 tests I > > > > tried (no multichannel) > > > > changing to 12MB rasize (read ahead size). Similarly increasing the > > > > rasize to 12MB to Azure (this time with multichannel, 4 channels) > > > > improved performance 37% > > > > > > > > Note that Ceph had already introduced a mount parameter "rasize" to > > > > allow controlling this. Add mount parameter "rasize" to cifs.ko to > > > > allow control of read ahead (rasize defaults to 4MB which is typically > > > > what it used to default to to the many servers whose rsize was that). > > > > > > I think something was missing from this patch -- I see you parse it and > > > set it in the mount context, but I don't see where it then gets used to > > > actually affect readahead. > > > > > > > > -- > > Thanks, > > > > Steve > > > > -- > Thanks, > > Steve
On Mon, Apr 26, 2021 at 10:22:27AM +0530, Shyam Prasad N wrote: > Agree with this. Was experimenting on the similar lines on Friday. > Does show good improvements with sequential workload. > For random read/write workload, the user can use the default value. For a random access workload, Linux's readahead shouldn't kick in. Do you see a slowdown when using this patch with a random I/O workload?
On Mon, Apr 26, 2021 at 6:55 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 26, 2021 at 10:22:27AM +0530, Shyam Prasad N wrote:
> > Agree with this. Was experimenting on the similar lines on Friday.
> > Does show good improvements with sequential workload.
> > For random read/write workload, the user can use the default value.
>
> For a random access workload, Linux's readahead shouldn't kick in.
> Do you see a slowdown when using this patch with a random I/O workload?

I see few slowdowns in the 20 or so xfstests I have tried, but the value of
rasize varies a lot by server type, network type and number of channels. I
don't have enough data to set rasize well (I have experimented with 4MB to
12MB but need more data).

In some experiments running 20+ typical xfstests, so far I see really good
improvement in multichannel to Azure when setting rasize to 4 times the
negotiated read size. But I saw less than a 1% gain to a slower Windows
server running in a VM (without multichannel). I also didn't see a gain to
localhost Samba in the simple examples I tried. I saw about an 11% gain to a
typical Azure share without multichannel.

See some example perf data below. The numbers on the right are with rasize
set to 4MB (i.e. 4 times the negotiated read size); the numbers on the left
use the default of ra_pages = 1MB (the rsize).

generic/001 113s ... 117s
generic/005 35s ... 34s
generic/006 567s ... 503s
generic/010 1s ... 1s
generic/011 620s ... 594s
generic/024 10s ... 10s
generic/028 5s ... 5s
generic/029 2s ... 2s
generic/030 3s ... 2s
generic/036 11s ... 11s
generic/069 7s ... 7s
generic/070 287s ... 270s
generic/080 3s ... 2s
generic/084 6s ... 6s
generic/086 1s ... 1s
generic/095 25s ... 23s
generic/098 1s ... 1s
generic/109 469s ... 328s
generic/117 219s ... 201s
generic/124 21s ... 20s
generic/125 63s ... 62s
generic/130 27s ... 24s
generic/132 24s ... 25s
Hi Matthew, Sorry for the late reply. Still catching up on my emails. No. I did not see read-aheads ramping up with random reads, so I feel we're okay there with or without this patch. Although ideally, I feel that we (cifs.ko) should be able to read in larger granular "chunks" even for small reads, in expectation that surrounding offsets will be read soon. This is especially useful if the read comes from something like a loop device backed file. Is there a way for a filesystem to indicate to the mm/readahead layer to read in chunks of N bytes? Even for random workloads? Even if the actual read is much smaller? I did some code reading of mm/readahead.c and understand that if the file is opened with fadvise flag of FADV_RANDOM, there's some logic to read in chunks. But that seems to work only if the actual read size is bigger. Regards, Shyam On Mon, Apr 26, 2021 at 5:25 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Apr 26, 2021 at 10:22:27AM +0530, Shyam Prasad N wrote: > > Agree with this. Was experimenting on the similar lines on Friday. > > Does show good improvements with sequential workload. > > For random read/write workload, the user can use the default value. > > For a random access workload, Linux's readahead shouldn't kick in. > Do you see a slowdown when using this patch with a random I/O workload? >
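(One userspace-side way to approximate what Shyam describes, i.e. populating the page cache in larger chunks around a small read, is an explicit POSIX_FADV_WILLNEED hint, which starts asynchronous readahead of a given range. A small illustration; the path and the 4MB chunk size are made-up examples, and this is not cifs.ko code.)

/*
 * Userspace illustration only: prefetch a larger chunk around a small
 * read with POSIX_FADV_WILLNEED so neighbouring offsets are already in
 * the page cache when they are touched.
 */
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)		/* 4MB, matching the rasize default */

int main(void)
{
	char buf[4096];
	int fd = open("/mnt/test/1GBfile", O_RDONLY);	/* example path */

	if (fd < 0)
		return 1;

	/* hint: start asynchronous readahead of the whole 4MB chunk */
	posix_fadvise(fd, 0, CHUNK, POSIX_FADV_WILLNEED);

	/* small read; the surrounding chunk is already being fetched */
	if (read(fd, buf, sizeof(buf)) < 0) {
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}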
On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote: > Although ideally, I feel that we (cifs.ko) should be able to read in > larger granular "chunks" even for small reads, in expectation that > surrounding offsets will be read soon. Why? How is CIFS special and different from every other filesystem that means you know what the access pattern of userspace is going to be better than the generic VFS? There are definitely shortcomings in the readahead code that should be addressed, but in almost no circumstances is "read bigger chunks" the right answer.
I feel this is not just a problem with cifs. Any filesystem (particularly network filesystems involving higher latency I/O to fetch data from the server) will have issues coping up with a large number of small random reads. My point here is that if all reads to the server were of a minimum "chunk" size (a contiguous range of pages), page cache could be populated in chunks. Any future read to other pages in the chunk could be satisfied from the page cache, thereby improving the overall performance for similar workloads. Regards, Shyam On Fri, Apr 30, 2021 at 5:30 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote: > > Although ideally, I feel that we (cifs.ko) should be able to read in > > larger granular "chunks" even for small reads, in expectation that > > surrounding offsets will be read soon. > > Why? How is CIFS special and different from every other filesystem that > means you know what the access pattern of userspace is going to be better > than the generic VFS? > > There are definitely shortcomings in the readahead code that should > be addressed, but in almost no circumstances is "read bigger chunks" > the right answer.
On Fri, Apr 30, 2021 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote:
> > Although ideally, I feel that we (cifs.ko) should be able to read in
> > larger granular "chunks" even for small reads, in expectation that
> > surrounding offsets will be read soon.
>
> Why? How is CIFS special and different from every other filesystem that
> means you know what the access pattern of userspace is going to be better
> than the generic VFS?

In general small chunks are bad for network file systems, since the 'cost' of
sending a large read or write on the network (and in the call stack on the
client and server, with various task switches etc.) is not much more than
that of a small one. This can be different on a local file system, with less
latency between request and response and fewer task switches involved on
client and server.

There are tradeoffs between having multiple small chunks in flight vs. fewer
large chunks in flight, but the general idea is that it can be much faster to
keep some requests in flight and keep some activity going:
- on the network
- on the server side
- on the client side
to avoid "dead time" where nothing is happening on the network due to latency,
decryption on the client or server, etc.

For this reason it makes sense that having multiple (e.g. four) 1MB reads in
flight (e.g. copying a file with the new "rasize" mount parm set to 4MB for
cifs.ko) can be much faster than only having one 1MB read in flight at a
time, and much, much faster than using direct i/o, where some tools use quite
small i/o sizes (cp uses 1MB i/o in the uncached case when mounted to cifs
and nfs, but rsync uses a much smaller size, which hurts uncached performance
greatly).
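(To put rough numbers on "multiple reads in flight": with a 1MB negotiated read size and a 4MB readahead window, and using the usual SMB3 convention of one credit per 64KB of I/O purely for illustration, a toy calculation looks like this.)

#include <stdio.h>

int main(void)
{
	unsigned int rsize  = 1024 * 1024;		/* negotiated read size: 1MB */
	unsigned int rasize = 4 * 1024 * 1024;		/* readahead window: 4MB */
	unsigned int credit_unit = 65536;		/* one SMB3 credit per 64KB */

	unsigned int reads_in_flight  = rasize / rsize;		/* 4 parallel 1MB reads */
	unsigned int credits_per_read = rsize / credit_unit;	/* 16 credits each */

	printf("%u reads in flight, consuming %u credits\n",
	       reads_in_flight, reads_in_flight * credits_per_read);
	return 0;
}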
On Fri, Apr 30, 2021 at 02:22:20PM -0500, Steve French wrote: > On Fri, Apr 30, 2021 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote: > > > Although ideally, I feel that we (cifs.ko) should be able to read in > > > larger granular "chunks" even for small reads, in expectation that > > > surrounding offsets will be read soon. > > > > Why? How is CIFS special and different from every other filesystem that > > means you know what the access pattern of userspace is going to be better > > than the generic VFS? > > In general small chunks are bad for network file systems since the 'cost' of > sending a large read or write on the network (and in the call stack on > the client > and server, with various task switches etc) is not much more than a small one. > This can be different on a local file system with less latency between request > and response and fewer task switches involved on client and server. Block-based filesystems are often, but not always local. For example, we might be using nbd, iSCSI, FCoE or something similar to include network latency between the filesystem and its storage. Even without those possibilities, a NAND SSD looks pretty similar. Look at the graphic titled "Idle Average Random Read Latency" on this page: https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-technology/balancing-bandwidth-and-latency-article-brief.html That seems to be showing 5us software latency for an SSD with 80us of hardware latency. That says to me we should have 16 outstanding reads to a NAND SSD in order to keep the pipeline full. Conversely, a network filesystem might be talking to localhost, and seeing much lower latency compared to going across the data center, between data centres or across the Pacific. So, my point is that Linux's readahead is pretty poor. Adding hacks in for individual filesystems isn't a good route to fixing it, and reading larger chunks has already passed the point of dimnishing returns for many workloads. I laid it out in a bit more detail here: https://lore.kernel.org/linux-fsdevel/20210224155121.GQ2858050@casper.infradead.org/
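(For reference, the arithmetic behind the "16 outstanding reads" figure above, reading the 5us software latency as the per-request issue cost and the 80us hardware latency as the time each request spends in flight; this is an interpretation, not a quote.)

\[
\text{requests in flight} \approx \frac{\text{device latency}}{\text{software overhead per request}}
= \frac{80\,\mu s}{5\,\mu s} = 16
\]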
On Sat, May 1, 2021 at 1:35 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Fri, Apr 30, 2021 at 02:22:20PM -0500, Steve French wrote: > > On Fri, Apr 30, 2021 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote: > > > > Although ideally, I feel that we (cifs.ko) should be able to read in > > > > larger granular "chunks" even for small reads, in expectation that > > > > surrounding offsets will be read soon. > > > > > > Why? How is CIFS special and different from every other filesystem that > > > means you know what the access pattern of userspace is going to be better > > > than the generic VFS? > > > > In general small chunks are bad for network file systems since the 'cost' of > > sending a large read or write on the network (and in the call stack on > > the client > > and server, with various task switches etc) is not much more than a small one. > > This can be different on a local file system with less latency between request > > and response and fewer task switches involved on client and server. > > Block-based filesystems are often, but not always local. For example, > we might be using nbd, iSCSI, FCoE or something similar to include > network latency between the filesystem and its storage. Even without > those possibilities, a NAND SSD looks pretty similar. Look at the > graphic titled "Idle Average Random Read Latency" on this page: > > https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-technology/balancing-bandwidth-and-latency-article-brief.html > > That seems to be showing 5us software latency for an SSD with 80us of > hardware latency. That says to me we should have 16 outstanding reads > to a NAND SSD in order to keep the pipeline full. > > Conversely, a network filesystem might be talking to localhost, > and seeing much lower latency compared to going across the data > center, between data centres or across the Pacific. > > So, my point is that Linux's readahead is pretty poor. Adding > hacks in for individual filesystems isn't a good route to fixing it, > and reading larger chunks has already passed the point of dimnishing > returns for many workloads. > > I laid it out in a bit more detail here: > https://lore.kernel.org/linux-fsdevel/20210224155121.GQ2858050@casper.infradead.org/ Yes - those are good points. Because the latencies vary the most for network/cluster filesystems which can vary by more than a million times greater (from localhost and RDMA (aka smbdirect) which can be very low latencies, to some cloud workloads which have longer latencies by high throughput, or to servers where the files are 'offline' (archived or in the cloud) where I have seen some examples where it could take minutes instead) - it is especially important for this in the long run to be better tunable. In the short term, at least having some tuneables on the file system mount (like Ceph's "rapages") makes sense. Seems like there are three problems to solve: - the things your note mentions about how to get the core readahead code to ramp up a 'reasonable' number of I/Os are very important but also - how to let a filesystem signal the readahead code to slow down or allow partially fulfilling read ahead requests (in the SMB3 case this can be done when 'credits' on the connection (one 'credit' is needed for each 64K of I/O) are starting to get lower) - how to let a filesystem signal the readahead code to temporarily stop readahead (or max readahead at one i/o of size = readsize). 
This could happen e.g. when the filesystem gets an "out of resources" error message from the server, or when reconnect is triggered
Forgot to mention another obvious point ... the number of 'channels' is dynamic for some filesystems. For example clustered SMB3 servers can notify the clients asynchronously when more 'channels' are added (more network connections added - e.g. in cloud environments or clustered high performance server environments you can temporarily add more high performance ethernet or RDMA adapters temporarily to the host) so it is quite possible that the server can indicate to the client that more network throughput is now available (Windows takes advantage of this, and even polls to check for new interfaces every 10 minutes, but the Linux client does not yet - but it is something we will likely add soon). On Sat, May 1, 2021 at 1:47 PM Steve French <smfrench@gmail.com> wrote: > > On Sat, May 1, 2021 at 1:35 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Fri, Apr 30, 2021 at 02:22:20PM -0500, Steve French wrote: > > > On Fri, Apr 30, 2021 at 7:00 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Fri, Apr 30, 2021 at 04:19:27PM +0530, Shyam Prasad N wrote: > > > > > Although ideally, I feel that we (cifs.ko) should be able to read in > > > > > larger granular "chunks" even for small reads, in expectation that > > > > > surrounding offsets will be read soon. > > > > > > > > Why? How is CIFS special and different from every other filesystem that > > > > means you know what the access pattern of userspace is going to be better > > > > than the generic VFS? > > > > > > In general small chunks are bad for network file systems since the 'cost' of > > > sending a large read or write on the network (and in the call stack on > > > the client > > > and server, with various task switches etc) is not much more than a small one. > > > This can be different on a local file system with less latency between request > > > and response and fewer task switches involved on client and server. > > > > Block-based filesystems are often, but not always local. For example, > > we might be using nbd, iSCSI, FCoE or something similar to include > > network latency between the filesystem and its storage. Even without > > those possibilities, a NAND SSD looks pretty similar. Look at the > > graphic titled "Idle Average Random Read Latency" on this page: > > > > https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-technology/balancing-bandwidth-and-latency-article-brief.html > > > > That seems to be showing 5us software latency for an SSD with 80us of > > hardware latency. That says to me we should have 16 outstanding reads > > to a NAND SSD in order to keep the pipeline full. > > > > Conversely, a network filesystem might be talking to localhost, > > and seeing much lower latency compared to going across the data > > center, between data centres or across the Pacific. > > > > So, my point is that Linux's readahead is pretty poor. Adding > > hacks in for individual filesystems isn't a good route to fixing it, > > and reading larger chunks has already passed the point of dimnishing > > returns for many workloads. > > > > I laid it out in a bit more detail here: > > https://lore.kernel.org/linux-fsdevel/20210224155121.GQ2858050@casper.infradead.org/ > > Yes - those are good points. 
Because the latencies vary the most for > network/cluster filesystems which can vary by more than a million > times greater (from localhost and RDMA (aka smbdirect) which can be > very low latencies, to some cloud workloads which have longer > latencies by high throughput, or to servers where the files are > 'offline' (archived or in the cloud) where I have seen some examples > where it could take minutes instead) - it is especially important for > this in the long run to be better tunable. In the short term, at > least having some tuneables on the file system mount (like Ceph's > "rapages") makes sense. > > Seems like there are three problems to solve: > - the things your note mentions about how to get the core readahead > code to ramp up a 'reasonable' number of I/Os are very important > but also > - how to let a filesystem signal the readahead code to slow down or > allow partially fulfilling read ahead requests (in the SMB3 case this > can be done when 'credits' on the connection (one 'credit' is needed > for each 64K of I/O) are starting to get lower) > - how to let a filesystem signal the readahead code to temporarily > stop readahead (or max readahead at one i/o of size = readsize). This > could happen e.g. when the filesystem gets an "out of resources" error > message from the server, or when reconnect is triggered > > > -- > Thanks, > > Steve
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 39f4889a036b..a1217682da1f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -649,6 +649,7 @@ cifs_show_options(struct seq_file *s, struct dentry *root)
 	seq_printf(s, ",rsize=%u", cifs_sb->ctx->rsize);
 	seq_printf(s, ",wsize=%u", cifs_sb->ctx->wsize);
 	seq_printf(s, ",bsize=%u", cifs_sb->ctx->bsize);
+	seq_printf(s, ",rasize=%u", cifs_sb->ctx->rasize);
 	if (tcon->ses->server->min_offload)
 		seq_printf(s, ",esize=%u", tcon->ses->server->min_offload);
 	seq_printf(s, ",echo_interval=%lu",
diff --git a/fs/cifs/fs_context.c b/fs/cifs/fs_context.c
index 74758e954035..5ce1f7b854e7 100644
--- a/fs/cifs/fs_context.c
+++ b/fs/cifs/fs_context.c
@@ -137,6 +137,7 @@ const struct fs_parameter_spec smb3_fs_parameters[] = {
 	fsparam_u32("min_enc_offload", Opt_min_enc_offload),
 	fsparam_u32("esize", Opt_min_enc_offload),
 	fsparam_u32("bsize", Opt_blocksize),
+	fsparam_u32("rasize", Opt_rasize),
 	fsparam_u32("rsize", Opt_rsize),
 	fsparam_u32("wsize", Opt_wsize),
 	fsparam_u32("actimeo", Opt_actimeo),
@@ -941,6 +942,26 @@ static int smb3_fs_context_parse_param(struct fs_context *fc,
 		ctx->bsize = result.uint_32;
 		ctx->got_bsize = true;
 		break;
+	case Opt_rasize:
+		/*
+		 * readahead size realistically should never need to be
+		 * less than 1M (CIFS_DEFAULT_IOSIZE) or greater than 32M
+		 * (perhaps an exception should be considered in the
+		 * for the case of a large number of channels
+		 * when multichannel is negotiated) since that would lead
+		 * to plenty of parallel I/O in flight to the server.
+		 * Note that smaller read ahead sizes would
+		 * hurt performance of common tools like cp and scp
+		 * which often trigger sequential i/o with read ahead
+		 */
+		if ((result.uint_32 > (8 * SMB3_DEFAULT_IOSIZE)) ||
+		    (result.uint_32 < CIFS_DEFAULT_IOSIZE)) {
+			cifs_errorf(fc, "%s: Invalid rasize %d vs. %d\n",
+				__func__, result.uint_32, SMB3_DEFAULT_IOSIZE);
+			goto cifs_parse_mount_err;
+		}
+		ctx->rasize = result.uint_32;
+		break;
 	case Opt_rsize:
 		ctx->rsize = result.uint_32;
 		ctx->got_rsize = true;
@@ -1377,7 +1398,9 @@ int smb3_init_fs_context(struct fs_context *fc)
 	ctx->cred_uid = current_uid();
 	ctx->linux_uid = current_uid();
 	ctx->linux_gid = current_gid();
-	ctx->bsize = 1024 * 1024; /* can improve cp performance significantly */
+	/* By default 4MB read ahead size, 1MB block size */
+	ctx->bsize = CIFS_DEFAULT_IOSIZE; /* can improve cp performance significantly */
+	ctx->rasize = SMB3_DEFAULT_IOSIZE; /* can improve sequential read ahead */
 
 	/*
 	 * default to SFM style remapping of seven reserved characters
diff --git a/fs/cifs/fs_context.h b/fs/cifs/fs_context.h
index 56d7a75e2390..2a71c8e411ac 100644
--- a/fs/cifs/fs_context.h
+++ b/fs/cifs/fs_context.h
@@ -120,6 +120,7 @@ enum cifs_param {
 	Opt_dirmode,
 	Opt_min_enc_offload,
 	Opt_blocksize,
+	Opt_rasize,
 	Opt_rsize,
 	Opt_wsize,
 	Opt_actimeo,
@@ -235,6 +236,7 @@ struct smb3_fs_context {
 	/* reuse existing guid for multichannel */
 	u8 client_guid[SMB2_CLIENT_GUID_SIZE];
 	unsigned int bsize;
+	unsigned int rasize;
 	unsigned int rsize;
 	unsigned int wsize;
 	unsigned int min_offload;
In some cases readahead of more than the read size can help (to allow
parallel i/o of readahead, which can improve performance).

Using the buildbot test systems, this resulted in an average improvement of
14% to the Windows server test target for the first 12 tests I tried (no
multichannel) when changing to a 12MB rasize (read ahead size). Similarly,
increasing the rasize to 12MB to Azure (this time with multichannel, 4
channels) improved performance by 37%.

Note that Ceph had already introduced a mount parameter "rasize" to allow
controlling this. Add mount parameter "rasize" to cifs.ko to allow control
of read ahead (rasize defaults to 4MB, which is typically what it used to
default to for the many servers whose rsize was that).

Signed-off-by: Steve French <stfrench@microsoft.com>
---
 fs/cifs/cifsfs.c     |  1 +
 fs/cifs/fs_context.c | 25 ++++++++++++++++++++++++-
 fs/cifs/fs_context.h |  2 ++
 3 files changed, 27 insertions(+), 1 deletion(-)