Message ID | 20230726105953.843-1-jaco@uls.co.za (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fuse: enable larger read buffers for readdir. | expand |
Hi, Just to give some context, we've got maildir folders running on top of glusterfs. Without this a "medium" sized folder can take several minutes to go through readdir (getdents64) and each system call will return 14-18 entries at a time. These calls (as measures using strace -T -p) takes anywhere from around 150 micro-seconds to several milli-seconds. Inter-server round-trip time (as measured with ping) is generally 100-120 micro-seconds so 150 µs is probably a good lower limit. I've tested the below patch from a more remote location (7-8ms ping round-trip) and in spite of this massively increased latency a readdir iteration over a small folder (a few hundred entries) was still on par with the local case, usually even marginally better (10%), where on larger folders the difference became much more pronounced, with performance for the remote location at times being drastically better than for the "close" location. This has a few benefits overall that I see: 1. This enables glusterfs to internally read and process larger chunks. Whether this is happening I don't know (I have a separate ongoing discussion with the developers on that side to see what can be done inside of glusterfs itself to improve the mechanisms on the other side of fuse). 2. Fewer overall system calls (by an order of up to 32 fewer getdents64()) calls for userspace to get a full directory listing, in cases where there's 100k+ files in a folder this makes a HUGE difference (14-18 entries vs ~500 entries per call, so >5000 calls vs 200). 3. Fewer fuse messages being passed over the fuse link. One concerns I have is that I don't understand all of the caching and stuff going on, and I'm typically as per strace seeing less than 64KB of data being returned to userspace, so I'm not sure there is a direct correlation between the FUSE readdir size and that from glibc to kernel, and I'm slightly worried that the caching may now be broken due to smaller than READDIR_PAGES_SIZE records being cached, with that said, successive readdir() iterations from userspace provided identical results so I *think* this should be OK. Someone with a better understanding should please confirm. I made the option configurable (but max it out by default) in case memory constrained systems may need to drop the 128KiB value. I suspect the discrepancy may also relate to the way in which glusterfs merges directory listings from the underlying bricks where data is actually stored. Without this patch the remote system is orders of magnitude slower that the close together systems. Kind regards, Jaco On 2023/07/26 12:59, Jaco Kroon wrote: > Signed-off-by: Jaco Kroon <jaco@uls.co.za> > --- > fs/fuse/Kconfig | 16 ++++++++++++++++ > fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ > 2 files changed, 40 insertions(+), 18 deletions(-) > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig > index 038ed0b9aaa5..0783f9ee5cd3 100644 > --- a/fs/fuse/Kconfig > +++ b/fs/fuse/Kconfig > @@ -18,6 +18,22 @@ config FUSE_FS > If you want to develop a userspace FS, or if you want to use > a filesystem based on FUSE, answer Y or M. > > +config FUSE_READDIR_ORDER > + int > + range 0 5 > + default 5 > + help > + readdir performance varies greatly depending on the size of the read. > + Larger buffers results in larger reads, thus fewer reads and higher > + performance in return. > + > + You may want to reduce this value on seriously constrained memory > + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal. > + > + This value reprents the order of the number of pages to allocate (ie, > + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32 > + pages (128KiB). > + > config CUSE > tristate "Character device in Userspace support" > depends on FUSE_FS > diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c > index dc603479b30e..98c62b623240 100644 > --- a/fs/fuse/readdir.c > +++ b/fs/fuse/readdir.c > @@ -13,6 +13,12 @@ > #include <linux/pagemap.h> > #include <linux/highmem.h> > > +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER > +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER) > +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER) > +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1) > +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER) > + > static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx) > { > struct fuse_conn *fc = get_fuse_conn(dir); > @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file, > } > version = fi->rdc.version; > size = fi->rdc.size; > - offset = size & ~PAGE_MASK; > - index = size >> PAGE_SHIFT; > + offset = size & ~READDIR_PAGES_MASK; > + index = size >> READDIR_PAGES_SHIFT; > /* Dirent doesn't fit in current page? Jump to next page. */ > - if (offset + reclen > PAGE_SIZE) { > + if (offset + reclen > READDIR_PAGES_SIZE) { > index++; > offset = 0; > } > @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file, > } > memcpy(addr + offset, dirent, reclen); > kunmap_local(addr); > - fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen; > + fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen; > fi->rdc.pos = dirent->off; > unlock: > spin_unlock(&fi->rdc.lock); > @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos) > } > > fi->rdc.cached = true; > - end = ALIGN(fi->rdc.size, PAGE_SIZE); > + end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE); > spin_unlock(&fi->rdc.lock); > > /* truncate unused tail of cache */ > @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx) > struct fuse_mount *fm = get_fuse_mount(inode); > struct fuse_io_args ia = {}; > struct fuse_args_pages *ap = &ia.ap; > - struct fuse_page_desc desc = { .length = PAGE_SIZE }; > + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE }; > u64 attr_version = 0; > bool locked; > > - page = alloc_page(GFP_KERNEL); > + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER); > if (!page) > return -ENOMEM; > > plus = fuse_use_readdirplus(inode, ctx); > ap->args.out_pages = true; > - ap->num_pages = 1; > + ap->num_pages = READDIR_PAGES; > ap->pages = &page; > ap->descs = &desc; > if (plus) { > attr_version = fuse_get_attr_version(fm->fc); > - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE, > + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE, > FUSE_READDIRPLUS); > } else { > - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE, > + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE, > FUSE_READDIR); > } > locked = fuse_lock_inode(inode); > @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff, > void *addr, unsigned int size, > struct dir_context *ctx) > { > - unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK; > + unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK; > enum fuse_parse_result res = FOUND_NONE; > > WARN_ON(offset >= size); > @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx) > > WARN_ON(fi->rdc.size < ff->readdir.cache_off); > > - index = ff->readdir.cache_off >> PAGE_SHIFT; > + index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT; > > - if (index == (fi->rdc.size >> PAGE_SHIFT)) > - size = fi->rdc.size & ~PAGE_MASK; > + if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT)) > + size = fi->rdc.size & ~READDIR_PAGES_MASK; > else > - size = PAGE_SIZE; > + size = READDIR_PAGES_SIZE; > spin_unlock(&fi->rdc.lock); > > /* EOF? */ > - if ((ff->readdir.cache_off & ~PAGE_MASK) == size) > + if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size) > return 0; > > page = find_get_page_flags(file->f_mapping, index, > @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx) > if (res == FOUND_ALL) > return 0; > > - if (size == PAGE_SIZE) { > + if (size == READDIR_PAGES_SIZE) { > /* We hit end of page: skip to next page. */ > - ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE); > + ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE); > goto retry; > } >
On 7/26/23 12:59, Jaco Kroon wrote: > Signed-off-by: Jaco Kroon <jaco@uls.co.za> > --- > fs/fuse/Kconfig | 16 ++++++++++++++++ > fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ > 2 files changed, 40 insertions(+), 18 deletions(-) > > diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig > index 038ed0b9aaa5..0783f9ee5cd3 100644 > --- a/fs/fuse/Kconfig > +++ b/fs/fuse/Kconfig > @@ -18,6 +18,22 @@ config FUSE_FS > If you want to develop a userspace FS, or if you want to use > a filesystem based on FUSE, answer Y or M. > > +config FUSE_READDIR_ORDER > + int > + range 0 5 > + default 5 > + help > + readdir performance varies greatly depending on the size of the read. > + Larger buffers results in larger reads, thus fewer reads and higher > + performance in return. > + > + You may want to reduce this value on seriously constrained memory > + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal. > + > + This value reprents the order of the number of pages to allocate (ie, > + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32 > + pages (128KiB). > + I like the idea of a larger readdir size, but shouldn't that be a server/daemon/library decision which size to use, instead of kernel compile time? So should be part of FUSE_INIT negotiation? > config CUSE > tristate "Character device in Userspace support" > depends on FUSE_FS > diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c > index dc603479b30e..98c62b623240 100644 > --- a/fs/fuse/readdir.c > +++ b/fs/fuse/readdir.c > @@ -13,6 +13,12 @@ > #include <linux/pagemap.h> > #include <linux/highmem.h> > > +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER > +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER) > +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER) > +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1) > +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER) > + > static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx) > { > struct fuse_conn *fc = get_fuse_conn(dir); > @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file, > } > version = fi->rdc.version; > size = fi->rdc.size; > - offset = size & ~PAGE_MASK; > - index = size >> PAGE_SHIFT; > + offset = size & ~READDIR_PAGES_MASK; > + index = size >> READDIR_PAGES_SHIFT; > /* Dirent doesn't fit in current page? Jump to next page. */ > - if (offset + reclen > PAGE_SIZE) { > + if (offset + reclen > READDIR_PAGES_SIZE) { > index++; > offset = 0; > } > @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file, > } > memcpy(addr + offset, dirent, reclen); > kunmap_local(addr); > - fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen; > + fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen; > fi->rdc.pos = dirent->off; > unlock: > spin_unlock(&fi->rdc.lock); > @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos) > } > > fi->rdc.cached = true; > - end = ALIGN(fi->rdc.size, PAGE_SIZE); > + end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE); > spin_unlock(&fi->rdc.lock); > > /* truncate unused tail of cache */ > @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx) > struct fuse_mount *fm = get_fuse_mount(inode); > struct fuse_io_args ia = {}; > struct fuse_args_pages *ap = &ia.ap; > - struct fuse_page_desc desc = { .length = PAGE_SIZE }; > + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE }; > u64 attr_version = 0; > bool locked; > > - page = alloc_page(GFP_KERNEL); > + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER); I guess that should become folio alloc(), one way or the other. Now I think order 0 was chosen before to avoid risk of allocation failure. I guess it might work to try a large size and to fall back to 0 when that failed. Or fail back to the slower vmalloc. > if (!page) > return -ENOMEM; > > plus = fuse_use_readdirplus(inode, ctx); > ap->args.out_pages = true; > - ap->num_pages = 1; > + ap->num_pages = READDIR_PAGES; > ap->pages = &page; > ap->descs = &desc; > if (plus) { > attr_version = fuse_get_attr_version(fm->fc); > - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE, > + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE, > FUSE_READDIRPLUS); > } else { > - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE, > + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE, > FUSE_READDIR); > } > locked = fuse_lock_inode(inode); > @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff, > void *addr, unsigned int size, > struct dir_context *ctx) > { > - unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK; > + unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK; > enum fuse_parse_result res = FOUND_NONE; > > WARN_ON(offset >= size); > @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx) > > WARN_ON(fi->rdc.size < ff->readdir.cache_off); > > - index = ff->readdir.cache_off >> PAGE_SHIFT; > + index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT; > > - if (index == (fi->rdc.size >> PAGE_SHIFT)) > - size = fi->rdc.size & ~PAGE_MASK; > + if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT)) > + size = fi->rdc.size & ~READDIR_PAGES_MASK; > else > - size = PAGE_SIZE; > + size = READDIR_PAGES_SIZE; > spin_unlock(&fi->rdc.lock); > > /* EOF? */ > - if ((ff->readdir.cache_off & ~PAGE_MASK) == size) > + if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size) > return 0; > > page = find_get_page_flags(file->f_mapping, index, > @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx) > if (res == FOUND_ALL) > return 0; > > - if (size == PAGE_SIZE) { > + if (size == READDIR_PAGES_SIZE) { > /* We hit end of page: skip to next page. */ > - ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE); > + ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE); > goto retry; > } > Thanks, Bernd
On 7/26/23 03:59, Jaco Kroon wrote: > +config FUSE_READDIR_ORDER > + int > + range 0 5 > + default 5 > + help > + readdir performance varies greatly depending on the size of the read. > + Larger buffers results in larger reads, thus fewer reads and higher > + performance in return. > + > + You may want to reduce this value on seriously constrained memory > + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal. > + > + This value reprents the order of the number of pages to allocate (ie, represents (i.e., > + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32 > + pages (128KiB).
Hi, On 2023/07/26 15:53, Bernd Schubert wrote: > > > On 7/26/23 12:59, Jaco Kroon wrote: >> Signed-off-by: Jaco Kroon <jaco@uls.co.za> >> --- >> fs/fuse/Kconfig | 16 ++++++++++++++++ >> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ >> 2 files changed, 40 insertions(+), 18 deletions(-) >> >> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig >> index 038ed0b9aaa5..0783f9ee5cd3 100644 >> --- a/fs/fuse/Kconfig >> +++ b/fs/fuse/Kconfig >> @@ -18,6 +18,22 @@ config FUSE_FS >> If you want to develop a userspace FS, or if you want to use >> a filesystem based on FUSE, answer Y or M. >> +config FUSE_READDIR_ORDER >> + int >> + range 0 5 >> + default 5 >> + help >> + readdir performance varies greatly depending on the size of >> the read. >> + Larger buffers results in larger reads, thus fewer reads and >> higher >> + performance in return. >> + >> + You may want to reduce this value on seriously constrained >> memory >> + systems where 128KiB (assuming 4KiB pages) cache pages is >> not ideal. >> + >> + This value reprents the order of the number of pages to >> allocate (ie, >> + the shift value). A value of 0 is thus 1 page (4KiB) where >> 5 is 32 >> + pages (128KiB). >> + > > I like the idea of a larger readdir size, but shouldn't that be a > server/daemon/library decision which size to use, instead of kernel > compile time? So should be part of FUSE_INIT negotiation? Yes sure, but there still needs to be a default. And one page at a time doesn't cut it. -- snip -- >> - page = alloc_page(GFP_KERNEL); >> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER); > > I guess that should become folio alloc(), one way or the other. Now I > think order 0 was chosen before to avoid risk of allocation failure. I > guess it might work to try a large size and to fall back to 0 when > that failed. Or fail back to the slower vmalloc. If this varies then a bunch of other code will become somewhat more complex, especially if one alloc succeeds, and then a follow-up succeeds. I'm not familiar with the differences between the different mechanisms available for allocation. -- snip -- > Thanks, My pleasure, Jaco
On 7/26/23 17:26, Jaco Kroon wrote: > Hi, > > On 2023/07/26 15:53, Bernd Schubert wrote: >> >> >> On 7/26/23 12:59, Jaco Kroon wrote: >>> Signed-off-by: Jaco Kroon <jaco@uls.co.za> >>> --- >>> fs/fuse/Kconfig | 16 ++++++++++++++++ >>> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ >>> 2 files changed, 40 insertions(+), 18 deletions(-) >>> >>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig >>> index 038ed0b9aaa5..0783f9ee5cd3 100644 >>> --- a/fs/fuse/Kconfig >>> +++ b/fs/fuse/Kconfig >>> @@ -18,6 +18,22 @@ config FUSE_FS >>> If you want to develop a userspace FS, or if you want to use >>> a filesystem based on FUSE, answer Y or M. >>> +config FUSE_READDIR_ORDER >>> + int >>> + range 0 5 >>> + default 5 >>> + help >>> + readdir performance varies greatly depending on the size of >>> the read. >>> + Larger buffers results in larger reads, thus fewer reads and >>> higher >>> + performance in return. >>> + >>> + You may want to reduce this value on seriously constrained >>> memory >>> + systems where 128KiB (assuming 4KiB pages) cache pages is >>> not ideal. >>> + >>> + This value reprents the order of the number of pages to >>> allocate (ie, >>> + the shift value). A value of 0 is thus 1 page (4KiB) where >>> 5 is 32 >>> + pages (128KiB). >>> + >> >> I like the idea of a larger readdir size, but shouldn't that be a >> server/daemon/library decision which size to use, instead of kernel >> compile time? So should be part of FUSE_INIT negotiation? > > Yes sure, but there still needs to be a default. And one page at a time > doesn't cut it. > > -- snip -- > >>> - page = alloc_page(GFP_KERNEL); >>> + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER); >> >> I guess that should become folio alloc(), one way or the other. Now I >> think order 0 was chosen before to avoid risk of allocation failure. I >> guess it might work to try a large size and to fall back to 0 when >> that failed. Or fail back to the slower vmalloc. > > If this varies then a bunch of other code will become somewhat more > complex, especially if one alloc succeeds, and then a follow-up succeeds. Yeah, the better choice is kvmalloc/kvfree which handles it internally. > > I'm not familiar with the differences between the different mechanisms > available for allocation. > > -- snip -- > >> Thanks, > My pleasure, > Jaco
On 7/26/23 17:26, Jaco Kroon wrote: > Hi, > > On 2023/07/26 15:53, Bernd Schubert wrote: >> >> >> On 7/26/23 12:59, Jaco Kroon wrote: >>> Signed-off-by: Jaco Kroon <jaco@uls.co.za> >>> --- >>> fs/fuse/Kconfig | 16 ++++++++++++++++ >>> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ >>> 2 files changed, 40 insertions(+), 18 deletions(-) >>> >>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig >>> index 038ed0b9aaa5..0783f9ee5cd3 100644 >>> --- a/fs/fuse/Kconfig >>> +++ b/fs/fuse/Kconfig >>> @@ -18,6 +18,22 @@ config FUSE_FS >>> If you want to develop a userspace FS, or if you want to use >>> a filesystem based on FUSE, answer Y or M. >>> +config FUSE_READDIR_ORDER >>> + int >>> + range 0 5 >>> + default 5 >>> + help >>> + readdir performance varies greatly depending on the size of >>> the read. >>> + Larger buffers results in larger reads, thus fewer reads and >>> higher >>> + performance in return. >>> + >>> + You may want to reduce this value on seriously constrained >>> memory >>> + systems where 128KiB (assuming 4KiB pages) cache pages is >>> not ideal. >>> + >>> + This value reprents the order of the number of pages to >>> allocate (ie, >>> + the shift value). A value of 0 is thus 1 page (4KiB) where >>> 5 is 32 >>> + pages (128KiB). >>> + >> >> I like the idea of a larger readdir size, but shouldn't that be a >> server/daemon/library decision which size to use, instead of kernel >> compile time? So should be part of FUSE_INIT negotiation? > > Yes sure, but there still needs to be a default. And one page at a time > doesn't cut it. With FUSE_INIT userspace would make that decision, based on what kernel fuse suggests? process_init_reply() already handles other limits - I don't see why readdir max has to be compile time option. Maybe a module option to set the limit? Thanks, Bernd
On 7/26/23 10:45, Bernd Schubert wrote: > > On 7/26/23 17:26, Jaco Kroon wrote: >> Hi, >> >> On 2023/07/26 15:53, Bernd Schubert wrote: >>> >>> On 7/26/23 12:59, Jaco Kroon wrote: >>>> Signed-off-by: Jaco Kroon <jaco@uls.co.za> >>>> --- >>>> fs/fuse/Kconfig | 16 ++++++++++++++++ >>>> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ >>>> 2 files changed, 40 insertions(+), 18 deletions(-) >>>> >>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig >>>> index 038ed0b9aaa5..0783f9ee5cd3 100644 >>>> --- a/fs/fuse/Kconfig >>>> +++ b/fs/fuse/Kconfig >>>> @@ -18,6 +18,22 @@ config FUSE_FS >>>> If you want to develop a userspace FS, or if you want to use >>>> a filesystem based on FUSE, answer Y or M. >>>> +config FUSE_READDIR_ORDER >>>> + int >>>> + range 0 5 >>>> + default 5 >>>> + help >>>> + readdir performance varies greatly depending on the size of >>>> the read. >>>> + Larger buffers results in larger reads, thus fewer reads and >>>> higher >>>> + performance in return. >>>> + >>>> + You may want to reduce this value on seriously constrained >>>> memory >>>> + systems where 128KiB (assuming 4KiB pages) cache pages is >>>> not ideal. >>>> + >>>> + This value reprents the order of the number of pages to >>>> allocate (ie, >>>> + the shift value). A value of 0 is thus 1 page (4KiB) where >>>> 5 is 32 >>>> + pages (128KiB). >>>> + >>> I like the idea of a larger readdir size, but shouldn't that be a >>> server/daemon/library decision which size to use, instead of kernel >>> compile time? So should be part of FUSE_INIT negotiation? >> Yes sure, but there still needs to be a default. And one page at a time >> doesn't cut it. > With FUSE_INIT userspace would make that decision, based on what kernel > fuse suggests? process_init_reply() already handles other limits - I > don't see why readdir max has to be compile time option. Maybe a module > option to set the limit? > > Thanks, > Bernd I had similar question / comment. This seems to me to be more appropriately handed by the server via FUSE_INIT. And wouldn't "max" more easily be FUSE_MAX_MAX_PAGES? Is there a reason not to allow upwards of 256 pages sized readdir buffer?
Hi, On 2023/07/26 19:23, Antonio SJ Musumeci wrote: > On 7/26/23 10:45, Bernd Schubert wrote: >> On 7/26/23 17:26, Jaco Kroon wrote: >>> Hi, >>> >>> On 2023/07/26 15:53, Bernd Schubert wrote: >>>> On 7/26/23 12:59, Jaco Kroon wrote: >>>>> Signed-off-by: Jaco Kroon <jaco@uls.co.za> >>>>> --- >>>>> fs/fuse/Kconfig | 16 ++++++++++++++++ >>>>> fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ >>>>> 2 files changed, 40 insertions(+), 18 deletions(-) >>>>> >>>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig >>>>> index 038ed0b9aaa5..0783f9ee5cd3 100644 >>>>> --- a/fs/fuse/Kconfig >>>>> +++ b/fs/fuse/Kconfig >>>>> @@ -18,6 +18,22 @@ config FUSE_FS >>>>> If you want to develop a userspace FS, or if you want to use >>>>> a filesystem based on FUSE, answer Y or M. >>>>> +config FUSE_READDIR_ORDER >>>>> + int >>>>> + range 0 5 >>>>> + default 5 >>>>> + help >>>>> + readdir performance varies greatly depending on the size of >>>>> the read. >>>>> + Larger buffers results in larger reads, thus fewer reads and >>>>> higher >>>>> + performance in return. >>>>> + >>>>> + You may want to reduce this value on seriously constrained >>>>> memory >>>>> + systems where 128KiB (assuming 4KiB pages) cache pages is >>>>> not ideal. >>>>> + >>>>> + This value reprents the order of the number of pages to >>>>> allocate (ie, >>>>> + the shift value). A value of 0 is thus 1 page (4KiB) where >>>>> 5 is 32 >>>>> + pages (128KiB). >>>>> + >>>> I like the idea of a larger readdir size, but shouldn't that be a >>>> server/daemon/library decision which size to use, instead of kernel >>>> compile time? So should be part of FUSE_INIT negotiation? >>> Yes sure, but there still needs to be a default. And one page at a time >>> doesn't cut it. >> With FUSE_INIT userspace would make that decision, based on what kernel >> fuse suggests? process_init_reply() already handles other limits - I >> don't see why readdir max has to be compile time option. Maybe a module >> option to set the limit? >> >> Thanks, >> Bernd > I had similar question / comment. This seems to me to be more > appropriately handed by the server via FUSE_INIT. > > And wouldn't "max" more easily be FUSE_MAX_MAX_PAGES? Is there a reason > not to allow upwards of 256 pages sized readdir buffer? Will look into FUSE_INIT. The FUSE_INIT as I understand from what I've read has some expansion constraints or the structure is somehow negotiated. Older clients in other words that's not aware of the option will follow some default. For backwards compatibility that default should probably be 1 page. For performance reasons it makes sense that this limit be larger. glibc uses a 128KiB buffer for getdents64, so I'm not sure >128KiB here makes sense. Or if these two buffers are even directly related. Default to fc->max_pages (which defaults to 32 or 128KiB) if the user-space side doesn't understand the max_readdir_pages limit aspect of things? Assuming these limits should be set separately. I'm thinking piggy backing on fc->max_pages is just fine to be honest. For the sake of avoiding division and modulo operations in the cache, I'm thinking round-down max_pages to the closest power of two for the sake of sticking to binary operators rather than divisions and mods? Current patch introduces a definite memory leak either way. Tore through about 12GB of RAM in a matter of 20 minutes or so. Just going to patch it that way first, and then based on responses above will look into an alternative patch that does not depend on a compile-time option. Guessing __free_page should be a multi-page variant now. Thanks for all the feedback so far. Kind regards, Jaco
On Wed, Jul 26, 2023 at 08:25:48PM +0200, Jaco Kroon wrote: > glibc uses a 128KiB buffer for getdents64, so I'm not sure >128KiB here > makes sense. Or if these two buffers are even directly related. Definitely related. See how the dir_emit() works: if the entry doesn't fit in the userspace buffer, then ->readdir() returns. If entries remain in the fuse buffer at that time, they are simply discarded. Simply put, making the buffer too large is going to be a waste of resources and at some point will be slower than the single page buffer. Note: discarding received entries is the only possible action if caching is disabled. With caching enabled it would make sense to pre-fill the cache with the overflowed entries and use them in the next iteration. However the caching code is not prepared for using partial caches at the moment. The best strategy would be to find the optimal buffer size based on the size of the userspace buffer. Putting that info into struct dir_context doesn't sound too complicated... Here's a patch. It doesn't touch readdir. Simply setting the fuse buffer size to the userspace buffer size should work, the record sizes are similar (fuse's is slightly larger than libc's, so no overflow should ever happen). Thanks, Miklos --- fs/exportfs/expfs.c | 1 + fs/overlayfs/readdir.c | 12 +++++++++++- fs/readdir.c | 29 ++++++++++++++--------------- include/linux/fs.h | 7 +++++++ 4 files changed, 33 insertions(+), 16 deletions(-) --- a/fs/exportfs/expfs.c +++ b/fs/exportfs/expfs.c @@ -286,6 +286,7 @@ static int get_name(const struct path *p }; struct getdents_callback buffer = { .ctx.actor = filldir_one, + .ctx.count = INT_MAX, .name = name, }; --- a/fs/overlayfs/readdir.c +++ b/fs/overlayfs/readdir.c @@ -347,6 +347,7 @@ static int ovl_dir_read_merged(struct de struct path realpath; struct ovl_readdir_data rdd = { .ctx.actor = ovl_fill_merge, + .ctx.count = INT_MAX, .dentry = dentry, .list = list, .root = root, @@ -557,6 +558,7 @@ static int ovl_dir_read_impure(const str struct ovl_cache_entry *p, *n; struct ovl_readdir_data rdd = { .ctx.actor = ovl_fill_plain, + .ctx.count = INT_MAX, .list = list, .root = root, }; @@ -658,6 +660,7 @@ static bool ovl_fill_real(struct dir_con struct ovl_readdir_translate *rdt = container_of(ctx, struct ovl_readdir_translate, ctx); struct dir_context *orig_ctx = rdt->orig_ctx; + bool res; if (rdt->parent_ino && strcmp(name, "..") == 0) { ino = rdt->parent_ino; @@ -672,7 +675,10 @@ static bool ovl_fill_real(struct dir_con name, namelen, rdt->xinowarn); } - return orig_ctx->actor(orig_ctx, name, namelen, offset, ino, d_type); + res = orig_ctx->actor(orig_ctx, name, namelen, offset, ino, d_type); + ctx->count = orig_ctx->count; + + return res; } static bool ovl_is_impure_dir(struct file *file) @@ -699,6 +705,7 @@ static int ovl_iterate_real(struct file const struct ovl_layer *lower_layer = ovl_layer_lower(dir); struct ovl_readdir_translate rdt = { .ctx.actor = ovl_fill_real, + .ctx.count = ctx->count, .orig_ctx = ctx, .xinobits = ovl_xino_bits(ofs), .xinowarn = ovl_xino_warn(ofs), @@ -1058,6 +1065,7 @@ int ovl_check_d_type_supported(const str int err; struct ovl_readdir_data rdd = { .ctx.actor = ovl_check_d_type, + .ctx.count = INT_MAX, .d_type_supported = false, }; @@ -1079,6 +1087,7 @@ static int ovl_workdir_cleanup_recurse(s struct ovl_cache_entry *p; struct ovl_readdir_data rdd = { .ctx.actor = ovl_fill_plain, + .ctx.count = INT_MAX, .list = &list, }; bool incompat = false; @@ -1163,6 +1172,7 @@ int ovl_indexdir_cleanup(struct ovl_fs * struct ovl_cache_entry *p; struct ovl_readdir_data rdd = { .ctx.actor = ovl_fill_plain, + .ctx.count = INT_MAX, .list = &list, }; --- a/fs/readdir.c +++ b/fs/readdir.c @@ -184,6 +184,7 @@ SYSCALL_DEFINE3(old_readdir, unsigned in struct fd f = fdget_pos(fd); struct readdir_callback buf = { .ctx.actor = fillonedir, + .ctx.count = 1, /* Hint to fs: just one entry. */ .dirent = dirent }; @@ -215,7 +216,6 @@ struct getdents_callback { struct dir_context ctx; struct linux_dirent __user * current_dir; int prev_reclen; - int count; int error; }; @@ -234,7 +234,7 @@ static bool filldir(struct dir_context * if (unlikely(buf->error)) return false; buf->error = -EINVAL; /* only used if we fail.. */ - if (reclen > buf->count) + if (reclen > ctx->count) return false; d_ino = ino; if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) { @@ -259,7 +259,7 @@ static bool filldir(struct dir_context * buf->current_dir = (void __user *)dirent + reclen; buf->prev_reclen = reclen; - buf->count -= reclen; + ctx->count -= reclen; return true; efault_end: user_write_access_end(); @@ -274,7 +274,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, struct fd f; struct getdents_callback buf = { .ctx.actor = filldir, - .count = count, + .ctx.count = count, .current_dir = dirent }; int error; @@ -293,7 +293,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, if (put_user(buf.ctx.pos, &lastdirent->d_off)) error = -EFAULT; else - error = count - buf.count; + error = count - buf.ctx.count; } fdput_pos(f); return error; @@ -303,7 +303,6 @@ struct getdents_callback64 { struct dir_context ctx; struct linux_dirent64 __user * current_dir; int prev_reclen; - int count; int error; }; @@ -321,7 +320,7 @@ static bool filldir64(struct dir_context if (unlikely(buf->error)) return false; buf->error = -EINVAL; /* only used if we fail.. */ - if (reclen > buf->count) + if (reclen > ctx->count) return false; prev_reclen = buf->prev_reclen; if (prev_reclen && signal_pending(current)) @@ -341,7 +340,7 @@ static bool filldir64(struct dir_context buf->prev_reclen = reclen; buf->current_dir = (void __user *)dirent + reclen; - buf->count -= reclen; + ctx->count -= reclen; return true; efault_end: @@ -357,7 +356,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int struct fd f; struct getdents_callback64 buf = { .ctx.actor = filldir64, - .count = count, + .ctx.count = count, .current_dir = dirent }; int error; @@ -377,7 +376,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int if (put_user(d_off, &lastdirent->d_off)) error = -EFAULT; else - error = count - buf.count; + error = count - buf.ctx.count; } fdput_pos(f); return error; @@ -442,6 +441,7 @@ COMPAT_SYSCALL_DEFINE3(old_readdir, unsi struct fd f = fdget_pos(fd); struct compat_readdir_callback buf = { .ctx.actor = compat_fillonedir, + .ctx.count = 1, /* Hint to fs: just one entry. */ .dirent = dirent }; @@ -467,7 +467,6 @@ struct compat_getdents_callback { struct dir_context ctx; struct compat_linux_dirent __user *current_dir; int prev_reclen; - int count; int error; }; @@ -486,7 +485,7 @@ static bool compat_filldir(struct dir_co if (unlikely(buf->error)) return false; buf->error = -EINVAL; /* only used if we fail.. */ - if (reclen > buf->count) + if (reclen > ctx->count) return false; d_ino = ino; if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) { @@ -510,7 +509,7 @@ static bool compat_filldir(struct dir_co buf->prev_reclen = reclen; buf->current_dir = (void __user *)dirent + reclen; - buf->count -= reclen; + ctx->count -= reclen; return true; efault_end: user_write_access_end(); @@ -525,8 +524,8 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigne struct fd f; struct compat_getdents_callback buf = { .ctx.actor = compat_filldir, + .ctx.count = count, .current_dir = dirent, - .count = count }; int error; @@ -544,7 +543,7 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigne if (put_user(buf.ctx.pos, &lastdirent->d_off)) error = -EFAULT; else - error = count - buf.count; + error = count - buf.ctx.count; } fdput_pos(f); return error; --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1719,6 +1719,13 @@ typedef bool (*filldir_t)(struct dir_con struct dir_context { filldir_t actor; loff_t pos; + /* + * Filesystems MUST NOT MODIFY count, but may use as a hint: + * 0 unknown + * > 0 space in buffer (assume at least one entry) + * INT_MAX unlimited + */ + int count; }; /*
On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <jaco@uls.co.za> wrote: > Will look into FUSE_INIT. The FUSE_INIT as I understand from what I've > read has some expansion constraints or the structure is somehow > negotiated. Older clients in other words that's not aware of the option > will follow some default. For backwards compatibility that default > should probably be 1 page. For performance reasons it makes sense that > this limit be larger. Yes, might need this for backward compatibility. But perhaps a feature flag is enough and the readdir buf can be limited to fc->max_read. Thanks, Miklos
On 7/27/23 21:21, Miklos Szeredi wrote: > On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <jaco@uls.co.za> wrote: > >> Will look into FUSE_INIT. The FUSE_INIT as I understand from what I've >> read has some expansion constraints or the structure is somehow >> negotiated. Older clients in other words that's not aware of the option >> will follow some default. For backwards compatibility that default >> should probably be 1 page. For performance reasons it makes sense that >> this limit be larger. > > Yes, might need this for backward compatibility. But perhaps a > feature flag is enough and the readdir buf can be limited to > fc->max_read. fc->max_read is set by default to ~0 and only set to something else when the max_read mount option is given? So typically that is a large value (UINT_MAX)? Thanks, Bernd
On Thu, 27 Jul 2023 at 21:43, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > > On 7/27/23 21:21, Miklos Szeredi wrote: > > On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <jaco@uls.co.za> wrote: > > > >> Will look into FUSE_INIT. The FUSE_INIT as I understand from what I've > >> read has some expansion constraints or the structure is somehow > >> negotiated. Older clients in other words that's not aware of the option > >> will follow some default. For backwards compatibility that default > >> should probably be 1 page. For performance reasons it makes sense that > >> this limit be larger. > > > > Yes, might need this for backward compatibility. But perhaps a > > feature flag is enough and the readdir buf can be limited to > > fc->max_read. > > fc->max_read is set by default to ~0 and only set to something else when > the max_read mount option is given? So typically that is a large value > (UINT_MAX)? That's fine. It probably still makes sense to limit it to 128k, but with the ctx->count patch it would be naturally limited by the size of the userspace buffer. There's really no downside to enabling a large buffer, other than an unlikely regression in userspace. If server wants to return less entries, it still can. Unlike plain reads, filling the buffer to the fullest extent isn't required for readdir. So the buffer size calculation can be somthing like: init: #define FUSE_READDIR_MAX (128*1024) fc->max_readdir = PAGE_SIZE; if (flags & FUSE_LARGE_READDIR) fc->max_readdir = min(fc->max_read, FUSE_READDIR_MAX); [...] readdir: bufsize = min(fc->max_readdir, max(ctx->count, PAGE_SIZE)); Thanks, Miklos
diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig index 038ed0b9aaa5..0783f9ee5cd3 100644 --- a/fs/fuse/Kconfig +++ b/fs/fuse/Kconfig @@ -18,6 +18,22 @@ config FUSE_FS If you want to develop a userspace FS, or if you want to use a filesystem based on FUSE, answer Y or M. +config FUSE_READDIR_ORDER + int + range 0 5 + default 5 + help + readdir performance varies greatly depending on the size of the read. + Larger buffers results in larger reads, thus fewer reads and higher + performance in return. + + You may want to reduce this value on seriously constrained memory + systems where 128KiB (assuming 4KiB pages) cache pages is not ideal. + + This value reprents the order of the number of pages to allocate (ie, + the shift value). A value of 0 is thus 1 page (4KiB) where 5 is 32 + pages (128KiB). + config CUSE tristate "Character device in Userspace support" depends on FUSE_FS diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c index dc603479b30e..98c62b623240 100644 --- a/fs/fuse/readdir.c +++ b/fs/fuse/readdir.c @@ -13,6 +13,12 @@ #include <linux/pagemap.h> #include <linux/highmem.h> +#define READDIR_PAGES_ORDER CONFIG_FUSE_READDIR_ORDER +#define READDIR_PAGES (1 << READDIR_PAGES_ORDER) +#define READDIR_PAGES_SIZE (PAGE_SIZE << READDIR_PAGES_ORDER) +#define READDIR_PAGES_MASK (READDIR_PAGES_SIZE - 1) +#define READDIR_PAGES_SHIFT (PAGE_SHIFT + READDIR_PAGES_ORDER) + static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx) { struct fuse_conn *fc = get_fuse_conn(dir); @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file, } version = fi->rdc.version; size = fi->rdc.size; - offset = size & ~PAGE_MASK; - index = size >> PAGE_SHIFT; + offset = size & ~READDIR_PAGES_MASK; + index = size >> READDIR_PAGES_SHIFT; /* Dirent doesn't fit in current page? Jump to next page. */ - if (offset + reclen > PAGE_SIZE) { + if (offset + reclen > READDIR_PAGES_SIZE) { index++; offset = 0; } @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file, } memcpy(addr + offset, dirent, reclen); kunmap_local(addr); - fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen; + fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen; fi->rdc.pos = dirent->off; unlock: spin_unlock(&fi->rdc.lock); @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos) } fi->rdc.cached = true; - end = ALIGN(fi->rdc.size, PAGE_SIZE); + end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE); spin_unlock(&fi->rdc.lock); /* truncate unused tail of cache */ @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx) struct fuse_mount *fm = get_fuse_mount(inode); struct fuse_io_args ia = {}; struct fuse_args_pages *ap = &ia.ap; - struct fuse_page_desc desc = { .length = PAGE_SIZE }; + struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE }; u64 attr_version = 0; bool locked; - page = alloc_page(GFP_KERNEL); + page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER); if (!page) return -ENOMEM; plus = fuse_use_readdirplus(inode, ctx); ap->args.out_pages = true; - ap->num_pages = 1; + ap->num_pages = READDIR_PAGES; ap->pages = &page; ap->descs = &desc; if (plus) { attr_version = fuse_get_attr_version(fm->fc); - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE, + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE, FUSE_READDIRPLUS); } else { - fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE, + fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE, FUSE_READDIR); } locked = fuse_lock_inode(inode); @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff, void *addr, unsigned int size, struct dir_context *ctx) { - unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK; + unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK; enum fuse_parse_result res = FOUND_NONE; WARN_ON(offset >= size); @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx) WARN_ON(fi->rdc.size < ff->readdir.cache_off); - index = ff->readdir.cache_off >> PAGE_SHIFT; + index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT; - if (index == (fi->rdc.size >> PAGE_SHIFT)) - size = fi->rdc.size & ~PAGE_MASK; + if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT)) + size = fi->rdc.size & ~READDIR_PAGES_MASK; else - size = PAGE_SIZE; + size = READDIR_PAGES_SIZE; spin_unlock(&fi->rdc.lock); /* EOF? */ - if ((ff->readdir.cache_off & ~PAGE_MASK) == size) + if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size) return 0; page = find_get_page_flags(file->f_mapping, index, @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx) if (res == FOUND_ALL) return 0; - if (size == PAGE_SIZE) { + if (size == READDIR_PAGES_SIZE) { /* We hit end of page: skip to next page. */ - ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE); + ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE); goto retry; }
Signed-off-by: Jaco Kroon <jaco@uls.co.za> --- fs/fuse/Kconfig | 16 ++++++++++++++++ fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------ 2 files changed, 40 insertions(+), 18 deletions(-)