
fuse: enable larger read buffers for readdir.

Message ID 20230726105953.843-1-jaco@uls.co.za (mailing list archive)
State New, archived

Commit Message

Jaco Kroon July 26, 2023, 10:59 a.m. UTC
Signed-off-by: Jaco Kroon <jaco@uls.co.za>
---
 fs/fuse/Kconfig   | 16 ++++++++++++++++
 fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
 2 files changed, 40 insertions(+), 18 deletions(-)

Comments

Jaco Kroon July 26, 2023, 11:43 a.m. UTC | #1
Hi,

Just to give some context: we've got maildir folders running on top of
glusterfs.  Without this patch a "medium" sized folder can take several
minutes to go through readdir (getdents64), and each system call returns
only 14-18 entries at a time.  These calls (as measured using strace
-T -p) take anywhere from around 150 microseconds to several
milliseconds.  Inter-server round-trip time (as measured with ping) is
generally 100-120 microseconds, so 150 µs is probably a good lower limit.

I've tested the patch below from a more remote location (7-8 ms ping
round-trip), and in spite of this massively increased latency a readdir
iteration over a small folder (a few hundred entries) was still on par
with the local case, usually even marginally better (10%).  On larger
folders the difference became much more pronounced, with performance for
the remote location at times being drastically better than for the
"close" location.

This has a few benefits overall that I see:

1.  This enables glusterfs to internally read and process larger 
chunks.  Whether this is happening I don't know (I have a separate 
ongoing discussion with the developers on that side to see what can be 
done inside of glusterfs itself to improve the mechanisms on the other 
side of fuse).

2.  Fewer system calls overall: up to 32x fewer getdents64() calls for
userspace to get a full directory listing.  In cases where there are
100k+ files in a folder this makes a HUGE difference (14-18 entries vs
~500 entries per call, so >5000 calls vs 200).

3.  Fewer fuse messages being passed over the fuse link.

One concern I have is that I don't understand all of the caching going
on.  As per strace I'm typically seeing less than 64KB of data being
returned to userspace, so I'm not sure there is a direct correlation
between the FUSE readdir size and the size glibc requests from the
kernel, and I'm slightly worried that the caching may now be broken due
to records smaller than READDIR_PAGES_SIZE being cached.  With that
said, successive readdir() iterations from userspace produced identical
results, so I *think* this should be OK.  Someone with a better
understanding should please confirm.

I made the option configurable (but maxed out by default) in case
memory-constrained systems need to drop the 128KiB value.

I suspect the discrepancy may also relate to the way in which glusterfs 
merges directory listings from the underlying bricks where data is 
actually stored.

Without this patch the remote system is orders of magnitude slower than
the close-together systems.

Kind regards,
Jaco


On 2023/07/26 12:59, Jaco Kroon wrote:
> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
> ---
>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>   2 files changed, 40 insertions(+), 18 deletions(-)
>
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 038ed0b9aaa5..0783f9ee5cd3 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -18,6 +18,22 @@ config FUSE_FS
>   	  If you want to develop a userspace FS, or if you want to use
>   	  a filesystem based on FUSE, answer Y or M.
>   
> +config FUSE_READDIR_ORDER
> +	int
> +	range 0 5
> +	default 5
> +	help
> +		readdir performance varies greatly depending on the size of the read.
> +		Larger buffers results in larger reads, thus fewer reads and higher
> +		performance in return.
> +
> +		You may want to reduce this value on seriously constrained memory
> +		systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> +		This value reprents the order of the number of pages to allocate (ie,
> +		the shift value).  A value of 0 is thus 1 page (4KiB) where 5 is 32
> +		pages (128KiB).
> +
>   config CUSE
>   	tristate "Character device in Userspace support"
>   	depends on FUSE_FS
> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> index dc603479b30e..98c62b623240 100644
> --- a/fs/fuse/readdir.c
> +++ b/fs/fuse/readdir.c
> @@ -13,6 +13,12 @@
>   #include <linux/pagemap.h>
>   #include <linux/highmem.h>
>   
> +#define READDIR_PAGES_ORDER		CONFIG_FUSE_READDIR_ORDER
> +#define READDIR_PAGES			(1 << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_SIZE		(PAGE_SIZE << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_MASK		(READDIR_PAGES_SIZE - 1)
> +#define READDIR_PAGES_SHIFT		(PAGE_SHIFT + READDIR_PAGES_ORDER)
> +
>   static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>   {
>   	struct fuse_conn *fc = get_fuse_conn(dir);
> @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file,
>   	}
>   	version = fi->rdc.version;
>   	size = fi->rdc.size;
> -	offset = size & ~PAGE_MASK;
> -	index = size >> PAGE_SHIFT;
> +	offset = size & ~READDIR_PAGES_MASK;
> +	index = size >> READDIR_PAGES_SHIFT;
>   	/* Dirent doesn't fit in current page?  Jump to next page. */
> -	if (offset + reclen > PAGE_SIZE) {
> +	if (offset + reclen > READDIR_PAGES_SIZE) {
>   		index++;
>   		offset = 0;
>   	}
> @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file,
>   	}
>   	memcpy(addr + offset, dirent, reclen);
>   	kunmap_local(addr);
> -	fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen;
> +	fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen;
>   	fi->rdc.pos = dirent->off;
>   unlock:
>   	spin_unlock(&fi->rdc.lock);
> @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos)
>   	}
>   
>   	fi->rdc.cached = true;
> -	end = ALIGN(fi->rdc.size, PAGE_SIZE);
> +	end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE);
>   	spin_unlock(&fi->rdc.lock);
>   
>   	/* truncate unused tail of cache */
> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
>   	struct fuse_mount *fm = get_fuse_mount(inode);
>   	struct fuse_io_args ia = {};
>   	struct fuse_args_pages *ap = &ia.ap;
> -	struct fuse_page_desc desc = { .length = PAGE_SIZE };
> +	struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
>   	u64 attr_version = 0;
>   	bool locked;
>   
> -	page = alloc_page(GFP_KERNEL);
> +	page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>   	if (!page)
>   		return -ENOMEM;
>   
>   	plus = fuse_use_readdirplus(inode, ctx);
>   	ap->args.out_pages = true;
> -	ap->num_pages = 1;
> +	ap->num_pages = READDIR_PAGES;
>   	ap->pages = &page;
>   	ap->descs = &desc;
>   	if (plus) {
>   		attr_version = fuse_get_attr_version(fm->fc);
> -		fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> +		fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
>   				    FUSE_READDIRPLUS);
>   	} else {
> -		fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> +		fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
>   				    FUSE_READDIR);
>   	}
>   	locked = fuse_lock_inode(inode);
> @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
>   					       void *addr, unsigned int size,
>   					       struct dir_context *ctx)
>   {
> -	unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
> +	unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK;
>   	enum fuse_parse_result res = FOUND_NONE;
>   
>   	WARN_ON(offset >= size);
> @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
>   
>   	WARN_ON(fi->rdc.size < ff->readdir.cache_off);
>   
> -	index = ff->readdir.cache_off >> PAGE_SHIFT;
> +	index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT;
>   
> -	if (index == (fi->rdc.size >> PAGE_SHIFT))
> -		size = fi->rdc.size & ~PAGE_MASK;
> +	if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT))
> +		size = fi->rdc.size & ~READDIR_PAGES_MASK;
>   	else
> -		size = PAGE_SIZE;
> +		size = READDIR_PAGES_SIZE;
>   	spin_unlock(&fi->rdc.lock);
>   
>   	/* EOF? */
> -	if ((ff->readdir.cache_off & ~PAGE_MASK) == size)
> +	if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size)
>   		return 0;
>   
>   	page = find_get_page_flags(file->f_mapping, index,
> @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
>   	if (res == FOUND_ALL)
>   		return 0;
>   
> -	if (size == PAGE_SIZE) {
> +	if (size == READDIR_PAGES_SIZE) {
>   		/* We hit end of page: skip to next page. */
> -		ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE);
> +		ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE);
>   		goto retry;
>   	}
>
Bernd Schubert July 26, 2023, 1:53 p.m. UTC | #2
On 7/26/23 12:59, Jaco Kroon wrote:
> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
> ---
>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>   2 files changed, 40 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
> index 038ed0b9aaa5..0783f9ee5cd3 100644
> --- a/fs/fuse/Kconfig
> +++ b/fs/fuse/Kconfig
> @@ -18,6 +18,22 @@ config FUSE_FS
>   	  If you want to develop a userspace FS, or if you want to use
>   	  a filesystem based on FUSE, answer Y or M.
>   
> +config FUSE_READDIR_ORDER
> +	int
> +	range 0 5
> +	default 5
> +	help
> +		readdir performance varies greatly depending on the size of the read.
> +		Larger buffers results in larger reads, thus fewer reads and higher
> +		performance in return.
> +
> +		You may want to reduce this value on seriously constrained memory
> +		systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> +		This value reprents the order of the number of pages to allocate (ie,
> +		the shift value).  A value of 0 is thus 1 page (4KiB) where 5 is 32
> +		pages (128KiB).
> +

I like the idea of a larger readdir size, but shouldn't that be a 
server/daemon/library decision which size to use, instead of kernel 
compile time? So should be part of FUSE_INIT negotiation?

>   config CUSE
>   	tristate "Character device in Userspace support"
>   	depends on FUSE_FS
> diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
> index dc603479b30e..98c62b623240 100644
> --- a/fs/fuse/readdir.c
> +++ b/fs/fuse/readdir.c
> @@ -13,6 +13,12 @@
>   #include <linux/pagemap.h>
>   #include <linux/highmem.h>
>   
> +#define READDIR_PAGES_ORDER		CONFIG_FUSE_READDIR_ORDER
> +#define READDIR_PAGES			(1 << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_SIZE		(PAGE_SIZE << READDIR_PAGES_ORDER)
> +#define READDIR_PAGES_MASK		(READDIR_PAGES_SIZE - 1)
> +#define READDIR_PAGES_SHIFT		(PAGE_SHIFT + READDIR_PAGES_ORDER)
> +
>   static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
>   {
>   	struct fuse_conn *fc = get_fuse_conn(dir);
> @@ -52,10 +58,10 @@ static void fuse_add_dirent_to_cache(struct file *file,
>   	}
>   	version = fi->rdc.version;
>   	size = fi->rdc.size;
> -	offset = size & ~PAGE_MASK;
> -	index = size >> PAGE_SHIFT;
> +	offset = size & ~READDIR_PAGES_MASK;
> +	index = size >> READDIR_PAGES_SHIFT;
>   	/* Dirent doesn't fit in current page?  Jump to next page. */
> -	if (offset + reclen > PAGE_SIZE) {
> +	if (offset + reclen > READDIR_PAGES_SIZE) {
>   		index++;
>   		offset = 0;
>   	}
> @@ -83,7 +89,7 @@ static void fuse_add_dirent_to_cache(struct file *file,
>   	}
>   	memcpy(addr + offset, dirent, reclen);
>   	kunmap_local(addr);
> -	fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen;
> +	fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen;
>   	fi->rdc.pos = dirent->off;
>   unlock:
>   	spin_unlock(&fi->rdc.lock);
> @@ -104,7 +110,7 @@ static void fuse_readdir_cache_end(struct file *file, loff_t pos)
>   	}
>   
>   	fi->rdc.cached = true;
> -	end = ALIGN(fi->rdc.size, PAGE_SIZE);
> +	end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE);
>   	spin_unlock(&fi->rdc.lock);
>   
>   	/* truncate unused tail of cache */
> @@ -328,25 +334,25 @@ static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
>   	struct fuse_mount *fm = get_fuse_mount(inode);
>   	struct fuse_io_args ia = {};
>   	struct fuse_args_pages *ap = &ia.ap;
> -	struct fuse_page_desc desc = { .length = PAGE_SIZE };
> +	struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
>   	u64 attr_version = 0;
>   	bool locked;
>   
> -	page = alloc_page(GFP_KERNEL);
> +	page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);

I guess that should become folio_alloc(), one way or the other.  I think
order 0 was chosen before to avoid the risk of allocation failure.  It
might work to try a large size and fall back to order 0 when that fails.
Or fall back to the slower vmalloc.

>   	if (!page)
>   		return -ENOMEM;
>   
>   	plus = fuse_use_readdirplus(inode, ctx);
>   	ap->args.out_pages = true;
> -	ap->num_pages = 1;
> +	ap->num_pages = READDIR_PAGES;
>   	ap->pages = &page;
>   	ap->descs = &desc;
>   	if (plus) {
>   		attr_version = fuse_get_attr_version(fm->fc);
> -		fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> +		fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
>   				    FUSE_READDIRPLUS);
>   	} else {
> -		fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
> +		fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
>   				    FUSE_READDIR);
>   	}
>   	locked = fuse_lock_inode(inode);
> @@ -383,7 +389,7 @@ static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
>   					       void *addr, unsigned int size,
>   					       struct dir_context *ctx)
>   {
> -	unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
> +	unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK;
>   	enum fuse_parse_result res = FOUND_NONE;
>   
>   	WARN_ON(offset >= size);
> @@ -504,16 +510,16 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
>   
>   	WARN_ON(fi->rdc.size < ff->readdir.cache_off);
>   
> -	index = ff->readdir.cache_off >> PAGE_SHIFT;
> +	index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT;
>   
> -	if (index == (fi->rdc.size >> PAGE_SHIFT))
> -		size = fi->rdc.size & ~PAGE_MASK;
> +	if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT))
> +		size = fi->rdc.size & ~READDIR_PAGES_MASK;
>   	else
> -		size = PAGE_SIZE;
> +		size = READDIR_PAGES_SIZE;
>   	spin_unlock(&fi->rdc.lock);
>   
>   	/* EOF? */
> -	if ((ff->readdir.cache_off & ~PAGE_MASK) == size)
> +	if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size)
>   		return 0;
>   
>   	page = find_get_page_flags(file->f_mapping, index,
> @@ -559,9 +565,9 @@ static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
>   	if (res == FOUND_ALL)
>   		return 0;
>   
> -	if (size == PAGE_SIZE) {
> +	if (size == READDIR_PAGES_SIZE) {
>   		/* We hit end of page: skip to next page. */
> -		ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE);
> +		ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE);
>   		goto retry;
>   	}
>   

Thanks,
Bernd
Randy Dunlap July 26, 2023, 3:19 p.m. UTC | #3
On 7/26/23 03:59, Jaco Kroon wrote:
> +config FUSE_READDIR_ORDER
> +	int
> +	range 0 5
> +	default 5
> +	help
> +		readdir performance varies greatly depending on the size of the read.
> +		Larger buffers results in larger reads, thus fewer reads and higher
> +		performance in return.
> +
> +		You may want to reduce this value on seriously constrained memory
> +		systems where 128KiB (assuming 4KiB pages) cache pages is not ideal.
> +
> +		This value reprents the order of the number of pages to allocate (ie,

	                   represents                                            (i.e.,

> +		the shift value).  A value of 0 is thus 1 page (4KiB) where 5 is 32
> +		pages (128KiB).
Jaco Kroon July 26, 2023, 3:26 p.m. UTC | #4
Hi,

On 2023/07/26 15:53, Bernd Schubert wrote:
>
>
> On 7/26/23 12:59, Jaco Kroon wrote:
>> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
>> ---
>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>> --- a/fs/fuse/Kconfig
>> +++ b/fs/fuse/Kconfig
>> @@ -18,6 +18,22 @@ config FUSE_FS
>>         If you want to develop a userspace FS, or if you want to use
>>         a filesystem based on FUSE, answer Y or M.
>>   +config FUSE_READDIR_ORDER
>> +    int
>> +    range 0 5
>> +    default 5
>> +    help
>> +        readdir performance varies greatly depending on the size of 
>> the read.
>> +        Larger buffers results in larger reads, thus fewer reads and 
>> higher
>> +        performance in return.
>> +
>> +        You may want to reduce this value on seriously constrained 
>> memory
>> +        systems where 128KiB (assuming 4KiB pages) cache pages is 
>> not ideal.
>> +
>> +        This value reprents the order of the number of pages to 
>> allocate (ie,
>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where 
>> 5 is 32
>> +        pages (128KiB).
>> +
>
> I like the idea of a larger readdir size, but shouldn't that be a 
> server/daemon/library decision which size to use, instead of kernel 
> compile time? So should be part of FUSE_INIT negotiation?

Yes sure, but there still needs to be a default.  And one page at a time 
doesn't cut it.

-- snip --

>>   -    page = alloc_page(GFP_KERNEL);
>> +    page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>
> I guess that should become folio alloc(), one way or the other. Now I 
> think order 0 was chosen before to avoid risk of allocation failure. I 
> guess it might work to try a large size and to fall back to 0 when 
> that failed. Or fail back to the slower vmalloc.

If this varies then a bunch of other code will become somewhat more
complex, especially if one alloc succeeds at the large size and a
follow-up alloc only succeeds at a smaller one.

I'm not familiar with the differences between the different mechanisms 
available for allocation.

-- snip --

> Thanks,
My pleasure,
Jaco
Bernd Schubert July 26, 2023, 3:30 p.m. UTC | #5
On 7/26/23 17:26, Jaco Kroon wrote:
> Hi,
> 
> On 2023/07/26 15:53, Bernd Schubert wrote:
>>
>>
>> On 7/26/23 12:59, Jaco Kroon wrote:
>>> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
>>> ---
>>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>> --- a/fs/fuse/Kconfig
>>> +++ b/fs/fuse/Kconfig
>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>         If you want to develop a userspace FS, or if you want to use
>>>         a filesystem based on FUSE, answer Y or M.
>>>   +config FUSE_READDIR_ORDER
>>> +    int
>>> +    range 0 5
>>> +    default 5
>>> +    help
>>> +        readdir performance varies greatly depending on the size of 
>>> the read.
>>> +        Larger buffers results in larger reads, thus fewer reads and 
>>> higher
>>> +        performance in return.
>>> +
>>> +        You may want to reduce this value on seriously constrained 
>>> memory
>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is 
>>> not ideal.
>>> +
>>> +        This value reprents the order of the number of pages to 
>>> allocate (ie,
>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where 
>>> 5 is 32
>>> +        pages (128KiB).
>>> +
>>
>> I like the idea of a larger readdir size, but shouldn't that be a 
>> server/daemon/library decision which size to use, instead of kernel 
>> compile time? So should be part of FUSE_INIT negotiation?
> 
> Yes sure, but there still needs to be a default.  And one page at a time 
> doesn't cut it.
> 
> -- snip --
> 
>>>   -    page = alloc_page(GFP_KERNEL);
>>> +    page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
>>
>> I guess that should become folio alloc(), one way or the other. Now I 
>> think order 0 was chosen before to avoid risk of allocation failure. I 
>> guess it might work to try a large size and to fall back to 0 when 
>> that failed. Or fail back to the slower vmalloc.
> 
> If this varies then a bunch of other code will become somewhat more 
> complex, especially if one alloc succeeds, and then a follow-up succeeds.

Yeah, the better choice is kvmalloc/kvfree which handles it internally.

> 
> I'm not familiar with the differences between the different mechanisms 
> available for allocation.
> 
> -- snip --
> 
>> Thanks,
> My pleasure,
> Jaco
Bernd Schubert July 26, 2023, 3:45 p.m. UTC | #6
On 7/26/23 17:26, Jaco Kroon wrote:
> Hi,
> 
> On 2023/07/26 15:53, Bernd Schubert wrote:
>>
>>
>> On 7/26/23 12:59, Jaco Kroon wrote:
>>> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
>>> ---
>>>   fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>   fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>   2 files changed, 40 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>> --- a/fs/fuse/Kconfig
>>> +++ b/fs/fuse/Kconfig
>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>         If you want to develop a userspace FS, or if you want to use
>>>         a filesystem based on FUSE, answer Y or M.
>>>   +config FUSE_READDIR_ORDER
>>> +    int
>>> +    range 0 5
>>> +    default 5
>>> +    help
>>> +        readdir performance varies greatly depending on the size of 
>>> the read.
>>> +        Larger buffers results in larger reads, thus fewer reads and 
>>> higher
>>> +        performance in return.
>>> +
>>> +        You may want to reduce this value on seriously constrained 
>>> memory
>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is 
>>> not ideal.
>>> +
>>> +        This value reprents the order of the number of pages to 
>>> allocate (ie,
>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where 
>>> 5 is 32
>>> +        pages (128KiB).
>>> +
>>
>> I like the idea of a larger readdir size, but shouldn't that be a 
>> server/daemon/library decision which size to use, instead of kernel 
>> compile time? So should be part of FUSE_INIT negotiation?
> 
> Yes sure, but there still needs to be a default.  And one page at a time 
> doesn't cut it.

With FUSE_INIT userspace would make that decision, based on what kernel
fuse suggests?  process_init_reply() already handles other limits - I
don't see why the readdir max has to be a compile-time option.  Maybe a
module option to set the limit?

Thanks,
Bernd
Antonio SJ Musumeci July 26, 2023, 5:23 p.m. UTC | #7
On 7/26/23 10:45, Bernd Schubert wrote:
>
> On 7/26/23 17:26, Jaco Kroon wrote:
>> Hi,
>>
>> On 2023/07/26 15:53, Bernd Schubert wrote:
>>>
>>> On 7/26/23 12:59, Jaco Kroon wrote:
>>>> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
>>>> ---
>>>>    fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>>    fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>>    2 files changed, 40 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>>> --- a/fs/fuse/Kconfig
>>>> +++ b/fs/fuse/Kconfig
>>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>>          If you want to develop a userspace FS, or if you want to use
>>>>          a filesystem based on FUSE, answer Y or M.
>>>>    +config FUSE_READDIR_ORDER
>>>> +    int
>>>> +    range 0 5
>>>> +    default 5
>>>> +    help
>>>> +        readdir performance varies greatly depending on the size of
>>>> the read.
>>>> +        Larger buffers results in larger reads, thus fewer reads and
>>>> higher
>>>> +        performance in return.
>>>> +
>>>> +        You may want to reduce this value on seriously constrained
>>>> memory
>>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is
>>>> not ideal.
>>>> +
>>>> +        This value reprents the order of the number of pages to
>>>> allocate (ie,
>>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where
>>>> 5 is 32
>>>> +        pages (128KiB).
>>>> +
>>> I like the idea of a larger readdir size, but shouldn't that be a
>>> server/daemon/library decision which size to use, instead of kernel
>>> compile time? So should be part of FUSE_INIT negotiation?
>> Yes sure, but there still needs to be a default.  And one page at a time
>> doesn't cut it.
> With FUSE_INIT userspace would make that decision, based on what kernel
> fuse suggests? process_init_reply() already handles other limits - I
> don't see why readdir max has to be compile time option. Maybe a module
> option to set the limit?
>
> Thanks,
> Bernd

I had a similar question / comment.  This seems to me to be more
appropriately handled by the server via FUSE_INIT.

And wouldn't "max" more easily be FUSE_MAX_MAX_PAGES?  Is there a reason
not to allow readdir buffers of up to 256 pages?
Jaco Kroon July 26, 2023, 6:25 p.m. UTC | #8
Hi,

On 2023/07/26 19:23, Antonio SJ Musumeci wrote:
> On 7/26/23 10:45, Bernd Schubert wrote:
>> On 7/26/23 17:26, Jaco Kroon wrote:
>>> Hi,
>>>
>>> On 2023/07/26 15:53, Bernd Schubert wrote:
>>>> On 7/26/23 12:59, Jaco Kroon wrote:
>>>>> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
>>>>> ---
>>>>>     fs/fuse/Kconfig   | 16 ++++++++++++++++
>>>>>     fs/fuse/readdir.c | 42 ++++++++++++++++++++++++------------------
>>>>>     2 files changed, 40 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
>>>>> index 038ed0b9aaa5..0783f9ee5cd3 100644
>>>>> --- a/fs/fuse/Kconfig
>>>>> +++ b/fs/fuse/Kconfig
>>>>> @@ -18,6 +18,22 @@ config FUSE_FS
>>>>>           If you want to develop a userspace FS, or if you want to use
>>>>>           a filesystem based on FUSE, answer Y or M.
>>>>>     +config FUSE_READDIR_ORDER
>>>>> +    int
>>>>> +    range 0 5
>>>>> +    default 5
>>>>> +    help
>>>>> +        readdir performance varies greatly depending on the size of
>>>>> the read.
>>>>> +        Larger buffers results in larger reads, thus fewer reads and
>>>>> higher
>>>>> +        performance in return.
>>>>> +
>>>>> +        You may want to reduce this value on seriously constrained
>>>>> memory
>>>>> +        systems where 128KiB (assuming 4KiB pages) cache pages is
>>>>> not ideal.
>>>>> +
>>>>> +        This value reprents the order of the number of pages to
>>>>> allocate (ie,
>>>>> +        the shift value).  A value of 0 is thus 1 page (4KiB) where
>>>>> 5 is 32
>>>>> +        pages (128KiB).
>>>>> +
>>>> I like the idea of a larger readdir size, but shouldn't that be a
>>>> server/daemon/library decision which size to use, instead of kernel
>>>> compile time? So should be part of FUSE_INIT negotiation?
>>> Yes sure, but there still needs to be a default.  And one page at a time
>>> doesn't cut it.
>> With FUSE_INIT userspace would make that decision, based on what kernel
>> fuse suggests? process_init_reply() already handles other limits - I
>> don't see why readdir max has to be compile time option. Maybe a module
>> option to set the limit?
>>
>> Thanks,
>> Bernd
> I had similar question / comment. This seems to me to be more
> appropriately handed by the server via FUSE_INIT.
>
> And wouldn't "max" more easily be FUSE_MAX_MAX_PAGES? Is there a reason
> not to allow upwards of 256 pages sized readdir buffer?

Will look into FUSE_INIT.  As I understand from what I've read, FUSE_INIT
has some expansion constraints, or the structure is somehow negotiated.
In other words, older clients that are not aware of the option will
follow some default.  For backwards compatibility that default should
probably be 1 page; for performance reasons it makes sense for this
limit to be larger.

glibc uses a 128KiB buffer for getdents64, so I'm not sure >128KiB here 
makes sense.  Or if these two buffers are even directly related.

Default to fc->max_pages (which defaults to 32 pages, i.e. 128KiB) if
the user-space side doesn't understand the max_readdir_pages limit?
That assumes these limits should be set separately; I'm thinking
piggybacking on fc->max_pages is just fine, to be honest.

For the sake of avoiding division and modulo operations in the cache,
I'm thinking of rounding max_pages down to the closest power of two, so
as to stick to binary operators rather than divisions and mods.

The current patch introduces a definite memory leak either way: it tore
through about 12GB of RAM in a matter of 20 minutes or so.  I'm going to
patch that first, and then, based on the responses above, look into an
alternative patch that does not depend on a compile-time option.  I'm
guessing the __free_page call should become a multi-page variant now.

Thanks for all the feedback so far.

Kind regards,
Jaco
Miklos Szeredi July 27, 2023, 3:21 p.m. UTC | #9
On Wed, Jul 26, 2023 at 08:25:48PM +0200, Jaco Kroon wrote:

> glibc uses a 128KiB buffer for getdents64, so I'm not sure >128KiB here
> makes sense.  Or if these two buffers are even directly related.

Definitely related.  See how dir_emit() works: if the entry doesn't fit in
the userspace buffer, then ->readdir() returns.  If entries remain in the fuse
buffer at that time, they are simply discarded.  Simply put, making the
buffer too large is going to be a waste of resources and at some point will be
slower than the single-page buffer.

Note: discarding received entries is the only possible action if caching is
disabled.  With caching enabled it would make sense to pre-fill the cache with
the overflowed entries and use them in the next iteration.  However the caching
code is not prepared for using partial caches at the moment.

The best strategy would be to find the optimal buffer size based on the size of
the userspace buffer.  Putting that info into struct dir_context doesn't sound
too complicated...

Here's a patch.  It doesn't touch readdir.  Simply setting the fuse buffer size
to the userspace buffer size should work, the record sizes are similar (fuse's
is slightly larger than libc's, so no overflow should ever happen).

Thanks,
Miklos


---
 fs/exportfs/expfs.c    |    1 +
 fs/overlayfs/readdir.c |   12 +++++++++++-
 fs/readdir.c           |   29 ++++++++++++++---------------
 include/linux/fs.h     |    7 +++++++
 4 files changed, 33 insertions(+), 16 deletions(-)

--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -286,6 +286,7 @@ static int get_name(const struct path *p
 	};
 	struct getdents_callback buffer = {
 		.ctx.actor = filldir_one,
+		.ctx.count = INT_MAX,
 		.name = name,
 	};
 
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -347,6 +347,7 @@ static int ovl_dir_read_merged(struct de
 	struct path realpath;
 	struct ovl_readdir_data rdd = {
 		.ctx.actor = ovl_fill_merge,
+		.ctx.count = INT_MAX,
 		.dentry = dentry,
 		.list = list,
 		.root = root,
@@ -557,6 +558,7 @@ static int ovl_dir_read_impure(const str
 	struct ovl_cache_entry *p, *n;
 	struct ovl_readdir_data rdd = {
 		.ctx.actor = ovl_fill_plain,
+		.ctx.count = INT_MAX,
 		.list = list,
 		.root = root,
 	};
@@ -658,6 +660,7 @@ static bool ovl_fill_real(struct dir_con
 	struct ovl_readdir_translate *rdt =
 		container_of(ctx, struct ovl_readdir_translate, ctx);
 	struct dir_context *orig_ctx = rdt->orig_ctx;
+	bool res;
 
 	if (rdt->parent_ino && strcmp(name, "..") == 0) {
 		ino = rdt->parent_ino;
@@ -672,7 +675,10 @@ static bool ovl_fill_real(struct dir_con
 					  name, namelen, rdt->xinowarn);
 	}
 
-	return orig_ctx->actor(orig_ctx, name, namelen, offset, ino, d_type);
+	res = orig_ctx->actor(orig_ctx, name, namelen, offset, ino, d_type);
+	ctx->count = orig_ctx->count;
+
+	return res;
 }
 
 static bool ovl_is_impure_dir(struct file *file)
@@ -699,6 +705,7 @@ static int ovl_iterate_real(struct file
 	const struct ovl_layer *lower_layer = ovl_layer_lower(dir);
 	struct ovl_readdir_translate rdt = {
 		.ctx.actor = ovl_fill_real,
+		.ctx.count = ctx->count,
 		.orig_ctx = ctx,
 		.xinobits = ovl_xino_bits(ofs),
 		.xinowarn = ovl_xino_warn(ofs),
@@ -1058,6 +1065,7 @@ int ovl_check_d_type_supported(const str
 	int err;
 	struct ovl_readdir_data rdd = {
 		.ctx.actor = ovl_check_d_type,
+		.ctx.count = INT_MAX,
 		.d_type_supported = false,
 	};
 
@@ -1079,6 +1087,7 @@ static int ovl_workdir_cleanup_recurse(s
 	struct ovl_cache_entry *p;
 	struct ovl_readdir_data rdd = {
 		.ctx.actor = ovl_fill_plain,
+		.ctx.count = INT_MAX,
 		.list = &list,
 	};
 	bool incompat = false;
@@ -1163,6 +1172,7 @@ int ovl_indexdir_cleanup(struct ovl_fs *
 	struct ovl_cache_entry *p;
 	struct ovl_readdir_data rdd = {
 		.ctx.actor = ovl_fill_plain,
+		.ctx.count = INT_MAX,
 		.list = &list,
 	};
 
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -184,6 +184,7 @@ SYSCALL_DEFINE3(old_readdir, unsigned in
 	struct fd f = fdget_pos(fd);
 	struct readdir_callback buf = {
 		.ctx.actor = fillonedir,
+		.ctx.count = 1, /* Hint to fs: just one entry. */
 		.dirent = dirent
 	};
 
@@ -215,7 +216,6 @@ struct getdents_callback {
 	struct dir_context ctx;
 	struct linux_dirent __user * current_dir;
 	int prev_reclen;
-	int count;
 	int error;
 };
 
@@ -234,7 +234,7 @@ static bool filldir(struct dir_context *
 	if (unlikely(buf->error))
 		return false;
 	buf->error = -EINVAL;	/* only used if we fail.. */
-	if (reclen > buf->count)
+	if (reclen > ctx->count)
 		return false;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
@@ -259,7 +259,7 @@ static bool filldir(struct dir_context *
 
 	buf->current_dir = (void __user *)dirent + reclen;
 	buf->prev_reclen = reclen;
-	buf->count -= reclen;
+	ctx->count -= reclen;
 	return true;
 efault_end:
 	user_write_access_end();
@@ -274,7 +274,7 @@ SYSCALL_DEFINE3(getdents, unsigned int,
 	struct fd f;
 	struct getdents_callback buf = {
 		.ctx.actor = filldir,
-		.count = count,
+		.ctx.count = count,
 		.current_dir = dirent
 	};
 	int error;
@@ -293,7 +293,7 @@ SYSCALL_DEFINE3(getdents, unsigned int,
 		if (put_user(buf.ctx.pos, &lastdirent->d_off))
 			error = -EFAULT;
 		else
-			error = count - buf.count;
+			error = count - buf.ctx.count;
 	}
 	fdput_pos(f);
 	return error;
@@ -303,7 +303,6 @@ struct getdents_callback64 {
 	struct dir_context ctx;
 	struct linux_dirent64 __user * current_dir;
 	int prev_reclen;
-	int count;
 	int error;
 };
 
@@ -321,7 +320,7 @@ static bool filldir64(struct dir_context
 	if (unlikely(buf->error))
 		return false;
 	buf->error = -EINVAL;	/* only used if we fail.. */
-	if (reclen > buf->count)
+	if (reclen > ctx->count)
 		return false;
 	prev_reclen = buf->prev_reclen;
 	if (prev_reclen && signal_pending(current))
@@ -341,7 +340,7 @@ static bool filldir64(struct dir_context
 
 	buf->prev_reclen = reclen;
 	buf->current_dir = (void __user *)dirent + reclen;
-	buf->count -= reclen;
+	ctx->count -= reclen;
 	return true;
 
 efault_end:
@@ -357,7 +356,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int
 	struct fd f;
 	struct getdents_callback64 buf = {
 		.ctx.actor = filldir64,
-		.count = count,
+		.ctx.count = count,
 		.current_dir = dirent
 	};
 	int error;
@@ -377,7 +376,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int
 		if (put_user(d_off, &lastdirent->d_off))
 			error = -EFAULT;
 		else
-			error = count - buf.count;
+			error = count - buf.ctx.count;
 	}
 	fdput_pos(f);
 	return error;
@@ -442,6 +441,7 @@ COMPAT_SYSCALL_DEFINE3(old_readdir, unsi
 	struct fd f = fdget_pos(fd);
 	struct compat_readdir_callback buf = {
 		.ctx.actor = compat_fillonedir,
+		.ctx.count = 1, /* Hint to fs: just one entry. */
 		.dirent = dirent
 	};
 
@@ -467,7 +467,6 @@ struct compat_getdents_callback {
 	struct dir_context ctx;
 	struct compat_linux_dirent __user *current_dir;
 	int prev_reclen;
-	int count;
 	int error;
 };
 
@@ -486,7 +485,7 @@ static bool compat_filldir(struct dir_co
 	if (unlikely(buf->error))
 		return false;
 	buf->error = -EINVAL;	/* only used if we fail.. */
-	if (reclen > buf->count)
+	if (reclen > ctx->count)
 		return false;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
@@ -510,7 +509,7 @@ static bool compat_filldir(struct dir_co
 
 	buf->prev_reclen = reclen;
 	buf->current_dir = (void __user *)dirent + reclen;
-	buf->count -= reclen;
+	ctx->count -= reclen;
 	return true;
 efault_end:
 	user_write_access_end();
@@ -525,8 +524,8 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigne
 	struct fd f;
 	struct compat_getdents_callback buf = {
 		.ctx.actor = compat_filldir,
+		.ctx.count = count,
 		.current_dir = dirent,
-		.count = count
 	};
 	int error;
 
@@ -544,7 +543,7 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigne
 		if (put_user(buf.ctx.pos, &lastdirent->d_off))
 			error = -EFAULT;
 		else
-			error = count - buf.count;
+			error = count - buf.ctx.count;
 	}
 	fdput_pos(f);
 	return error;
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1719,6 +1719,13 @@ typedef bool (*filldir_t)(struct dir_con
 struct dir_context {
 	filldir_t actor;
 	loff_t pos;
+	/*
+	 * Filesystems MUST NOT MODIFY count, but may use as a hint:
+	 * 0	    unknown
+	 * > 0      space in buffer (assume at least one entry)
+	 * INT_MAX  unlimited
+	 */
+	int count;
 };
 
 /*
Miklos Szeredi July 27, 2023, 7:21 p.m. UTC | #10
On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <jaco@uls.co.za> wrote:

> Will look into FUSE_INIT.  The FUSE_INIT as I understand from what I've
> read has some expansion constraints or the structure is somehow
> negotiated.  Older clients in other words that's not aware of the option
> will follow some default.  For backwards compatibility that default
> should probably be 1 page.  For performance reasons it makes sense that
> this limit be larger.

Yes, might need this for backward compatibility.  But perhaps a
feature flag is enough and the readdir buf can be limited to
fc->max_read.

Thanks,
Miklos
Bernd Schubert July 27, 2023, 7:43 p.m. UTC | #11
On 7/27/23 21:21, Miklos Szeredi wrote:
> On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <jaco@uls.co.za> wrote:
> 
>> Will look into FUSE_INIT.  The FUSE_INIT as I understand from what I've
>> read has some expansion constraints or the structure is somehow
>> negotiated.  Older clients in other words that's not aware of the option
>> will follow some default.  For backwards compatibility that default
>> should probably be 1 page.  For performance reasons it makes sense that
>> this limit be larger.
> 
> Yes, might need this for backward compatibility.  But perhaps a
> feature flag is enough and the readdir buf can be limited to
> fc->max_read.

fc->max_read is set by default to ~0 and only set to something else when 
the max_read mount option is given? So typically that is a large value 
(UINT_MAX)?


Thanks,
Bernd
Miklos Szeredi July 28, 2023, 8:42 a.m. UTC | #12
On Thu, 27 Jul 2023 at 21:43, Bernd Schubert <bernd.schubert@fastmail.fm> wrote:
>
>
>
> On 7/27/23 21:21, Miklos Szeredi wrote:
> > On Wed, 26 Jul 2023 at 20:40, Jaco Kroon <jaco@uls.co.za> wrote:
> >
> >> Will look into FUSE_INIT.  The FUSE_INIT as I understand from what I've
> >> read has some expansion constraints or the structure is somehow
> >> negotiated.  Older clients in other words that's not aware of the option
> >> will follow some default.  For backwards compatibility that default
> >> should probably be 1 page.  For performance reasons it makes sense that
> >> this limit be larger.
> >
> > Yes, might need this for backward compatibility.  But perhaps a
> > feature flag is enough and the readdir buf can be limited to
> > fc->max_read.
>
> fc->max_read is set by default to ~0 and only set to something else when
> the max_read mount option is given? So typically that is a large value
> (UINT_MAX)?

That's fine.  It probably still makes sense to limit it to 128k, but
with the ctx->count patch it would be naturally limited by the size of
the userspace buffer.  There's really no downside to enabling a large
buffer, other than an unlikely regression in userspace.  If the server
wants to return fewer entries, it still can.  Unlike plain reads,
filling the buffer to the fullest extent isn't required for readdir.

So the buffer size calculation can be something like:

init:
#define FUSE_READDIR_MAX (128*1024)
fc->max_readdir = PAGE_SIZE;
if (flags & FUSE_LARGE_READDIR)
   fc->max_readdir = min(fc->max_read, FUSE_READDIR_MAX);

[...]
readdir:
   bufsize = min(fc->max_readdir, max(ctx->count, PAGE_SIZE));

Thanks,
Miklos
Patch

diff --git a/fs/fuse/Kconfig b/fs/fuse/Kconfig
index 038ed0b9aaa5..0783f9ee5cd3 100644
--- a/fs/fuse/Kconfig
+++ b/fs/fuse/Kconfig
@@ -18,6 +18,22 @@  config FUSE_FS
 	  If you want to develop a userspace FS, or if you want to use
 	  a filesystem based on FUSE, answer Y or M.
 
+config FUSE_READDIR_ORDER
+	int
+	range 0 5
+	default 5
+	help
+		Readdir performance varies greatly depending on the size of the read.
+		Larger buffers result in larger reads, thus fewer reads and higher
+		performance in return.
+
+		You may want to reduce this value on memory-constrained systems
+		where allocating 32 cache pages (128KiB with 4KiB pages) is not ideal.
+
+		This value represents the order of the number of pages to allocate
+		(i.e. the shift value).  A value of 0 is thus 1 page (4KiB), whereas
+		5 is 32 pages (128KiB).
+
 config CUSE
 	tristate "Character device in Userspace support"
 	depends on FUSE_FS
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index dc603479b30e..98c62b623240 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -13,6 +13,12 @@ 
 #include <linux/pagemap.h>
 #include <linux/highmem.h>
 
+#define READDIR_PAGES_ORDER		CONFIG_FUSE_READDIR_ORDER
+#define READDIR_PAGES			(1 << READDIR_PAGES_ORDER)
+#define READDIR_PAGES_SIZE		(PAGE_SIZE << READDIR_PAGES_ORDER)
+#define READDIR_PAGES_MASK		(~(READDIR_PAGES_SIZE - 1))	/* like PAGE_MASK */
+#define READDIR_PAGES_SHIFT		(PAGE_SHIFT + READDIR_PAGES_ORDER)
+
 static bool fuse_use_readdirplus(struct inode *dir, struct dir_context *ctx)
 {
 	struct fuse_conn *fc = get_fuse_conn(dir);
@@ -52,10 +58,10 @@  static void fuse_add_dirent_to_cache(struct file *file,
 	}
 	version = fi->rdc.version;
 	size = fi->rdc.size;
-	offset = size & ~PAGE_MASK;
-	index = size >> PAGE_SHIFT;
+	offset = size & ~READDIR_PAGES_MASK;
+	index = size >> READDIR_PAGES_SHIFT;
 	/* Dirent doesn't fit in current page?  Jump to next page. */
-	if (offset + reclen > PAGE_SIZE) {
+	if (offset + reclen > READDIR_PAGES_SIZE) {
 		index++;
 		offset = 0;
 	}
@@ -83,7 +89,7 @@  static void fuse_add_dirent_to_cache(struct file *file,
 	}
 	memcpy(addr + offset, dirent, reclen);
 	kunmap_local(addr);
-	fi->rdc.size = (index << PAGE_SHIFT) + offset + reclen;
+	fi->rdc.size = (index << READDIR_PAGES_SHIFT) + offset + reclen;
 	fi->rdc.pos = dirent->off;
 unlock:
 	spin_unlock(&fi->rdc.lock);
@@ -104,7 +110,7 @@  static void fuse_readdir_cache_end(struct file *file, loff_t pos)
 	}
 
 	fi->rdc.cached = true;
-	end = ALIGN(fi->rdc.size, PAGE_SIZE);
+	end = ALIGN(fi->rdc.size, READDIR_PAGES_SIZE);
 	spin_unlock(&fi->rdc.lock);
 
 	/* truncate unused tail of cache */
@@ -328,25 +334,25 @@  static int fuse_readdir_uncached(struct file *file, struct dir_context *ctx)
 	struct fuse_mount *fm = get_fuse_mount(inode);
 	struct fuse_io_args ia = {};
 	struct fuse_args_pages *ap = &ia.ap;
-	struct fuse_page_desc desc = { .length = PAGE_SIZE };
+	struct fuse_page_desc desc = { .length = READDIR_PAGES_SIZE };
 	u64 attr_version = 0;
 	bool locked;
 
-	page = alloc_page(GFP_KERNEL);
+	page = alloc_pages(GFP_KERNEL, READDIR_PAGES_ORDER);
 	if (!page)
 		return -ENOMEM;
 
 	plus = fuse_use_readdirplus(inode, ctx);
 	ap->args.out_pages = true;
-	ap->num_pages = 1;
+	ap->num_pages = READDIR_PAGES;
 	ap->pages = &page;
 	ap->descs = &desc;
 	if (plus) {
 		attr_version = fuse_get_attr_version(fm->fc);
-		fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
+		fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
 				    FUSE_READDIRPLUS);
 	} else {
-		fuse_read_args_fill(&ia, file, ctx->pos, PAGE_SIZE,
+		fuse_read_args_fill(&ia, file, ctx->pos, READDIR_PAGES_SIZE,
 				    FUSE_READDIR);
 	}
 	locked = fuse_lock_inode(inode);
@@ -383,7 +389,7 @@  static enum fuse_parse_result fuse_parse_cache(struct fuse_file *ff,
 					       void *addr, unsigned int size,
 					       struct dir_context *ctx)
 {
-	unsigned int offset = ff->readdir.cache_off & ~PAGE_MASK;
+	unsigned int offset = ff->readdir.cache_off & ~READDIR_PAGES_MASK;
 	enum fuse_parse_result res = FOUND_NONE;
 
 	WARN_ON(offset >= size);
@@ -504,16 +510,16 @@  static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
 
 	WARN_ON(fi->rdc.size < ff->readdir.cache_off);
 
-	index = ff->readdir.cache_off >> PAGE_SHIFT;
+	index = ff->readdir.cache_off >> READDIR_PAGES_SHIFT;
 
-	if (index == (fi->rdc.size >> PAGE_SHIFT))
-		size = fi->rdc.size & ~PAGE_MASK;
+	if (index == (fi->rdc.size >> READDIR_PAGES_SHIFT))
+		size = fi->rdc.size & ~READDIR_PAGES_MASK;
 	else
-		size = PAGE_SIZE;
+		size = READDIR_PAGES_SIZE;
 	spin_unlock(&fi->rdc.lock);
 
 	/* EOF? */
-	if ((ff->readdir.cache_off & ~PAGE_MASK) == size)
+	if ((ff->readdir.cache_off & ~READDIR_PAGES_MASK) == size)
 		return 0;
 
 	page = find_get_page_flags(file->f_mapping, index,
@@ -559,9 +565,9 @@  static int fuse_readdir_cached(struct file *file, struct dir_context *ctx)
 	if (res == FOUND_ALL)
 		return 0;
 
-	if (size == PAGE_SIZE) {
+	if (size == READDIR_PAGES_SIZE) {
 		/* We hit end of page: skip to next page. */
-		ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, PAGE_SIZE);
+		ff->readdir.cache_off = ALIGN(ff->readdir.cache_off, READDIR_PAGES_SIZE);
 		goto retry;
 	}