diff mbox series

[2/3] sysctl: add support for drop_caches for individual filesystem

Message ID 20241010112543.1609648-3-yebin@huaweicloud.com (mailing list archive)
State New
Headers show
Series add support for drop_caches for individual filesystem | expand

Commit Message

yebin Oct. 10, 2024, 11:25 a.m. UTC
From: Ye Bin <yebin10@huawei.com>

In order to better analyze the issue of file system uninstallation caused
by kernel module opening files, it is necessary to perform dentry recycling
on a single file system. But now, apart from global dentry recycling, it is
not supported to do dentry recycling on a single file system separately.
This feature has usage scenarios in problem localization scenarios.At the
same time, it also provides users with a slightly fine-grained
pagecache/entry recycling mechanism.
This patch supports the recycling of pagecache/entry for individual file
systems.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h |  2 ++
 kernel/sysctl.c    |  9 +++++++++
 3 files changed, 54 insertions(+)

Comments

Jan Kara Oct. 10, 2024, 12:16 p.m. UTC | #1
On Thu 10-10-24 19:25:42, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
> 
> In order to better analyze the issue of file system uninstallation caused
> by kernel module opening files, it is necessary to perform dentry recycling

I don't quite understand the use case you mention here. Can you explain it
a bit more (that being said I've needed dropping caches for a particular sb
myself a few times for debugging purposes so I generally agree it is a
useful feature).

> on a single file system. But now, apart from global dentry recycling, it is
> not supported to do dentry recycling on a single file system separately.
> This feature has usage scenarios in problem localization scenarios.At the
> same time, it also provides users with a slightly fine-grained
> pagecache/entry recycling mechanism.
> This patch supports the recycling of pagecache/entry for individual file
> systems.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
>  fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h |  2 ++
>  kernel/sysctl.c    |  9 +++++++++
>  3 files changed, 54 insertions(+)
> 
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index d45ef541d848..99d412cf3e52 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
>  	}
>  	return 0;
>  }
> +
> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
> +				  void *buffer, size_t *length, loff_t *ppos)
> +{
> +	unsigned int major, minor;
> +	unsigned int ctl;
> +	struct super_block *sb;
> +	static int stfu;
> +
> +	if (!write)
> +		return 0;
> +
> +	if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
> +		return -EINVAL;

I think specifying bdev major & minor number is not a great interface these
days. In particular for filesystems which are not bdev based such as NFS. I
think specifying path to some file/dir in the filesystem is nicer and you
can easily resolve that to sb here as well.

								Honza
yebin (H) Oct. 10, 2024, 12:44 p.m. UTC | #2
On 2024/10/10 20:16, Jan Kara wrote:
> On Thu 10-10-24 19:25:42, Ye Bin wrote:
>> From: Ye Bin <yebin10@huawei.com>
>>
>> In order to better analyze the issue of file system uninstallation caused
>> by kernel module opening files, it is necessary to perform dentry recycling
> I don't quite understand the use case you mention here. Can you explain it
> a bit more (that being said I've needed dropping caches for a particular sb
> myself a few times for debugging purposes so I generally agree it is a
> useful feature).
Well, I'm analyzing what files are still open and the file system can't 
be unmounted.
The process occupied by the opened file cannot be found through the 
fuser. That is,
the file may be occupied by the kernel mode. You can insert a module or 
use kprobe
to obtain all cached files of the corresponding file system. But there 
can be a lot of
files, so I want to clean up irrelevant files first.
>> on a single file system. But now, apart from global dentry recycling, it is
>> not supported to do dentry recycling on a single file system separately.
>> This feature has usage scenarios in problem localization scenarios.At the
>> same time, it also provides users with a slightly fine-grained
>> pagecache/entry recycling mechanism.
>> This patch supports the recycling of pagecache/entry for individual file
>> systems.
>>
>> Signed-off-by: Ye Bin <yebin10@huawei.com>
>> ---
>>   fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
>>   include/linux/mm.h |  2 ++
>>   kernel/sysctl.c    |  9 +++++++++
>>   3 files changed, 54 insertions(+)
>>
>> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
>> index d45ef541d848..99d412cf3e52 100644
>> --- a/fs/drop_caches.c
>> +++ b/fs/drop_caches.c
>> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
>>   	}
>>   	return 0;
>>   }
>> +
>> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
>> +				  void *buffer, size_t *length, loff_t *ppos)
>> +{
>> +	unsigned int major, minor;
>> +	unsigned int ctl;
>> +	struct super_block *sb;
>> +	static int stfu;
>> +
>> +	if (!write)
>> +		return 0;
>> +
>> +	if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
>> +		return -EINVAL;
> I think specifying bdev major & minor number is not a great interface these
> days. In particular for filesystems which are not bdev based such as NFS. I
> think specifying path to some file/dir in the filesystem is nicer and you
> can easily resolve that to sb here as well.
>
> 								Honza
That's a really good idea. I think by specifying bdev "major & minor", 
you can reclaim
the file system pagecache that is not unmounted due to "umount -l" mode. 
In this
case, the sb of the corresponding file system cannot be found in the 
specified path.
So I think we can support both ways. I look forward to your opinion.
Benjamin Coddington Oct. 10, 2024, 1:35 p.m. UTC | #3
On 10 Oct 2024, at 8:16, Jan Kara wrote:

> On Thu 10-10-24 19:25:42, Ye Bin wrote:
>> From: Ye Bin <yebin10@huawei.com>
>>
>> In order to better analyze the issue of file system uninstallation caused
>> by kernel module opening files, it is necessary to perform dentry recycling
>
> I don't quite understand the use case you mention here. Can you explain it
> a bit more (that being said I've needed dropping caches for a particular sb
> myself a few times for debugging purposes so I generally agree it is a
> useful feature).
>
>> on a single file system. But now, apart from global dentry recycling, it is
>> not supported to do dentry recycling on a single file system separately.
>> This feature has usage scenarios in problem localization scenarios.At the
>> same time, it also provides users with a slightly fine-grained
>> pagecache/entry recycling mechanism.
>> This patch supports the recycling of pagecache/entry for individual file
>> systems.
>>
>> Signed-off-by: Ye Bin <yebin10@huawei.com>
>> ---
>>  fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/mm.h |  2 ++
>>  kernel/sysctl.c    |  9 +++++++++
>>  3 files changed, 54 insertions(+)
>>
>> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
>> index d45ef541d848..99d412cf3e52 100644
>> --- a/fs/drop_caches.c
>> +++ b/fs/drop_caches.c
>> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
>>  	}
>>  	return 0;
>>  }
>> +
>> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
>> +				  void *buffer, size_t *length, loff_t *ppos)
>> +{
>> +	unsigned int major, minor;
>> +	unsigned int ctl;
>> +	struct super_block *sb;
>> +	static int stfu;
>> +
>> +	if (!write)
>> +		return 0;
>> +
>> +	if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
>> +		return -EINVAL;
>
> I think specifying bdev major & minor number is not a great interface these
> days. In particular for filesystems which are not bdev based such as NFS. I
> think specifying path to some file/dir in the filesystem is nicer and you
> can easily resolve that to sb here as well.

Slight disagreement here since NFS uses set_anon_super() and major:minor
will work fine with it.  I'd prefer it actually since it avoids this
interface having to do a pathwalk and make decisions about what's mounted
where and in what namespace.

Ben
Thomas Weißschuh Oct. 10, 2024, 1:48 p.m. UTC | #4
On 2024-10-10 19:25:42+0800, Ye Bin wrote:
> From: Ye Bin <yebin10@huawei.com>
> 
> In order to better analyze the issue of file system uninstallation caused
> by kernel module opening files, it is necessary to perform dentry recycling
> on a single file system. But now, apart from global dentry recycling, it is
> not supported to do dentry recycling on a single file system separately.
> This feature has usage scenarios in problem localization scenarios.At the
> same time, it also provides users with a slightly fine-grained
> pagecache/entry recycling mechanism.
> This patch supports the recycling of pagecache/entry for individual file
> systems.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>
> ---
>  fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h |  2 ++
>  kernel/sysctl.c    |  9 +++++++++
>  3 files changed, 54 insertions(+)
> 
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index d45ef541d848..99d412cf3e52 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
>  	}
>  	return 0;
>  }
> +
> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
> +				  void *buffer, size_t *length, loff_t *ppos)
> +{
> +	unsigned int major, minor;
> +	unsigned int ctl;
> +	struct super_block *sb;
> +	static int stfu;
> +
> +	if (!write)
> +		return 0;
> +
> +	if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
> +		return -EINVAL;
> +
> +	if (ctl < *((int *)table->extra1) || ctl > *((int *)table->extra2))
> +		return -EINVAL;
> +
> +	sb = user_get_super(MKDEV(major, minor), false);
> +	if (!sb)
> +		return -EINVAL;
> +
> +	if (ctl & 1) {

BIT(0)

> +		lru_add_drain_all();
> +		drop_pagecache_sb(sb, NULL);
> +		count_vm_event(DROP_PAGECACHE);
> +	}
> +
> +	if (ctl & 2) {
> +		shrink_dcache_sb(sb);
> +		shrink_icache_sb(sb);
> +		count_vm_event(DROP_SLAB);
> +	}
> +
> +	drop_super(sb);
> +
> +	if (!stfu)
> +		pr_info("%s (%d): drop_fs_caches: %u:%u:%d\n", current->comm,
> +			task_pid_nr(current), major, minor, ctl);
> +	stfu |= ctl & 4;

This looks very weird. I guess it's already in the original
drop_caches_sysctl_handler().

> +
> +	return 0;
> +}
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 344541f8cba0..43079478296f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3788,6 +3788,8 @@ extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
>  extern int sysctl_drop_caches;
>  int drop_caches_sysctl_handler(const struct ctl_table *, int, void *, size_t *,
>  		loff_t *);
> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
> +				  void *buffer, size_t *length, loff_t *ppos);
>  #endif
>  
>  void drop_slab(void);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 79e6cb1d5c48..d434cbe10e47 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2101,6 +2101,15 @@ static struct ctl_table vm_table[] = {

Sooner or later this table should move out of kernel/sysctl.c and into a
subsystem-specific file.
This also means the handler doesn't need to be exported.

>  		.extra1		= SYSCTL_ONE,
>  		.extra2		= SYSCTL_FOUR,
>  	},
> +	{
> +		.procname	= "drop_fs_caches",
> +		.data		= NULL,

NULL is already the default.

> +		.maxlen		= 256,

The maxlen field refers to the data field.
As there is no data, there should be no maxlen.

> +		.mode		= 0200,
> +		.proc_handler	= drop_fs_caches_sysctl_handler,
> +		.extra1         = SYSCTL_ONE,
> +		.extra2         = SYSCTL_FOUR,

These extras are meant as parameters for generic handlers.
Inlining the limits into your hander makes it much clearer.

> +	},
>  	{
>  		.procname	= "page_lock_unfairness",
>  		.data		= &sysctl_page_lock_unfairness,
> -- 
> 2.31.1
>
Jan Kara Oct. 10, 2024, 5:04 p.m. UTC | #5
On Thu 10-10-24 09:35:46, Benjamin Coddington wrote:
> On 10 Oct 2024, at 8:16, Jan Kara wrote:
> 
> > On Thu 10-10-24 19:25:42, Ye Bin wrote:
> >> From: Ye Bin <yebin10@huawei.com>
> >>
> >> In order to better analyze the issue of file system uninstallation caused
> >> by kernel module opening files, it is necessary to perform dentry recycling
> >
> > I don't quite understand the use case you mention here. Can you explain it
> > a bit more (that being said I've needed dropping caches for a particular sb
> > myself a few times for debugging purposes so I generally agree it is a
> > useful feature).
> >
> >> on a single file system. But now, apart from global dentry recycling, it is
> >> not supported to do dentry recycling on a single file system separately.
> >> This feature has usage scenarios in problem localization scenarios.At the
> >> same time, it also provides users with a slightly fine-grained
> >> pagecache/entry recycling mechanism.
> >> This patch supports the recycling of pagecache/entry for individual file
> >> systems.
> >>
> >> Signed-off-by: Ye Bin <yebin10@huawei.com>
> >> ---
> >>  fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
> >>  include/linux/mm.h |  2 ++
> >>  kernel/sysctl.c    |  9 +++++++++
> >>  3 files changed, 54 insertions(+)
> >>
> >> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> >> index d45ef541d848..99d412cf3e52 100644
> >> --- a/fs/drop_caches.c
> >> +++ b/fs/drop_caches.c
> >> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
> >>  	}
> >>  	return 0;
> >>  }
> >> +
> >> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
> >> +				  void *buffer, size_t *length, loff_t *ppos)
> >> +{
> >> +	unsigned int major, minor;
> >> +	unsigned int ctl;
> >> +	struct super_block *sb;
> >> +	static int stfu;
> >> +
> >> +	if (!write)
> >> +		return 0;
> >> +
> >> +	if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
> >> +		return -EINVAL;
> >
> > I think specifying bdev major & minor number is not a great interface these
> > days. In particular for filesystems which are not bdev based such as NFS. I
> > think specifying path to some file/dir in the filesystem is nicer and you
> > can easily resolve that to sb here as well.
> 
> Slight disagreement here since NFS uses set_anon_super() and major:minor
> will work fine with it.

OK, fair point, anon bdev numbers can be used. But filesystems using
get_tree_nodev() would still be problematic.

> I'd prefer it actually since it avoids this
> interface having to do a pathwalk and make decisions about what's mounted
> where and in what namespace.

I don't understand the problem here. We'd do user_path_at(AT_FDCWD, ...,
&path) and then take path.mnt->mnt_sb. That doesn't look terribly
complicated to me. Plus it naturally deals with issues like namespacing
etc. although they are not a huge issue here because the functionality
should be restricted to CAP_SYS_ADMIN anyway.

								Honza
Al Viro Oct. 10, 2024, 5:17 p.m. UTC | #6
On Thu, Oct 10, 2024 at 07:25:42PM +0800, Ye Bin wrote:

> +	if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
> +		return -EINVAL;
> +
> +	if (ctl < *((int *)table->extra1) || ctl > *((int *)table->extra2))
> +		return -EINVAL;
> +
> +	sb = user_get_super(MKDEV(major, minor), false);
> +	if (!sb)
> +		return -EINVAL;

Odd user interface aside, you do realize that you've just grabbed ->s_umount
from inside a ->write() instance?  Considering how much can be grabbed
under ->s_umount... Ow.

IOW, I very much doubt that doing that kind of stuff from sysctl is a good
idea - if nothing else, we'll end up with syzbot screaming its head off
about many and varied potential deadlocks, as soon as it discovers that one.
And I wouldn't swear that all of those would be false positives.
Amir Goldstein Oct. 11, 2024, 11:44 a.m. UTC | #7
On Thu, Oct 10, 2024 at 7:04 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 10-10-24 09:35:46, Benjamin Coddington wrote:
> > On 10 Oct 2024, at 8:16, Jan Kara wrote:
> >
> > > On Thu 10-10-24 19:25:42, Ye Bin wrote:
> > >> From: Ye Bin <yebin10@huawei.com>
> > >>
> > >> In order to better analyze the issue of file system uninstallation caused
> > >> by kernel module opening files, it is necessary to perform dentry recycling
> > >
> > > I don't quite understand the use case you mention here. Can you explain it
> > > a bit more (that being said I've needed dropping caches for a particular sb
> > > myself a few times for debugging purposes so I generally agree it is a
> > > useful feature).
> > >
> > >> on a single file system. But now, apart from global dentry recycling, it is
> > >> not supported to do dentry recycling on a single file system separately.
> > >> This feature has usage scenarios in problem localization scenarios.At the
> > >> same time, it also provides users with a slightly fine-grained
> > >> pagecache/entry recycling mechanism.
> > >> This patch supports the recycling of pagecache/entry for individual file
> > >> systems.
> > >>
> > >> Signed-off-by: Ye Bin <yebin10@huawei.com>
> > >> ---
> > >>  fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
> > >>  include/linux/mm.h |  2 ++
> > >>  kernel/sysctl.c    |  9 +++++++++
> > >>  3 files changed, 54 insertions(+)
> > >>
> > >> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> > >> index d45ef541d848..99d412cf3e52 100644
> > >> --- a/fs/drop_caches.c
> > >> +++ b/fs/drop_caches.c
> > >> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
> > >>    }
> > >>    return 0;
> > >>  }
> > >> +
> > >> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
> > >> +                            void *buffer, size_t *length, loff_t *ppos)
> > >> +{
> > >> +  unsigned int major, minor;
> > >> +  unsigned int ctl;
> > >> +  struct super_block *sb;
> > >> +  static int stfu;
> > >> +
> > >> +  if (!write)
> > >> +          return 0;
> > >> +
> > >> +  if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
> > >> +          return -EINVAL;
> > >
> > > I think specifying bdev major & minor number is not a great interface these
> > > days. In particular for filesystems which are not bdev based such as NFS. I
> > > think specifying path to some file/dir in the filesystem is nicer and you
> > > can easily resolve that to sb here as well.
> >
> > Slight disagreement here since NFS uses set_anon_super() and major:minor
> > will work fine with it.
>
> OK, fair point, anon bdev numbers can be used. But filesystems using
> get_tree_nodev() would still be problematic.
>
> > I'd prefer it actually since it avoids this
> > interface having to do a pathwalk and make decisions about what's mounted
> > where and in what namespace.
>
> I don't understand the problem here. We'd do user_path_at(AT_FDCWD, ...,
> &path) and then take path.mnt->mnt_sb. That doesn't look terribly
> complicated to me. Plus it naturally deals with issues like namespacing
> etc. although they are not a huge issue here because the functionality
> should be restricted to CAP_SYS_ADMIN anyway.
>

Both looking up bdev and looking up path from write() can make syzbot
and lockdep very upset:
https://lore.kernel.org/linux-fsdevel/00000000000098f75506153551a1@google.com/

I thought Christian had a proposal for dropping cache per-sb API via fadvise()
or something?

Why use sysfs API for this and not fd to reference an sb?

Thanks,
Amir.
Jan Kara Oct. 14, 2024, 11:24 a.m. UTC | #8
On Fri 11-10-24 13:44:57, Amir Goldstein wrote:
> On Thu, Oct 10, 2024 at 7:04 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 10-10-24 09:35:46, Benjamin Coddington wrote:
> > > On 10 Oct 2024, at 8:16, Jan Kara wrote:
> > >
> > > > On Thu 10-10-24 19:25:42, Ye Bin wrote:
> > > >> From: Ye Bin <yebin10@huawei.com>
> > > >>
> > > >> In order to better analyze the issue of file system uninstallation caused
> > > >> by kernel module opening files, it is necessary to perform dentry recycling
> > > >
> > > > I don't quite understand the use case you mention here. Can you explain it
> > > > a bit more (that being said I've needed dropping caches for a particular sb
> > > > myself a few times for debugging purposes so I generally agree it is a
> > > > useful feature).
> > > >
> > > >> on a single file system. But now, apart from global dentry recycling, it is
> > > >> not supported to do dentry recycling on a single file system separately.
> > > >> This feature has usage scenarios in problem localization scenarios.At the
> > > >> same time, it also provides users with a slightly fine-grained
> > > >> pagecache/entry recycling mechanism.
> > > >> This patch supports the recycling of pagecache/entry for individual file
> > > >> systems.
> > > >>
> > > >> Signed-off-by: Ye Bin <yebin10@huawei.com>
> > > >> ---
> > > >>  fs/drop_caches.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
> > > >>  include/linux/mm.h |  2 ++
> > > >>  kernel/sysctl.c    |  9 +++++++++
> > > >>  3 files changed, 54 insertions(+)
> > > >>
> > > >> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> > > >> index d45ef541d848..99d412cf3e52 100644
> > > >> --- a/fs/drop_caches.c
> > > >> +++ b/fs/drop_caches.c
> > > >> @@ -77,3 +77,46 @@ int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
> > > >>    }
> > > >>    return 0;
> > > >>  }
> > > >> +
> > > >> +int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
> > > >> +                            void *buffer, size_t *length, loff_t *ppos)
> > > >> +{
> > > >> +  unsigned int major, minor;
> > > >> +  unsigned int ctl;
> > > >> +  struct super_block *sb;
> > > >> +  static int stfu;
> > > >> +
> > > >> +  if (!write)
> > > >> +          return 0;
> > > >> +
> > > >> +  if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
> > > >> +          return -EINVAL;
> > > >
> > > > I think specifying bdev major & minor number is not a great interface these
> > > > days. In particular for filesystems which are not bdev based such as NFS. I
> > > > think specifying path to some file/dir in the filesystem is nicer and you
> > > > can easily resolve that to sb here as well.
> > >
> > > Slight disagreement here since NFS uses set_anon_super() and major:minor
> > > will work fine with it.
> >
> > OK, fair point, anon bdev numbers can be used. But filesystems using
> > get_tree_nodev() would still be problematic.
> >
> > > I'd prefer it actually since it avoids this
> > > interface having to do a pathwalk and make decisions about what's mounted
> > > where and in what namespace.
> >
> > I don't understand the problem here. We'd do user_path_at(AT_FDCWD, ...,
> > &path) and then take path.mnt->mnt_sb. That doesn't look terribly
> > complicated to me. Plus it naturally deals with issues like namespacing
> > etc. although they are not a huge issue here because the functionality
> > should be restricted to CAP_SYS_ADMIN anyway.
> >
> 
> Both looking up bdev and looking up path from write() can make syzbot
> and lockdep very upset:
> https://lore.kernel.org/linux-fsdevel/00000000000098f75506153551a1@google.com/

OK, thanks for the reference.

> I thought Christian had a proposal for dropping cache per-sb API via
> fadvise() or something?
> 
> Why use sysfs API for this and not fd to reference an sb?

I guess because the original drop_caches is in the sysfs. But yes, in
principle we could use fd pointing to the filesystem for this. I'm just not
sure fadvise(2) is really the right syscall for this because it is
currently all about page cache of a file and this call should shrink also
the dcache / icache. But ioctl() (not sure if this debug-mostly
functionality is worth a syscall) implemented in VFS would certainly be
possible and perhaps nicer than sysfs interface.

								Honza
diff mbox series

Patch

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d45ef541d848..99d412cf3e52 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -77,3 +77,46 @@  int drop_caches_sysctl_handler(const struct ctl_table *table, int write,
 	}
 	return 0;
 }
+
+int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
+				  void *buffer, size_t *length, loff_t *ppos)
+{
+	unsigned int major, minor;
+	unsigned int ctl;
+	struct super_block *sb;
+	static int stfu;
+
+	if (!write)
+		return 0;
+
+	if (sscanf(buffer, "%u:%u:%u", &major, &minor, &ctl) != 3)
+		return -EINVAL;
+
+	if (ctl < *((int *)table->extra1) || ctl > *((int *)table->extra2))
+		return -EINVAL;
+
+	sb = user_get_super(MKDEV(major, minor), false);
+	if (!sb)
+		return -EINVAL;
+
+	if (ctl & 1) {
+		lru_add_drain_all();
+		drop_pagecache_sb(sb, NULL);
+		count_vm_event(DROP_PAGECACHE);
+	}
+
+	if (ctl & 2) {
+		shrink_dcache_sb(sb);
+		shrink_icache_sb(sb);
+		count_vm_event(DROP_SLAB);
+	}
+
+	drop_super(sb);
+
+	if (!stfu)
+		pr_info("%s (%d): drop_fs_caches: %u:%u:%d\n", current->comm,
+			task_pid_nr(current), major, minor, ctl);
+	stfu |= ctl & 4;
+
+	return 0;
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 344541f8cba0..43079478296f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3788,6 +3788,8 @@  extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(const struct ctl_table *, int, void *, size_t *,
 		loff_t *);
+int drop_fs_caches_sysctl_handler(const struct ctl_table *table, int write,
+				  void *buffer, size_t *length, loff_t *ppos);
 #endif
 
 void drop_slab(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 79e6cb1d5c48..d434cbe10e47 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2101,6 +2101,15 @@  static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ONE,
 		.extra2		= SYSCTL_FOUR,
 	},
+	{
+		.procname	= "drop_fs_caches",
+		.data		= NULL,
+		.maxlen		= 256,
+		.mode		= 0200,
+		.proc_handler	= drop_fs_caches_sysctl_handler,
+		.extra1         = SYSCTL_ONE,
+		.extra2         = SYSCTL_FOUR,
+	},
 	{
 		.procname	= "page_lock_unfairness",
 		.data		= &sysctl_page_lock_unfairness,