kernel: sysctl: make drop_caches write-only

Message ID	20191031221602.9375-1-hannes@cmpxchg.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=qTSC=YY=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A91062086D From: Johannes Weiner <hannes@cmpxchg.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: [PATCH] kernel: sysctl: make drop_caches write-only Date: Thu, 31 Oct 2019 18:16:02 -0400 Message-Id: <20191031221602.9375-1-hannes@cmpxchg.org> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	kernel: sysctl: make drop_caches write-only \| expand kernel: sysctl: make drop_caches write-only

Johannes Weiner Oct. 31, 2019, 10:16 p.m. UTC

Currently, the drop_caches proc file and sysctl read back the last
value written, suggesting this is somehow a stateful setting instead
of a one-time command. Make it write-only, like e.g. compact_memory.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sysctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Andrew Morton Oct. 31, 2019, 11:28 p.m. UTC | #1

On Thu, 31 Oct 2019 18:16:02 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
> 
> ...
>
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
>  		.procname	= "drop_caches",
>  		.data		= &sysctl_drop_caches,
>  		.maxlen		= sizeof(int),
> -		.mode		= 0644,
> +		.mode		= 0200,
>  		.proc_handler	= drop_caches_sysctl_handler,
>  		.extra1		= SYSCTL_ONE,
>  		.extra2		= &four,

hm.

Risk: some (odd) userspace code will break.  Fixable by manually chmodding
it back again.

Reward: very little.

Is the reward worth the risk?

Chris Down Nov. 1, 2019, 10:58 a.m. UTC | #2

Johannes Weiner writes:
>Currently, the drop_caches proc file and sysctl read back the last
>value written, suggesting this is somehow a stateful setting instead
>of a one-time command. Make it write-only, like e.g. compact_memory.
>
>Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Since we already have the kmsg notifier when it's used, this seems reasonable 
to me.

Acked-by: Chris Down <chris@chrisdown.name>

Chris Down Nov. 1, 2019, 11:09 a.m. UTC | #3

Hm, not sure why my client didn't show this reply.

Andrew Morton writes:
>Risk: some (odd) userspace code will break.  Fixable by manually chmodding
>it back again.

The only scenario I can construct in my head is that someone has built 
something to watch drop_caches for modification, but we already have the kmsg 
output for that.

>Reward: very little.
>
>Is the reward worth the risk?

There is evidence that this has already caused confusion[0] for many, judging 
by the number of views and votes. I think the reward is higher than stated 
here, since it makes the intent and lack of persistent API in the API clearer, 
and less likely to cause confusion in future.

0: https://unix.stackexchange.com/q/17936/10762

Chris Down Nov. 1, 2019, 11:09 a.m. UTC | #4

Chris Down writes:
>There is evidence that this has already caused confusion[0] for many, 
>judging by the number of views and votes. I think the reward is higher 
>than stated here, since it makes the intent and lack of persistent API 
>in the API clearer, and less likely to cause confusion in future.
>
>0: https://unix.stackexchange.com/q/17936/10762

s/persistent API/persistent state/

Johannes Weiner Nov. 1, 2019, 2:45 p.m. UTC | #5

On Fri, Nov 01, 2019 at 11:09:01AM +0000, Chris Down wrote:
> Hm, not sure why my client didn't show this reply.
> 
> Andrew Morton writes:
> > Risk: some (odd) userspace code will break.  Fixable by manually chmodding
> > it back again.
> 
> The only scenario I can construct in my head is that someone has built
> something to watch drop_caches for modification, but we already have the
> kmsg output for that.
> 
> > Reward: very little.
> > 
> > Is the reward worth the risk?
> 
> There is evidence that this has already caused confusion[0] for many,
> judging by the number of views and votes. I think the reward is higher than
> stated here, since it makes the intent and lack of persistent API in the API
> clearer, and less likely to cause confusion in future.
> 
> 0: https://unix.stackexchange.com/q/17936/10762

Yes, I should have mentioned this in the changelog, but:

While mitigating a VM problem at scale in our fleet, there was
confusion about whether writing to this file will permanently switch
the kernel into a non-caching mode. This influences the decision
making in a tense situation, where tens of people are trying to fix
tens of thousands of affected machines: Do we need a rollback
strategy? What are the performance implications of operating in a
non-caching state for several days? It also caused confusion when the
kernel team said we may need to write the file several times to make
sure it's effective ("But it already reads back 3?").

Andrew Morton Nov. 1, 2019, 6:59 p.m. UTC | #6

On Fri, 1 Nov 2019 10:45:40 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Nov 01, 2019 at 11:09:01AM +0000, Chris Down wrote:
> > Hm, not sure why my client didn't show this reply.
> > 
> > Andrew Morton writes:
> > > Risk: some (odd) userspace code will break.  Fixable by manually chmodding
> > > it back again.
> > 
> > The only scenario I can construct in my head is that someone has built
> > something to watch drop_caches for modification, but we already have the
> > kmsg output for that.

The scenario is that something opens /proc/sys/vm/drop_caches for
reading, gets unexpected EPERM and blows up?

> > > Reward: very little.
> > > 
> > > Is the reward worth the risk?
> > 
> > There is evidence that this has already caused confusion[0] for many,
> > judging by the number of views and votes. I think the reward is higher than
> > stated here, since it makes the intent and lack of persistent API in the API
> > clearer, and less likely to cause confusion in future.
> > 
> > 0: https://unix.stackexchange.com/q/17936/10762
> 
> Yes, I should have mentioned this in the changelog, but:
> 
> While mitigating a VM problem at scale in our fleet, there was
> confusion about whether writing to this file will permanently switch
> the kernel into a non-caching mode. This influences the decision
> making in a tense situation, where tens of people are trying to fix
> tens of thousands of affected machines: Do we need a rollback
> strategy? What are the performance implications of operating in a
> non-caching state for several days? It also caused confusion when the
> kernel team said we may need to write the file several times to make
> sure it's effective ("But it already reads back 3?").

OK.  What if we make reads always return "0"?  That will fix the
misleading output and is more backwards-compatible?

Chris Down Nov. 1, 2019, 7:24 p.m. UTC | #7

Andrew Morton writes:
>> > The only scenario I can construct in my head is that someone has built
>> > something to watch drop_caches for modification, but we already have the
>> > kmsg output for that.
>
>The scenario is that something opens /proc/sys/vm/drop_caches for
>reading, gets unexpected EPERM and blows up?

Right, but...

>OK.  What if we make reads always return "0"?  That will fix the
>misleading output and is more backwards-compatible?

...I'm not convinced that if an application has no error boundary for that 
EPERM that it can tolerate a change in behaviour, either. I mean, if it's 
opening it at all, presumably it intends to do *something* based on the value 
(regardless of import or lack thereof). It may do nothing, but it's not 
possible to know whether that's better or worse than blowing up.

I have mixed feelings on this one. Pragmatically, as someone who programs in 
userspace, I'd like failures based on changes in infrastructure to be loud, not 
silent. If I'm doing something which doesn't work, I'd like to know about it. 
Of course, one can make the argument that as a user of such an application, 
sometimes you don't have that luxury.

Either change is an upgrade from the current situation, at least. I prefer 
towards whatever makes the API the least confusing, which appears to be 
Johannes' original change, but I'd support a patch which always set it to 
0 instead if it was deemed safer.

Andrew Morton Nov. 1, 2019, 7:29 p.m. UTC | #8

On Fri, 1 Nov 2019 19:24:05 +0000 Chris Down <chris@chrisdown.name> wrote:

> Andrew Morton writes:
> >> > The only scenario I can construct in my head is that someone has built
> >> > something to watch drop_caches for modification, but we already have the
> >> > kmsg output for that.
> >
> >The scenario is that something opens /proc/sys/vm/drop_caches for
> >reading, gets unexpected EPERM and blows up?
> 
> Right, but...
> 
> >OK.  What if we make reads always return "0"?  That will fix the
> >misleading output and is more backwards-compatible?
> 
> ...I'm not convinced that if an application has no error boundary for that 
> EPERM that it can tolerate a change in behaviour, either. I mean, if it's 
> opening it at all, presumably it intends to do *something* based on the value 
> (regardless of import or lack thereof). It may do nothing, but it's not 
> possible to know whether that's better or worse than blowing up.
> 
> I have mixed feelings on this one. Pragmatically, as someone who programs in 
> userspace, I'd like failures based on changes in infrastructure to be loud, not 
> silent. If I'm doing something which doesn't work, I'd like to know about it. 
> Of course, one can make the argument that as a user of such an application, 
> sometimes you don't have that luxury.
> 
> Either change is an upgrade from the current situation, at least. I prefer 
> towards whatever makes the API the least confusing, which appears to be 
> Johannes' original change, but I'd support a patch which always set it to 
> 0 instead if it was deemed safer.

On the other hand..  As I mentioned earlier, if someone's code is
failing because of the permissions change, they can chmod
/proc/sys/vm/drop_caches at boot time and be happy.  They have no such
workaround if their software misbehaves due to a read always returning
"0".

Andrew Morton Nov. 1, 2019, 7:35 p.m. UTC | #9

On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> > Either change is an upgrade from the current situation, at least. I prefer 
> > towards whatever makes the API the least confusing, which appears to be 
> > Johannes' original change, but I'd support a patch which always set it to 
> > 0 instead if it was deemed safer.
> 
> On the other hand..  As I mentioned earlier, if someone's code is
> failing because of the permissions change, they can chmod
> /proc/sys/vm/drop_caches at boot time and be happy.  They have no such
> workaround if their software misbehaves due to a read always returning
> "0".

I lied.  I can chmod things in /proc but I can't chmod things in
/proc/sys/vm.  Huh, why did we do that?

Alexey Dobriyan Nov. 2, 2019, 3:55 p.m. UTC | #10

On Fri, Nov 01, 2019 at 12:35:44PM -0700, Andrew Morton wrote:
> On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > Either change is an upgrade from the current situation, at least. I prefer 
> > > towards whatever makes the API the least confusing, which appears to be 
> > > Johannes' original change, but I'd support a patch which always set it to 
> > > 0 instead if it was deemed safer.
> > 
> > On the other hand..  As I mentioned earlier, if someone's code is
> > failing because of the permissions change, they can chmod
> > /proc/sys/vm/drop_caches at boot time and be happy.  They have no such
> > workaround if their software misbehaves due to a read always returning
> > "0".
> 
> I lied.  I can chmod things in /proc but I can't chmod things in
> /proc/sys/vm.  Huh, why did we do that?

To conserve memory! It was in 2007.
For the record I support 0200 on vm.drop_caches.

	commit 77b14db502cb85a031fe8fde6c85d52f3e0acb63
	[PATCH] sysctl: reimplement the sysctl proc support

	+static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
	+{
	+       struct inode *inode = dentry->d_inode;
	+       int error;
	+
	+       if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
	+               return -EPERM;

Eric W. Biederman Nov. 3, 2019, 7 p.m. UTC | #11

Alexey Dobriyan <adobriyan@gmail.com> writes:

> On Fri, Nov 01, 2019 at 12:35:44PM -0700, Andrew Morton wrote:
>> On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
>> 
>> > > Either change is an upgrade from the current situation, at least. I prefer 
>> > > towards whatever makes the API the least confusing, which appears to be 
>> > > Johannes' original change, but I'd support a patch which always set it to 
>> > > 0 instead if it was deemed safer.
>> > 
>> > On the other hand..  As I mentioned earlier, if someone's code is
>> > failing because of the permissions change, they can chmod
>> > /proc/sys/vm/drop_caches at boot time and be happy.  They have no such
>> > workaround if their software misbehaves due to a read always returning
>> > "0".
>> 
>> I lied.  I can chmod things in /proc but I can't chmod things in
>> /proc/sys/vm.  Huh, why did we do that?
>
> To conserve memory! It was in 2007.
> For the record I support 0200 on vm.drop_caches.
>
> 	commit 77b14db502cb85a031fe8fde6c85d52f3e0acb63
> 	[PATCH] sysctl: reimplement the sysctl proc support
>
> 	+static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
> 	+{
> 	+       struct inode *inode = dentry->d_inode;
> 	+       int error;
> 	+
> 	+       if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
> 	+               return -EPERM;


Almost.

The rewrite was both to concerve memory and to support the network
namespace.  Which required a different view of proc files.

But in this case we have always unconditionally called sysctl_perm.  The
change above at best removed a layer of obfuscation that made it look
like some other permission check was being honored.

Eric

David Hildenbrand Nov. 4, 2019, 10:37 a.m. UTC | #12

On 31.10.19 23:16, Johannes Weiner wrote:
> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>   kernel/sysctl.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 31ece1120aa4..50373984a5e2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
>   		.procname	= "drop_caches",
>   		.data		= &sysctl_drop_caches,
>   		.maxlen		= sizeof(int),
> -		.mode		= 0644,
> +		.mode		= 0200,
>   		.proc_handler	= drop_caches_sysctl_handler,
>   		.extra1		= SYSCTL_ONE,
>   		.extra2		= &four,
> 

Makes perfect sense to me (and we might notice while in next/master if 
this breaks something, hopefully)

Acked-by: David Hildenbrand <david@redhat.com>

Vlastimil Babka Nov. 4, 2019, 1:25 p.m. UTC | #13

On 10/31/19 11:16 PM, Johannes Weiner wrote:
> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Generic tools such as "sysctl -a" will be fine as they already are fine
with compact_memory being 0200, and if somebody has a specific script,
they are maybe already making wrong decisions, so at least hopefully
they will learn and adjust.

> ---
>  kernel/sysctl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 31ece1120aa4..50373984a5e2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
>  		.procname	= "drop_caches",
>  		.data		= &sysctl_drop_caches,
>  		.maxlen		= sizeof(int),
> -		.mode		= 0644,
> +		.mode		= 0200,
>  		.proc_handler	= drop_caches_sysctl_handler,
>  		.extra1		= SYSCTL_ONE,
>  		.extra2		= &four,
>

Michal Hocko Nov. 5, 2019, 6:20 a.m. UTC | #14

On Thu 31-10-19 18:16:02, Johannes Weiner wrote:
> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Please add a note about the confusion this has caused already.
The link posted by Chris would be useful as well.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  kernel/sysctl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 31ece1120aa4..50373984a5e2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
>  		.procname	= "drop_caches",
>  		.data		= &sysctl_drop_caches,
>  		.maxlen		= sizeof(int),
> -		.mode		= 0644,
> +		.mode		= 0200,
>  		.proc_handler	= drop_caches_sysctl_handler,
>  		.extra1		= SYSCTL_ONE,
>  		.extra2		= &four,
> -- 
> 2.23.0

kernel: sysctl: make drop_caches write-only

Commit Message

Comments

Patch