Message ID | 20191031221602.9375-1-hannes@cmpxchg.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | kernel: sysctl: make drop_caches write-only | expand |
On Thu, 31 Oct 2019 18:16:02 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > Currently, the drop_caches proc file and sysctl read back the last > value written, suggesting this is somehow a stateful setting instead > of a one-time command. Make it write-only, like e.g. compact_memory. > > ... > > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = { > .procname = "drop_caches", > .data = &sysctl_drop_caches, > .maxlen = sizeof(int), > - .mode = 0644, > + .mode = 0200, > .proc_handler = drop_caches_sysctl_handler, > .extra1 = SYSCTL_ONE, > .extra2 = &four, hm. Risk: some (odd) userspace code will break. Fixable by manually chmodding it back again. Reward: very little. Is the reward worth the risk?
Johannes Weiner writes: >Currently, the drop_caches proc file and sysctl read back the last >value written, suggesting this is somehow a stateful setting instead >of a one-time command. Make it write-only, like e.g. compact_memory. > >Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Since we already have the kmsg notifier when it's used, this seems reasonable to me. Acked-by: Chris Down <chris@chrisdown.name>
Hm, not sure why my client didn't show this reply. Andrew Morton writes: >Risk: some (odd) userspace code will break. Fixable by manually chmodding >it back again. The only scenario I can construct in my head is that someone has built something to watch drop_caches for modification, but we already have the kmsg output for that. >Reward: very little. > >Is the reward worth the risk? There is evidence that this has already caused confusion[0] for many, judging by the number of views and votes. I think the reward is higher than stated here, since it makes the intent and lack of persistent API in the API clearer, and less likely to cause confusion in future. 0: https://unix.stackexchange.com/q/17936/10762
Chris Down writes: >There is evidence that this has already caused confusion[0] for many, >judging by the number of views and votes. I think the reward is higher >than stated here, since it makes the intent and lack of persistent API >in the API clearer, and less likely to cause confusion in future. > >0: https://unix.stackexchange.com/q/17936/10762 s/persistent API/persistent state/
On Fri, Nov 01, 2019 at 11:09:01AM +0000, Chris Down wrote: > Hm, not sure why my client didn't show this reply. > > Andrew Morton writes: > > Risk: some (odd) userspace code will break. Fixable by manually chmodding > > it back again. > > The only scenario I can construct in my head is that someone has built > something to watch drop_caches for modification, but we already have the > kmsg output for that. > > > Reward: very little. > > > > Is the reward worth the risk? > > There is evidence that this has already caused confusion[0] for many, > judging by the number of views and votes. I think the reward is higher than > stated here, since it makes the intent and lack of persistent API in the API > clearer, and less likely to cause confusion in future. > > 0: https://unix.stackexchange.com/q/17936/10762 Yes, I should have mentioned this in the changelog, but: While mitigating a VM problem at scale in our fleet, there was confusion about whether writing to this file will permanently switch the kernel into a non-caching mode. This influences the decision making in a tense situation, where tens of people are trying to fix tens of thousands of affected machines: Do we need a rollback strategy? What are the performance implications of operating in a non-caching state for several days? It also caused confusion when the kernel team said we may need to write the file several times to make sure it's effective ("But it already reads back 3?").
On Fri, 1 Nov 2019 10:45:40 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote: > On Fri, Nov 01, 2019 at 11:09:01AM +0000, Chris Down wrote: > > Hm, not sure why my client didn't show this reply. > > > > Andrew Morton writes: > > > Risk: some (odd) userspace code will break. Fixable by manually chmodding > > > it back again. > > > > The only scenario I can construct in my head is that someone has built > > something to watch drop_caches for modification, but we already have the > > kmsg output for that. The scenario is that something opens /proc/sys/vm/drop_caches for reading, gets unexpected EPERM and blows up? > > > Reward: very little. > > > > > > Is the reward worth the risk? > > > > There is evidence that this has already caused confusion[0] for many, > > judging by the number of views and votes. I think the reward is higher than > > stated here, since it makes the intent and lack of persistent API in the API > > clearer, and less likely to cause confusion in future. > > > > 0: https://unix.stackexchange.com/q/17936/10762 > > Yes, I should have mentioned this in the changelog, but: > > While mitigating a VM problem at scale in our fleet, there was > confusion about whether writing to this file will permanently switch > the kernel into a non-caching mode. This influences the decision > making in a tense situation, where tens of people are trying to fix > tens of thousands of affected machines: Do we need a rollback > strategy? What are the performance implications of operating in a > non-caching state for several days? It also caused confusion when the > kernel team said we may need to write the file several times to make > sure it's effective ("But it already reads back 3?"). OK. What if we make reads always return "0"? That will fix the misleading output and is more backwards-compatible?
Andrew Morton writes: >> > The only scenario I can construct in my head is that someone has built >> > something to watch drop_caches for modification, but we already have the >> > kmsg output for that. > >The scenario is that something opens /proc/sys/vm/drop_caches for >reading, gets unexpected EPERM and blows up? Right, but... >OK. What if we make reads always return "0"? That will fix the >misleading output and is more backwards-compatible? ...I'm not convinced that if an application has no error boundary for that EPERM that it can tolerate a change in behaviour, either. I mean, if it's opening it at all, presumably it intends to do *something* based on the value (regardless of import or lack thereof). It may do nothing, but it's not possible to know whether that's better or worse than blowing up. I have mixed feelings on this one. Pragmatically, as someone who programs in userspace, I'd like failures based on changes in infrastructure to be loud, not silent. If I'm doing something which doesn't work, I'd like to know about it. Of course, one can make the argument that as a user of such an application, sometimes you don't have that luxury. Either change is an upgrade from the current situation, at least. I prefer towards whatever makes the API the least confusing, which appears to be Johannes' original change, but I'd support a patch which always set it to 0 instead if it was deemed safer.
On Fri, 1 Nov 2019 19:24:05 +0000 Chris Down <chris@chrisdown.name> wrote: > Andrew Morton writes: > >> > The only scenario I can construct in my head is that someone has built > >> > something to watch drop_caches for modification, but we already have the > >> > kmsg output for that. > > > >The scenario is that something opens /proc/sys/vm/drop_caches for > >reading, gets unexpected EPERM and blows up? > > Right, but... > > >OK. What if we make reads always return "0"? That will fix the > >misleading output and is more backwards-compatible? > > ...I'm not convinced that if an application has no error boundary for that > EPERM that it can tolerate a change in behaviour, either. I mean, if it's > opening it at all, presumably it intends to do *something* based on the value > (regardless of import or lack thereof). It may do nothing, but it's not > possible to know whether that's better or worse than blowing up. > > I have mixed feelings on this one. Pragmatically, as someone who programs in > userspace, I'd like failures based on changes in infrastructure to be loud, not > silent. If I'm doing something which doesn't work, I'd like to know about it. > Of course, one can make the argument that as a user of such an application, > sometimes you don't have that luxury. > > Either change is an upgrade from the current situation, at least. I prefer > towards whatever makes the API the least confusing, which appears to be > Johannes' original change, but I'd support a patch which always set it to > 0 instead if it was deemed safer. On the other hand.. As I mentioned earlier, if someone's code is failing because of the permissions change, they can chmod /proc/sys/vm/drop_caches at boot time and be happy. They have no such workaround if their software misbehaves due to a read always returning "0".
On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > Either change is an upgrade from the current situation, at least. I prefer > > towards whatever makes the API the least confusing, which appears to be > > Johannes' original change, but I'd support a patch which always set it to > > 0 instead if it was deemed safer. > > On the other hand.. As I mentioned earlier, if someone's code is > failing because of the permissions change, they can chmod > /proc/sys/vm/drop_caches at boot time and be happy. They have no such > workaround if their software misbehaves due to a read always returning > "0". I lied. I can chmod things in /proc but I can't chmod things in /proc/sys/vm. Huh, why did we do that?
On Fri, Nov 01, 2019 at 12:35:44PM -0700, Andrew Morton wrote: > On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > > > > Either change is an upgrade from the current situation, at least. I prefer > > > towards whatever makes the API the least confusing, which appears to be > > > Johannes' original change, but I'd support a patch which always set it to > > > 0 instead if it was deemed safer. > > > > On the other hand.. As I mentioned earlier, if someone's code is > > failing because of the permissions change, they can chmod > > /proc/sys/vm/drop_caches at boot time and be happy. They have no such > > workaround if their software misbehaves due to a read always returning > > "0". > > I lied. I can chmod things in /proc but I can't chmod things in > /proc/sys/vm. Huh, why did we do that? To conserve memory! It was in 2007. For the record I support 0200 on vm.drop_caches. commit 77b14db502cb85a031fe8fde6c85d52f3e0acb63 [PATCH] sysctl: reimplement the sysctl proc support +static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr) +{ + struct inode *inode = dentry->d_inode; + int error; + + if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID)) + return -EPERM;
Alexey Dobriyan <adobriyan@gmail.com> writes: > On Fri, Nov 01, 2019 at 12:35:44PM -0700, Andrew Morton wrote: >> On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: >> >> > > Either change is an upgrade from the current situation, at least. I prefer >> > > towards whatever makes the API the least confusing, which appears to be >> > > Johannes' original change, but I'd support a patch which always set it to >> > > 0 instead if it was deemed safer. >> > >> > On the other hand.. As I mentioned earlier, if someone's code is >> > failing because of the permissions change, they can chmod >> > /proc/sys/vm/drop_caches at boot time and be happy. They have no such >> > workaround if their software misbehaves due to a read always returning >> > "0". >> >> I lied. I can chmod things in /proc but I can't chmod things in >> /proc/sys/vm. Huh, why did we do that? > > To conserve memory! It was in 2007. > For the record I support 0200 on vm.drop_caches. > > commit 77b14db502cb85a031fe8fde6c85d52f3e0acb63 > [PATCH] sysctl: reimplement the sysctl proc support > > +static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr) > +{ > + struct inode *inode = dentry->d_inode; > + int error; > + > + if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID)) > + return -EPERM; Almost. The rewrite was both to concerve memory and to support the network namespace. Which required a different view of proc files. But in this case we have always unconditionally called sysctl_perm. The change above at best removed a layer of obfuscation that made it look like some other permission check was being honored. Eric
On 31.10.19 23:16, Johannes Weiner wrote: > Currently, the drop_caches proc file and sysctl read back the last > value written, suggesting this is somehow a stateful setting instead > of a one-time command. Make it write-only, like e.g. compact_memory. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > kernel/sysctl.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 31ece1120aa4..50373984a5e2 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = { > .procname = "drop_caches", > .data = &sysctl_drop_caches, > .maxlen = sizeof(int), > - .mode = 0644, > + .mode = 0200, > .proc_handler = drop_caches_sysctl_handler, > .extra1 = SYSCTL_ONE, > .extra2 = &four, > Makes perfect sense to me (and we might notice while in next/master if this breaks something, hopefully) Acked-by: David Hildenbrand <david@redhat.com>
On 10/31/19 11:16 PM, Johannes Weiner wrote: > Currently, the drop_caches proc file and sysctl read back the last > value written, suggesting this is somehow a stateful setting instead > of a one-time command. Make it write-only, like e.g. compact_memory. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Generic tools such as "sysctl -a" will be fine as they already are fine with compact_memory being 0200, and if somebody has a specific script, they are maybe already making wrong decisions, so at least hopefully they will learn and adjust. > --- > kernel/sysctl.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 31ece1120aa4..50373984a5e2 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = { > .procname = "drop_caches", > .data = &sysctl_drop_caches, > .maxlen = sizeof(int), > - .mode = 0644, > + .mode = 0200, > .proc_handler = drop_caches_sysctl_handler, > .extra1 = SYSCTL_ONE, > .extra2 = &four, >
On Thu 31-10-19 18:16:02, Johannes Weiner wrote: > Currently, the drop_caches proc file and sysctl read back the last > value written, suggesting this is somehow a stateful setting instead > of a one-time command. Make it write-only, like e.g. compact_memory. > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Please add a note about the confusion this has caused already. The link posted by Chris would be useful as well. Acked-by: Michal Hocko <mhocko@suse.com> > --- > kernel/sysctl.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 31ece1120aa4..50373984a5e2 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = { > .procname = "drop_caches", > .data = &sysctl_drop_caches, > .maxlen = sizeof(int), > - .mode = 0644, > + .mode = 0200, > .proc_handler = drop_caches_sysctl_handler, > .extra1 = SYSCTL_ONE, > .extra2 = &four, > -- > 2.23.0
diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 31ece1120aa4..50373984a5e2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = { .procname = "drop_caches", .data = &sysctl_drop_caches, .maxlen = sizeof(int), - .mode = 0644, + .mode = 0200, .proc_handler = drop_caches_sysctl_handler, .extra1 = SYSCTL_ONE, .extra2 = &four,
Currently, the drop_caches proc file and sysctl read back the last value written, suggesting this is somehow a stateful setting instead of a one-time command. Make it write-only, like e.g. compact_memory. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- kernel/sysctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)