Message ID | 87lhg8pwvz.fsf@x220.int.ebiederm.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >> Quoting Andy Lutomirski (luto@amacapital.net): >>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >>> <ebiederm@xmission.com> wrote: >>> > I had hoped to get some Tested-By's on that patch series. >>> >>> Sorry, I've been totally swamped. >>> >>> I suspect that Sandstorm is okay, but I haven't had a chance to test >>> it for real. Sandstorm makes only limited use of proc and sysfs in >>> containers, but I'll see if I can test it for real this weekend. >> >> Testing this with unprivileged containers, I get >> >> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >> - error mounting sysfs on >> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > > Grr.. I was afraid this would break something. :( > > Looking at my system I see that sysfs is currently mounted > "nosuid,nodev,noexec" > > Looking at the lxc-start code I don't see it as including any of those > mount options. In practice for sysfs I think those options are > meaningless (as there should be no devices and nothing executable in > sysfs) but I can understand the past concerns with chmod on virtual > filesystems that would incline people to use them, so I think the > failure is reporting a legitimate security issue in the lxc userspace > code where the the unprivileged code is currently attempting to give > greater access to sysfs than was given by the original mount of sysfs. > > As nosuid,nodev,noexec should not impair the operation of sysfs > operation it looks like you can always specify those options and just > make this concern go away. Linus is pretty strict about not breaking the ABI, and this definitely counts as breaking the ABI. There's an exception for security issues, but is there really a security issue here? That is, do we lose anything important if we just drop the offending part of the patch set? As you've said, there shouldn't be sensitive device nodes, executables, or setuid files in proc or sysfs in the first place. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote: > On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman > <ebiederm@xmission.com> wrote: >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >> >>> Quoting Andy Lutomirski (luto@amacapital.net): >>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >>>> <ebiederm@xmission.com> wrote: >>>> > I had hoped to get some Tested-By's on that patch series. >>>> >>>> Sorry, I've been totally swamped. >>>> >>>> I suspect that Sandstorm is okay, but I haven't had a chance to test >>>> it for real. Sandstorm makes only limited use of proc and sysfs in >>>> containers, but I'll see if I can test it for real this weekend. >>> >>> Testing this with unprivileged containers, I get >>> >>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >>> - error mounting sysfs on >>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 >> >> Grr.. I was afraid this would break something. :( >> >> Looking at my system I see that sysfs is currently mounted >> "nosuid,nodev,noexec" >> >> Looking at the lxc-start code I don't see it as including any of those >> mount options. In practice for sysfs I think those options are >> meaningless (as there should be no devices and nothing executable in >> sysfs) but I can understand the past concerns with chmod on virtual >> filesystems that would incline people to use them, so I think the >> failure is reporting a legitimate security issue in the lxc userspace >> code where the the unprivileged code is currently attempting to give >> greater access to sysfs than was given by the original mount of sysfs. >> >> As nosuid,nodev,noexec should not impair the operation of sysfs >> operation it looks like you can always specify those options and just >> make this concern go away. > > Linus is pretty strict about not breaking the ABI, and this definitely > counts as breaking the ABI. There's an exception for security issues, > but is there really a security issue here? That is, do we lose > anything important if we just drop the offending part of the patch > set? As you've said, there shouldn't be sensitive device nodes, > executables, or setuid files in proc or sysfs in the first place. Speaking as a user of the mount() interfaces, I really think it would be less confusing overall if mount() simply ignored the requested flags when the caller doesn't have a choice. That is, in cases where mount() currently fails with EPERM when not given, say, MS_NOSUID, it should instead just pretend the caller actually set MS_NOSUID and go ahead with a nosuid mount. Or put another way, the absence of MS_NOSUID should not be interpreted as "remove the nosuid bit" but rather "don't set the nosuid bit if not required". Consider: - This approach will actually cause lxc to have the correct behavior, without any changes to lxc. I suspect that this generalizes: In the vast majority of cases, when users have failed to set MS_NOSUID, it's not because they are explicitly requesting that the flag be turned off, but rather that they didn't know it mattered. - If a user actually *does* expect not passing MS_NOSUID to remove the nosuid bit, and they find instead that the nosuid bit is silently kept, I don't think they'll be confused: it's pretty obvious in context that this must be for security reasons. - On the other hand, the current behavior *is* very confusing: mount() returns EPERM because of rules the caller probably doesn't know anything about. I've spent a fair amount of time frustrated by this sort of thing. -Kenton -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Kenton Varda <kenton@sandstorm.io> writes: > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote: >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman >> <ebiederm@xmission.com> wrote: >>> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >>> >>>> Quoting Andy Lutomirski (luto@amacapital.net): >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >>>>> <ebiederm@xmission.com> wrote: >>>>> > I had hoped to get some Tested-By's on that patch series. >>>>> >>>>> Sorry, I've been totally swamped. >>>>> >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test >>>>> it for real. Sandstorm makes only limited use of proc and sysfs in >>>>> containers, but I'll see if I can test it for real this weekend. >>>> >>>> Testing this with unprivileged containers, I get >>>> >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >>>> - error mounting sysfs on >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 >>> >>> Grr.. I was afraid this would break something. :( >>> >>> Looking at my system I see that sysfs is currently mounted >>> "nosuid,nodev,noexec" >>> >>> Looking at the lxc-start code I don't see it as including any of those >>> mount options. In practice for sysfs I think those options are >>> meaningless (as there should be no devices and nothing executable in >>> sysfs) but I can understand the past concerns with chmod on virtual >>> filesystems that would incline people to use them, so I think the >>> failure is reporting a legitimate security issue in the lxc userspace >>> code where the the unprivileged code is currently attempting to give >>> greater access to sysfs than was given by the original mount of sysfs. >>> >>> As nosuid,nodev,noexec should not impair the operation of sysfs >>> operation it looks like you can always specify those options and just >>> make this concern go away. >> >> Linus is pretty strict about not breaking the ABI, and this definitely >> counts as breaking the ABI. There's an exception for security issues, >> but is there really a security issue here? That is, do we lose >> anything important if we just drop the offending part of the patch >> set? As you've said, there shouldn't be sensitive device nodes, >> executables, or setuid files in proc or sysfs in the first place. We do need to enforce retaining the existing mount flags one way or another. Where this really matters is with MS_RDONLY. We don't want any old user to be able to mount /proc read-write when root mounted it read-only. There is a very real attack vector there. That attack almost works in docker container today and is avoided simply because docker mounts over a few files on proc. Which leads to the second side of the reason for these changes. I am fixing a very small but long standing ABI break. That is in some small ways I broke some sandboxes and when I realized they were broken I could not imagine think how to fix the code until now. It is the goal that user namespaces don't introduce anything for people to worry about security wise more than simply the ability to execute more kernel code. So at least when the kernel implementation is correct developers of existing applications simply do not need care. Sadly we are not quite there yet. > Speaking as a user of the mount() interfaces, I really think it would > be less confusing overall if mount() simply ignored the requested > flags when the caller doesn't have a choice. That is, in cases where > mount() currently fails with EPERM when not given, say, MS_NOSUID, it > should instead just pretend the caller actually set MS_NOSUID and go > ahead with a nosuid mount. Or put another way, the absence of > MS_NOSUID should not be interpreted as "remove the nosuid bit" but > rather "don't set the nosuid bit if not required". I am conflicted. Implicits are nice but confusing. If we can do something reliable and robust and maintainable here that is truly worth the cost I am all for it. If I mount proc read-write I likely want to be able to write to proc files, and I will be much happier if the mount fails than if a bazillion syscalls later something else fails when it tries to write to proc. > Consider: > > - This approach will actually cause lxc to have the correct behavior, > without any changes to lxc. I suspect that this generalizes: In the > vast majority of cases, when users have failed to set MS_NOSUID, it's > not because they are explicitly requesting that the flag be turned > off, but rather that they didn't know it mattered. > > - If a user actually *does* expect not passing MS_NOSUID to remove the > nosuid bit, and they find instead that the nosuid bit is silently > kept, I don't think they'll be confused: it's pretty obvious in > context that this must be for security reasons. > > - On the other hand, the current behavior *is* very confusing: mount() > returns EPERM because of rules the caller probably doesn't know > anything about. I've spent a fair amount of time frustrated by this > sort of thing. My sympathies. This all started with an oh crap we overlooked corner case X and it actually matters, and the fixes were quite likely a little bit hasty. The only case where this really shows up is remount insode of a user namespace of filesystems that were mounted outside of the user namespace is where this all actually matters. And mounting new instances of proc and sysfs wind up being weird instances of that nonsense. But please someone test sandstorm with this patchset and tell me if it bites you. The impetus to find a way to avoid breaking slightly buggy userspace is higher if it is more than unprivileged lxc that is broken. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 12:14 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > But please someone test sandstorm with this patchset and tell me if it > bites you. The impetus to find a way to avoid breaking slightly buggy > userspace is higher if it is more than unprivileged lxc that is broken. One of these days I'm going to learn how to compile and test kernels again (last time I did it was 1999). Unfortunately I don't think I have time at the moment, but hopefully Andy can do it. I note, though, that we only have two mount() calls in the sandstorm codebase that seem like they could be affected: run-bundle.c++:1264: KJ_SYSCALL(mount("proc", "proc", "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, "")); minibox.c++:251: KJ_SYSCALL(mount("proc", vpath.cStr(), "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, ""), supervisor.c++:921: KJ_SYSCALL(mount("/proc", "proc", nullptr, MS_BIND | MS_REC, nullptr)); The first two seem like they should be fine since they set all the flags (except readonly, which would be inappropriate for proc). I guess my habit of setting every security flag I see came in handy. The third case looks like it will be broken, BUT this line is in a debug-only code path, so I don't care. Also we have the ability to push any needed update within 24 hours, so we're generally in good shape. We never mount sysfs in Sandstorm. > If I mount proc read-write I likely want to be able to write to proc > files, and I will be much happier if the mount fails than if a bazillion > syscalls later something else fails when it tries to write to proc. I'm not sure that's true. Consider the broader context: 1) Your system's /proc is mounted read-only. 2) Now you're trying to mount a new proc in a new pid namespace, and you do *not* specify MS_READONLY. What should we expect here? Let's back off a bit and state user intent: 1) The system administrator has set a system-wide policy that /proc may only be read, not written. 2) You made a PID namespace and it needed its own proc. It seems intuitive here that the administrator's policy should apply in the namespace. Certainly everyone using the system and/or all software on the system already needs to be aware of this policy, since it's unusual and will break things. Running software on this system outside of any container already has the problem that syscalls randomly break, so why should it be surprising when this happens inside the container as well? Why do we need to go out of our way to break at mount() time? -Kenton -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am 28.05.2015 um 22:12 schrieb Kenton Varda:
> We never mount sysfs in Sandstorm.
sysfs is ABI and applications depend on it.
Even glibc is using sysfs. Currently it has
fallback paths but these may go away...
Thanks,
//richard
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote: > Serge Hallyn <serge.hallyn@ubuntu.com> writes: > > > Quoting Andy Lutomirski (luto@amacapital.net): > >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > >> <ebiederm@xmission.com> wrote: > >> > I had hoped to get some Tested-By's on that patch series. > >> > >> Sorry, I've been totally swamped. > >> > >> I suspect that Sandstorm is okay, but I haven't had a chance to test > >> it for real. Sandstorm makes only limited use of proc and sysfs in > >> containers, but I'll see if I can test it for real this weekend. > > > > Testing this with unprivileged containers, I get > > > > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > > - error mounting sysfs on > > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > > Grr.. I was afraid this would break something. :( > > Looking at my system I see that sysfs is currently mounted > "nosuid,nodev,noexec" > > Looking at the lxc-start code I don't see it as including any of those > mount options. In practice for sysfs I think those options are > meaningless (as there should be no devices and nothing executable in > sysfs) but I can understand the past concerns with chmod on virtual > filesystems that would incline people to use them, so I think the > failure is reporting a legitimate security issue in the lxc userspace > code where the the unprivileged code is currently attempting to give > greater access to sysfs than was given by the original mount of sysfs. > > As nosuid,nodev,noexec should not impair the operation of sysfs > operation it looks like you can always specify those options and just > make this concern go away. > > Something like the untested patch below I expect. > > diff --git a/src/lxc/conf.c b/src/lxc/conf.c > index 9870455b3cae..d9ccd03afe68 100644 > --- a/src/lxc/conf.c > +++ b/src/lxc/conf.c > @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha > { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL }, > { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, > { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL }, > - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL }, > + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL }, > { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL }, > { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, fwiw - the first one works, the second one does not due to an apparent inability to statvfs the origin. > Alternately you can read the flags off of the original mount of proc or sysfs. > > diff --git a/src/lxc/conf.c b/src/lxc/conf.c > index 9870455b3cae..50ea49973e80 100644 > --- a/src/lxc/conf.c > +++ b/src/lxc/conf.c > @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d, > struct statvfs sb; > unsigned long required_flags = 0; > > - if (!(flags & MS_REMOUNT)) > + if (!(flags & MS_REMOUNT) && > + (strcmp(s, "proc") != 0) && > + (strcmp(s, "sysfs") != 0)) > return flags; > > if (!s) > > Eric > _______________________________________________ > Containers mailing list > Containers@lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/containers -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard@nod.at> wrote: > Am 28.05.2015 um 22:12 schrieb Kenton Varda: >> We never mount sysfs in Sandstorm. > > sysfs is ABI and applications depend on it. > Even glibc is using sysfs. Currently it has > fallback paths but these may go away... Off-topic, but Sandstorm isn't intended to provide a full Linux ABI. It is intended to provide a secure sandbox that can run apps that have been explicitly ported to Sandstorm. More background if you're interested: https://github.com/sandstorm-io/sandstorm/wiki/Security-Practices-Overview#server-sandboxing https://blog.sandstorm.io/news/2014-08-13-sandbox-security.html -Kenton -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Am 28.05.2015 um 23:07 schrieb Kenton Varda: > On Thu, May 28, 2015 at 1:47 PM, Richard Weinberger <richard@nod.at> wrote: >> Am 28.05.2015 um 22:12 schrieb Kenton Varda: >>> We never mount sysfs in Sandstorm. >> >> sysfs is ABI and applications depend on it. >> Even glibc is using sysfs. Currently it has >> fallback paths but these may go away... > > Off-topic, but Sandstorm isn't intended to provide a full Linux ABI. > It is intended to provide a secure sandbox that can run apps that have > been explicitly ported to Sandstorm. More background if you're interested: Ahh, the application needs to be Sandstorm aware. I was missing that detail. Thanks for pointing that out! Thanks, //richard -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
"Serge E. Hallyn" <serge@hallyn.com> writes: > On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote: >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: >> >> > Quoting Andy Lutomirski (luto@amacapital.net): >> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman >> >> <ebiederm@xmission.com> wrote: >> >> > I had hoped to get some Tested-By's on that patch series. >> >> >> >> Sorry, I've been totally swamped. >> >> >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test >> >> it for real. Sandstorm makes only limited use of proc and sysfs in >> >> containers, but I'll see if I can test it for real this weekend. >> > >> > Testing this with unprivileged containers, I get >> > >> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted >> > - error mounting sysfs on >> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 >> >> Grr.. I was afraid this would break something. :( >> >> Looking at my system I see that sysfs is currently mounted >> "nosuid,nodev,noexec" >> >> Looking at the lxc-start code I don't see it as including any of those >> mount options. In practice for sysfs I think those options are >> meaningless (as there should be no devices and nothing executable in >> sysfs) but I can understand the past concerns with chmod on virtual >> filesystems that would incline people to use them, so I think the >> failure is reporting a legitimate security issue in the lxc userspace >> code where the the unprivileged code is currently attempting to give >> greater access to sysfs than was given by the original mount of sysfs. >> >> As nosuid,nodev,noexec should not impair the operation of sysfs >> operation it looks like you can always specify those options and just >> make this concern go away. >> >> Something like the untested patch below I expect. >> >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c >> index 9870455b3cae..d9ccd03afe68 100644 >> --- a/src/lxc/conf.c >> +++ b/src/lxc/conf.c >> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL }, >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL }, >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL }, >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL }, >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL }, >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, > > fwiw - the first one works, the second one does not due to an apparent > inability to statvfs the origin. Good to hear. That confirms there are no other gotchas waiting in the wings. Apparently my second suggested patch is buggy due to an invalid source string. The source would need to be %r/proc or %r/sysfs to use statvfs productively. >> Alternately you can read the flags off of the original mount of proc or sysfs. >> >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c >> index 9870455b3cae..50ea49973e80 100644 >> --- a/src/lxc/conf.c >> +++ b/src/lxc/conf.c >> @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d, >> struct statvfs sb; >> unsigned long required_flags = 0; >> >> - if (!(flags & MS_REMOUNT)) >> + if (!(flags & MS_REMOUNT) && >> + (strcmp(s, "proc") != 0) && >> + (strcmp(s, "sysfs") != 0)) >> return flags; >> >> if (!s) >> >> Eric >> _______________________________________________ >> Containers mailing list >> Containers@lists.linux-foundation.org >> https://lists.linuxfoundation.org/mailman/listinfo/containers -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 04:42:34PM -0500, Eric W. Biederman wrote: > "Serge E. Hallyn" <serge@hallyn.com> writes: > > > On Thu, May 28, 2015 at 10:03:28AM -0500, Eric W. Biederman wrote: > >> Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >> > >> > Quoting Andy Lutomirski (luto@amacapital.net): > >> >> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > >> >> <ebiederm@xmission.com> wrote: > >> >> > I had hoped to get some Tested-By's on that patch series. > >> >> > >> >> Sorry, I've been totally swamped. > >> >> > >> >> I suspect that Sandstorm is okay, but I haven't had a chance to test > >> >> it for real. Sandstorm makes only limited use of proc and sysfs in > >> >> containers, but I'll see if I can test it for real this weekend. > >> > > >> > Testing this with unprivileged containers, I get > >> > > >> > lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > >> > - error mounting sysfs on > >> > /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > >> > >> Grr.. I was afraid this would break something. :( > >> > >> Looking at my system I see that sysfs is currently mounted > >> "nosuid,nodev,noexec" > >> > >> Looking at the lxc-start code I don't see it as including any of those > >> mount options. In practice for sysfs I think those options are > >> meaningless (as there should be no devices and nothing executable in > >> sysfs) but I can understand the past concerns with chmod on virtual > >> filesystems that would incline people to use them, so I think the > >> failure is reporting a legitimate security issue in the lxc userspace > >> code where the the unprivileged code is currently attempting to give > >> greater access to sysfs than was given by the original mount of sysfs. > >> > >> As nosuid,nodev,noexec should not impair the operation of sysfs > >> operation it looks like you can always specify those options and just > >> make this concern go away. > >> > >> Something like the untested patch below I expect. > >> > >> diff --git a/src/lxc/conf.c b/src/lxc/conf.c > >> index 9870455b3cae..d9ccd03afe68 100644 > >> --- a/src/lxc/conf.c > >> +++ b/src/lxc/conf.c > >> @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha > >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL }, > >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, > >> { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL }, > >> - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL }, > >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > >> + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL }, > >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, > >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL }, > >> { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, > > > > fwiw - the first one works, the second one does not due to an apparent > > inability to statvfs the origin. > > Good to hear. That confirms there are no other gotchas waiting in the > wings. > > Apparently my second suggested patch is buggy due to an invalid source > string. The source would need to be %r/proc or %r/sysfs to use statvfs > productively. Right, in these cases they're only passing in "sysfs". The first way is more explicit anyway (though may not help some people who have a "lxc.mount.entry = sysfs sys sysfs ro 0 0" line in their configuration instead, so maybe we'll have to go with the second after all, d'oh. I'll have to look into it next week) -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[resend due to HTML. Sorry.] On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote: > > Kenton Varda <kenton@sandstorm.io> writes: > > > On Thu, May 28, 2015 at 10:33 AM, Andy Lutomirski <luto@amacapital.net> wrote: > >> On Thu, May 28, 2015 at 8:03 AM, Eric W. Biederman > >> <ebiederm@xmission.com> wrote: > >>> Serge Hallyn <serge.hallyn@ubuntu.com> writes: > >>> > >>>> Quoting Andy Lutomirski (luto@amacapital.net): > >>>>> On Fri, May 22, 2015 at 10:39 AM, Eric W. Biederman > >>>>> <ebiederm@xmission.com> wrote: > >>>>> > I had hoped to get some Tested-By's on that patch series. > >>>>> > >>>>> Sorry, I've been totally swamped. > >>>>> > >>>>> I suspect that Sandstorm is okay, but I haven't had a chance to test > >>>>> it for real. Sandstorm makes only limited use of proc and sysfs in > >>>>> containers, but I'll see if I can test it for real this weekend. > >>>> > >>>> Testing this with unprivileged containers, I get > >>>> > >>>> lxc-start: conf.c: lxc_mount_auto_mounts: 808 Operation not permitted > >>>> - error mounting sysfs on > >>>> /usr/lib/x86_64-linux-gnu/lxc/sys/devices/virtual/net flags 0 > >>> > >>> Grr.. I was afraid this would break something. :( > >>> > >>> Looking at my system I see that sysfs is currently mounted > >>> "nosuid,nodev,noexec" > >>> > >>> Looking at the lxc-start code I don't see it as including any of those > >>> mount options. In practice for sysfs I think those options are > >>> meaningless (as there should be no devices and nothing executable in > >>> sysfs) but I can understand the past concerns with chmod on virtual > >>> filesystems that would incline people to use them, so I think the > >>> failure is reporting a legitimate security issue in the lxc userspace > >>> code where the the unprivileged code is currently attempting to give > >>> greater access to sysfs than was given by the original mount of sysfs. > >>> > >>> As nosuid,nodev,noexec should not impair the operation of sysfs > >>> operation it looks like you can always specify those options and just > >>> make this concern go away. > >> > >> Linus is pretty strict about not breaking the ABI, and this definitely > >> counts as breaking the ABI. There's an exception for security issues, > >> but is there really a security issue here? That is, do we lose > >> anything important if we just drop the offending part of the patch > >> set? As you've said, there shouldn't be sensitive device nodes, > >> executables, or setuid files in proc or sysfs in the first place. > > We do need to enforce retaining the existing mount flags one way or > another. Where this really matters is with MS_RDONLY. We don't want > any old user to be able to mount /proc read-write when root mounted it > read-only. There is a very real attack vector there. That attack > almost works in docker container today and is avoided simply because > docker mounts over a few files on proc. You could drop the nosuid, noexec, and nodev changes and keep just the ro part. The ro part is probably not an ABI break in the sense of something that actually breaks real programs. > > Which leads to the second side of the reason for these changes. I am > fixing a very small but long standing ABI break. That is in some small > ways I broke some sandboxes and when I realized they were broken I could > not imagine think how to fix the code until now. > > It is the goal that user namespaces don't introduce anything for people > to worry about security wise more than simply the ability to execute > more kernel code. So at least when the kernel implementation is correct > developers of existing applications simply do not need care. Sadly we are > not quite there yet. > > > Speaking as a user of the mount() interfaces, I really think it would > > be less confusing overall if mount() simply ignored the requested > > flags when the caller doesn't have a choice. That is, in cases where > > mount() currently fails with EPERM when not given, say, MS_NOSUID, it > > should instead just pretend the caller actually set MS_NOSUID and go > > ahead with a nosuid mount. Or put another way, the absence of > > MS_NOSUID should not be interpreted as "remove the nosuid bit" but > > rather "don't set the nosuid bit if not required". > > I am conflicted. Implicits are nice but confusing. If we can do > something reliable and robust and maintainable here that is truly worth > the cost I am all for it. > > If I mount proc read-write I likely want to be able to write to proc > files, and I will be much happier if the mount fails than if a bazillion > syscalls later something else fails when it tries to write to proc. I agree. I don't like the implicit thing. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andy Lutomirski <luto@amacapital.net> writes: > On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote: >> Kenton Varda <kenton@sandstorm.io> writes: >> >> We do need to enforce retaining the existing mount flags one way or >> another. Where this really matters is with MS_RDONLY. We don't want >> any old user to be able to mount /proc read-write when root mounted it >> read-only. There is a very real attack vector there. That attack >> almost works in docker container today and is avoided simply because >> docker mounts over a few files on proc. > > You could drop the nosuid, noexec, and nodev changes and keep just the > ro part. The ro part is probably not an ABI break in the sense of > something that actually breaks real programs. As a change simply removing the code from the existing patches that worries about nosuid, noexec, and the nodev flags is certainly doable. It is the best proposal I have heard so far. I remain unconvinced about ignoring those flags: - There are clearly people who think it matters (or else proc and sysfs would not have those flags specified). - There have been times when it actually has mattered. Aka when files like /proc/self/env could be chmodded and used for privilege escalation. - The code in lxc and libvirt-lxc so far has been clearly buggy. * lxc only has problems with sysfs (in some configurations). * libvirt-lxc only has problems on a bind mount remount of proc after remounting proc properly. So I am leaning towards enforcing all of the mount flags including nosuid, noexec, and nodev. Then when the next subtle bug in proc or sysfs with respect to chmod shows up I will be able to sleep soundly at night because the mount flags of those filesystems allow a mitigation, and I did not sabatage the mitigation. Plus contemplating code that just enforces a couple of mount flags but not all of the feels wrong. I don't think it is actually a maintainable position to just enforce a couple of those flags. If nothing else I would expect someone to look at the code and to generate a bug fix to start enforcing the rest of the flags. Or perhaps it is in a few years time and something gets refactored and the enforcing starts happening by virtue of using a new common function that no-one realizes will be a problem. Additionally if we don't enforce nosuid, noexec, and nodev people are going to ask questions, that will be hard to explain. When what is truly desirable is to say that sysfs and proc are a little odd but they don't allow anything that a bind mount won't. I can be persuaded otherwise but right now I do think the kernel code needs to enforce nosuid, noexec, and nodev as it is a security issue (if only a defence in depth one), and a maintenance issue as I do not believe in the long term it is a maintanable or an explicable position. >> > Speaking as a user of the mount() interfaces, I really think it would >> > be less confusing overall if mount() simply ignored the requested >> > flags when the caller doesn't have a choice. That is, in cases where >> > mount() currently fails with EPERM when not given, say, MS_NOSUID, it >> > should instead just pretend the caller actually set MS_NOSUID and go >> > ahead with a nosuid mount. Or put another way, the absence of >> > MS_NOSUID should not be interpreted as "remove the nosuid bit" but >> > rather "don't set the nosuid bit if not required". >> >> I am conflicted. Implicits are nice but confusing. If we can do >> something reliable and robust and maintainable here that is truly worth >> the cost I am all for it. >> >> If I mount proc read-write I likely want to be able to write to proc >> files, and I will be much happier if the mount fails than if a bazillion >> syscalls later something else fails when it tries to write to proc. > > I agree. I don't like the implicit thing. My memory returns of our last round of looking at this and for whatever it's warts the existing mount API for remounting filesystems needs to have the flags have exactly the same meaning as at mount time. There are existing userspace applications that depend on that behavior. Implicits for only the locked mount flags is a little different but still ick. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Implicits for only the locked mount flags is a little different but > still ick. FWIW, I only ever meant to advocate for this for locked flags, i.e. cases where the only other option is to throw EPERM. Clearly when the user has permission, the exact requested flags should be applied, or all kinds of things break. It seems to me that if we can fix the security issue without breaking userspace, we should. Sometimes we end up with icky APIs to avoid breaking userspace. (Though IMO implicitly preserving locked bits is not icky at all.) -Kenton -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, May 28, 2015 at 9:36 PM, Eric W. Biederman <ebiederm@xmission.com> wrote: > Andy Lutomirski <luto@amacapital.net> writes: >> On May 28, 2015 12:19 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote: >>> Kenton Varda <kenton@sandstorm.io> writes: >>> >>> We do need to enforce retaining the existing mount flags one way or >>> another. Where this really matters is with MS_RDONLY. We don't want >>> any old user to be able to mount /proc read-write when root mounted it >>> read-only. There is a very real attack vector there. That attack >>> almost works in docker container today and is avoided simply because >>> docker mounts over a few files on proc. >> >> You could drop the nosuid, noexec, and nodev changes and keep just the >> ro part. The ro part is probably not an ABI break in the sense of >> something that actually breaks real programs. > > As a change simply removing the code from the existing patches that > worries about nosuid, noexec, and the nodev flags is certainly doable. > It is the best proposal I have heard so far. > > I remain unconvinced about ignoring those flags: > - There are clearly people who think it matters (or else proc and sysfs > would not have those flags specified). > > - There have been times when it actually has mattered. > Aka when files like /proc/self/env could be chmodded and used for > privilege escalation. > > - The code in lxc and libvirt-lxc so far has been clearly buggy. > * lxc only has problems with sysfs (in some configurations). > * libvirt-lxc only has problems on a bind mount remount of > proc after remounting proc properly. > > So I am leaning towards enforcing all of the mount flags including > nosuid, noexec, and nodev. Then when the next subtle bug in proc or > sysfs with respect to chmod shows up I will be able to sleep soundly at > night because the mount flags of those filesystems allow a mitigation, > and I did not sabatage the mitigation. One option would be to break the nosuid, nodev, and noexec parts into their own patch and then avoid tagging that patch for -stable if at all possible. It would be nice to avoid another -stable ABI break if at all possible. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Andy Lutomirski <luto@amacapital.net> writes: > One option would be to break the nosuid, nodev, and noexec parts into > their own patch and then avoid tagging that patch for -stable if at > all possible. It would be nice to avoid another -stable ABI break if > at all possible. So I don't think we actually have anything that could be called an ABI break in the whole mess, but it is definitely a behavioral change that is a regression for lxc and libvirt-lxc that prevents them from starting. nodev does not actually matter because of the implicit silliness that is being added right now. We do want those programs fixed and after those programs are fixed we can safely begin failing mount when those attributes are being cleared in a fresh mount. So it looks to me like the best thing to do is to print a warning whenever lxc or libvirt-lxc gets it wrong, which should ensure the authors are sufficiently pestered that in a kernel release or 3 we can begin enforcing those attributes. Especially as the discussion on the fix for those applications has already begun. And if folks would double check the patch I am going to post in a moment to ensure that lxc and libvirt-lxc continue to start I would appreciate it. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: > Andy Lutomirski <luto@amacapital.net> writes: > > > One option would be to break the nosuid, nodev, and noexec parts into > > their own patch and then avoid tagging that patch for -stable if at > > all possible. It would be nice to avoid another -stable ABI break if > > at all possible. > > So I don't think we actually have anything that could be called an ABI > break in the whole mess, but it is definitely a behavioral change that > is a regression for lxc and libvirt-lxc that prevents them from starting. > > nodev does not actually matter because of the implicit silliness that > is being added right now. > > We do want those programs fixed and after those programs are fixed we > can safely begin failing mount when those attributes are being cleared > in a fresh mount. > > So it looks to me like the best thing to do is to print a warning > whenever lxc or libvirt-lxc gets it wrong, which should ensure the > authors are sufficiently pestered that in a kernel release or 3 we can > begin enforcing those attributes. Especially as the discussion on the > fix for those applications has already begun. "pestering" never works, look at some of the SCSI drivers for examples of how a distro will just patch out the "warning this driver is using an old api and needs to be fixed" messages. You can't break stuff like this, people will get upset :( greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: >> Andy Lutomirski <luto@amacapital.net> writes: >> >> > One option would be to break the nosuid, nodev, and noexec parts into >> > their own patch and then avoid tagging that patch for -stable if at >> > all possible. It would be nice to avoid another -stable ABI break if >> > at all possible. >> >> So I don't think we actually have anything that could be called an ABI >> break in the whole mess, but it is definitely a behavioral change that >> is a regression for lxc and libvirt-lxc that prevents them from starting. >> >> nodev does not actually matter because of the implicit silliness that >> is being added right now. >> >> We do want those programs fixed and after those programs are fixed we >> can safely begin failing mount when those attributes are being cleared >> in a fresh mount. >> >> So it looks to me like the best thing to do is to print a warning >> whenever lxc or libvirt-lxc gets it wrong, which should ensure the >> authors are sufficiently pestered that in a kernel release or 3 we can >> begin enforcing those attributes. Especially as the discussion on the >> fix for those applications has already begun. > > "pestering" never works, look at some of the SCSI drivers for examples > of how a distro will just patch out the "warning this driver is using an > old api and needs to be fixed" messages. > You can't break stuff like this, people will get upset :( A) To the best of my knowledge there are two programs on the face of the planet where this matters. (lxc and libvirt-lxc) B) The code in those two programs is buggy. That is the code in those two programs does not do what the authors intended. That is fixing those programs is something that should be done regardless of what I do in the kernel. I have already reached out to the developers of those programs. The pestering in the kernel is a form of reminder, not the primary source of communication. C) These bugs really are security holes. Currently they do not appear exploitable (thank goodness) but they are security holes. Since they are not currently exploitable it does make sense to give people a little time to get their act together. The bugs are larger then the case that is being hit here, this is just where they are noticed. D) Letting people know that there is a problem as part of a larger effort has actually worked for me. Distro's have stopped enabling the sysctl system call. E) Given that I have not audited sysfs and proc closely in recent years I may actually be wrong. Those bugs may actually be exploitable. All it takes is chmod to be supported on one file that can be made executable. That bug has existed in the past and I don't doubt someone will overlook something and we will see the bug again in the future. So it is my best judgment that I disable the code that stops containers from starting and just making it a warning (for now). Then in a release or so I start failing these operations instead of warning. This is the most fair and reasonable I can see to be. The only other choice I can see is to say I don't care it is a security issue I am breaking your sloopy insecure code. Am I being too nice with these security bugs? Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ebiederm@xmission.com (Eric W. Biederman) writes: > Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > >> On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: >>> Andy Lutomirski <luto@amacapital.net> writes: >>> >>> > One option would be to break the nosuid, nodev, and noexec parts into >>> > their own patch and then avoid tagging that patch for -stable if at >>> > all possible. It would be nice to avoid another -stable ABI break if >>> > at all possible. >>> >>> So I don't think we actually have anything that could be called an ABI >>> break in the whole mess, but it is definitely a behavioral change that >>> is a regression for lxc and libvirt-lxc that prevents them from starting. >>> >>> nodev does not actually matter because of the implicit silliness that >>> is being added right now. >>> >>> We do want those programs fixed and after those programs are fixed we >>> can safely begin failing mount when those attributes are being cleared >>> in a fresh mount. >>> >>> So it looks to me like the best thing to do is to print a warning >>> whenever lxc or libvirt-lxc gets it wrong, which should ensure the >>> authors are sufficiently pestered that in a kernel release or 3 we can >>> begin enforcing those attributes. Especially as the discussion on the >>> fix for those applications has already begun. >> >> "pestering" never works, look at some of the SCSI drivers for examples >> of how a distro will just patch out the "warning this driver is using an >> old api and needs to be fixed" messages. > >> You can't break stuff like this, people will get upset :( > > A) To the best of my knowledge there are two programs on the face of the > planet where this matters. (lxc and libvirt-lxc) > > B) The code in those two programs is buggy. That is the code in those > two programs does not do what the authors intended. That is fixing > those programs is something that should be done regardless of what > I do in the kernel. I have already reached out to the developers of > those programs. The pestering in the kernel is a form of reminder, > not the primary source of communication. > > C) These bugs really are security holes. Currently they do not appear > exploitable (thank goodness) but they are security holes. > > Since they are not currently exploitable it does make sense > to give people a little time to get their act together. > > The bugs are larger then the case that is being hit here, > this is just where they are noticed. > > D) Letting people know that there is a problem as part of a larger > effort has actually worked for me. Distro's have stopped enabling > the sysctl system call. > > E) Given that I have not audited sysfs and proc closely in recent years > I may actually be wrong. Those bugs may actually be exploitable. > All it takes is chmod to be supported on one file that can be made > executable. That bug has existed in the past and I don't doubt > someone will overlook something and we will see the bug again in the > future. > > So it is my best judgment that I disable the code that stops > containers from starting and just making it a warning (for now). > Then in a release or so I start failing these operations instead of > warning. > > This is the most fair and reasonable I can see to be. > > The only other choice I can see is to say I don't care it is a security > issue I am breaking your sloopy insecure code. > > Am I being too nice with these security bugs? Thinking about it a little more. There is a possibility that sometime in the future that someone will deliberately add a suid executable as a file in proc or sysfs and have a good reason for doing so. Some sysadmin or sandbox builder with special requirements may then disable suid and exec on proc because in their sandbox (not linux in general) having access to that executable is a bad thing. At which we have an exploitable security issue if nosuid and noexec are not enforced. Or in other words I am not smarter than the bad guys. This is a security issue. I can not ignore nosuid and noexec indefinitely. I have to make those cases fail at some point. At that point current unfixed versions of lxc and libvirt-lxc will break. A warning is the nicest I can imagine being. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 04, 2015 at 01:27:10AM -0500, Eric W. Biederman wrote: > Greg Kroah-Hartman <gregkh@linuxfoundation.org> writes: > > > On Wed, Jun 03, 2015 at 04:13:21PM -0500, Eric W. Biederman wrote: > >> Andy Lutomirski <luto@amacapital.net> writes: > >> > >> > One option would be to break the nosuid, nodev, and noexec parts into > >> > their own patch and then avoid tagging that patch for -stable if at > >> > all possible. It would be nice to avoid another -stable ABI break if > >> > at all possible. > >> > >> So I don't think we actually have anything that could be called an ABI > >> break in the whole mess, but it is definitely a behavioral change that > >> is a regression for lxc and libvirt-lxc that prevents them from starting. > >> > >> nodev does not actually matter because of the implicit silliness that > >> is being added right now. > >> > >> We do want those programs fixed and after those programs are fixed we > >> can safely begin failing mount when those attributes are being cleared > >> in a fresh mount. > >> > >> So it looks to me like the best thing to do is to print a warning > >> whenever lxc or libvirt-lxc gets it wrong, which should ensure the > >> authors are sufficiently pestered that in a kernel release or 3 we can > >> begin enforcing those attributes. Especially as the discussion on the > >> fix for those applications has already begun. > > > > "pestering" never works, look at some of the SCSI drivers for examples > > of how a distro will just patch out the "warning this driver is using an > > old api and needs to be fixed" messages. > > > You can't break stuff like this, people will get upset :( > > A) To the best of my knowledge there are two programs on the face of the > planet where this matters. (lxc and libvirt-lxc) > > B) The code in those two programs is buggy. That is the code in those > two programs does not do what the authors intended. That is fixing > those programs is something that should be done regardless of what > I do in the kernel. I have already reached out to the developers of > those programs. The pestering in the kernel is a form of reminder, > not the primary source of communication. > > C) These bugs really are security holes. Currently they do not appear > exploitable (thank goodness) but they are security holes. > > Since they are not currently exploitable it does make sense > to give people a little time to get their act together. > > The bugs are larger then the case that is being hit here, > this is just where they are noticed. > > D) Letting people know that there is a problem as part of a larger > effort has actually worked for me. Distro's have stopped enabling > the sysctl system call. > > E) Given that I have not audited sysfs and proc closely in recent years > I may actually be wrong. Those bugs may actually be exploitable. > All it takes is chmod to be supported on one file that can be made > executable. That bug has existed in the past and I don't doubt > someone will overlook something and we will see the bug again in the > future. > > So it is my best judgment that I disable the code that stops > containers from starting and just making it a warning (for now). > Then in a release or so I start failing these operations instead of > warning. > > This is the most fair and reasonable I can see to be. While I generally like & support the kernel standard that userspace must never be broken, as libvirt LXC maintainer I think what Eric proposes is acceptable from the libvirt POV. We'll get the fix into libvirt LXC in this month's release and backport it to our stable branches. So as long as there are a few months/releases grace period between this being a kernel warning and it turning into a hard error, libvirt users will have the fix already, or at least have it easily available to them. Regards, Daniel
diff --git a/src/lxc/conf.c b/src/lxc/conf.c index 9870455b3cae..d9ccd03afe68 100644 --- a/src/lxc/conf.c +++ b/src/lxc/conf.c @@ -770,8 +770,8 @@ static int lxc_mount_auto_mounts(struct lxc_conf *conf, int flags, struct lxc_ha { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, "%r/proc/sysrq-trigger", "%r/proc/sysrq-trigger", NULL, MS_BIND, NULL }, { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_MIXED, NULL, "%r/proc/sysrq-trigger", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, { LXC_AUTO_PROC_MASK, LXC_AUTO_PROC_RW, "proc", "%r/proc", "proc", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", 0, NULL }, - { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_RDONLY, NULL }, + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RW, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, + { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_RO, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID|MS_RDONLY, NULL }, { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "sysfs", "%r/sys", "sysfs", MS_NODEV|MS_NOEXEC|MS_NOSUID, NULL }, { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, "%r/sys", "%r/sys", NULL, MS_BIND, NULL }, { LXC_AUTO_SYS_MASK, LXC_AUTO_SYS_MIXED, NULL, "%r/sys", NULL, MS_REMOUNT|MS_BIND|MS_RDONLY, NULL }, Alternately you can read the flags off of the original mount of proc or sysfs. diff --git a/src/lxc/conf.c b/src/lxc/conf.c index 9870455b3cae..50ea49973e80 100644 --- a/src/lxc/conf.c +++ b/src/lxc/conf.c @@ -712,7 +712,9 @@ static unsigned long add_required_remount_flags(const char *s, const char *d, struct statvfs sb; unsigned long required_flags = 0; - if (!(flags & MS_REMOUNT)) + if (!(flags & MS_REMOUNT) && + (strcmp(s, "proc") != 0) && + (strcmp(s, "sysfs") != 0)) return flags; if (!s)