Message ID | 20200423002632.224776-3-dancol@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Control over userfaultfd kernel-fault handling | expand |
On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote: > +unprivileged_userfaultfd_user_mode_only > +======================================== > + > +This flag controls whether unprivileged users can use the userfaultfd > +system calls to handle page faults in kernel mode. If set to zero, > +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo > +unprivileged_userfaultfd above. If set to one, users without > +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd > +to succeed. Prohibiting use of userfaultfd for handling faults from > +kernel mode may make certain vulnerabilities more difficult > +to exploit. > + > +The default value is 0. If this is going to be added... I am thinking whether it should be easier to add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.: "0": unprivileged userfaultfd forbidden "1": unprivileged userfaultfd allowed (both user/kernel faults) "2": unprivileged userfaultfd allowed (only user faults) Because after all unprivileged_userfaultfd_user_mode_only will be meaningless (iiuc) if unprivileged_userfaultfd=0. The default value will also be the same as before ("1") then. Thanks,
On Wed, 6 May 2020 15:38:16 -0400 Peter Xu <peterx@redhat.com> wrote: > If this is going to be added... I am thinking whether it should be easier to > add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.: > > "0": unprivileged userfaultfd forbidden > "1": unprivileged userfaultfd allowed (both user/kernel faults) > "2": unprivileged userfaultfd allowed (only user faults) > > Because after all unprivileged_userfaultfd_user_mode_only will be meaningless > (iiuc) if unprivileged_userfaultfd=0. The default value will also be the same > as before ("1") then. It occurs to me to wonder whether this interface should also let an admin block *privileged* user from handling kernel-space faults? In a secure-boot/lockdown setting, this could be a hardening measure that keeps a (somewhat) restricted root user from expanding their privilege...? jon
On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote: > This sysctl can be set to either zero or one. When zero (the default) > the system lets all users call userfaultfd with or without > UFFD_USER_MODE_ONLY, modulo other access controls. When > unprivileged_userfaultfd_user_mode_only is set to one, users without > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API > will fail with EPERM. This facility allows administrators to reduce > the likelihood that an attacker with access to userfaultfd can delay > faulting kernel code to widen timing windows for other exploits. > > Signed-off-by: Daniel Colascione <dancol@google.com> The approach taken looks like a hard-coded security policy. For example, it won't be possible to set the sysctl knob in question on any sytem running kvm. So this is no good for any general purpose system. What's wrong with using a security policy for this instead? > --- > Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++ > fs/userfaultfd.c | 11 ++++++++++- > include/linux/userfaultfd_k.h | 1 + > kernel/sysctl.c | 9 +++++++++ > 4 files changed, 33 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > index 0329a4d3fa9e..4296b508ab74 100644 > --- a/Documentation/admin-guide/sysctl/vm.rst > +++ b/Documentation/admin-guide/sysctl/vm.rst > @@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability). > > The default value is 1. > > +unprivileged_userfaultfd_user_mode_only > +======================================== > + > +This flag controls whether unprivileged users can use the userfaultfd > +system calls to handle page faults in kernel mode. If set to zero, > +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo > +unprivileged_userfaultfd above. If set to one, users without > +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd > +to succeed. Prohibiting use of userfaultfd for handling faults from > +kernel mode may make certain vulnerabilities more difficult > +to exploit. > + > +The default value is 0. > > user_reserve_kbytes > =================== > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > index 21378abe8f7b..85cc1ab74361 100644 > --- a/fs/userfaultfd.c > +++ b/fs/userfaultfd.c > @@ -29,6 +29,7 @@ > #include <linux/hugetlb.h> > > int sysctl_unprivileged_userfaultfd __read_mostly = 1; > +int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0; > > static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly; > > @@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) > static const int uffd_flags = UFFD_USER_MODE_ONLY; > struct userfaultfd_ctx *ctx; > int fd; > + bool need_cap_check = false; > > - if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE)) > + if (!sysctl_unprivileged_userfaultfd) > + need_cap_check = true; > + > + if (sysctl_unprivileged_userfaultfd_user_mode_only && > + (flags & UFFD_USER_MODE_ONLY) == 0) > + need_cap_check = true; > + > + if (need_cap_check && !capable(CAP_SYS_PTRACE)) > return -EPERM; > > BUG_ON(!current->mm); > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h > index a8e5f3ea9bb2..d81e30074bf5 100644 > --- a/include/linux/userfaultfd_k.h > +++ b/include/linux/userfaultfd_k.h > @@ -31,6 +31,7 @@ > #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS) > > extern int sysctl_unprivileged_userfaultfd; > +extern int sysctl_unprivileged_userfaultfd_user_mode_only; > > extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason); > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 8a176d8727a3..9cbdf4483961 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = { > .extra1 = SYSCTL_ZERO, > .extra2 = SYSCTL_ONE, > }, > + { > + .procname = "unprivileged_userfaultfd_user_mode_only", > + .data = &sysctl_unprivileged_userfaultfd_user_mode_only, > + .maxlen = sizeof(sysctl_unprivileged_userfaultfd_user_mode_only), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_ONE, > + }, > #endif > { } > }; > -- > 2.26.2.303.gf8c07b1a785-goog >
On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote: > On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote: > > This sysctl can be set to either zero or one. When zero (the default) > > the system lets all users call userfaultfd with or without > > UFFD_USER_MODE_ONLY, modulo other access controls. When > > unprivileged_userfaultfd_user_mode_only is set to one, users without > > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API > > will fail with EPERM. This facility allows administrators to reduce > > the likelihood that an attacker with access to userfaultfd can delay > > faulting kernel code to widen timing windows for other exploits. > > > > Signed-off-by: Daniel Colascione <dancol@google.com> > > The approach taken looks like a hard-coded security policy. > For example, it won't be possible to set the sysctl knob > in question on any sytem running kvm. So this is > no good for any general purpose system. > > What's wrong with using a security policy for this instead? In fact I see the original thread already mentions selinux, so it's just a question of making this controllable by selinux. > > > > --- > > Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++ > > fs/userfaultfd.c | 11 ++++++++++- > > include/linux/userfaultfd_k.h | 1 + > > kernel/sysctl.c | 9 +++++++++ > > 4 files changed, 33 insertions(+), 1 deletion(-) > > > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > > index 0329a4d3fa9e..4296b508ab74 100644 > > --- a/Documentation/admin-guide/sysctl/vm.rst > > +++ b/Documentation/admin-guide/sysctl/vm.rst > > @@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability). > > > > The default value is 1. > > > > +unprivileged_userfaultfd_user_mode_only > > +======================================== > > + > > +This flag controls whether unprivileged users can use the userfaultfd > > +system calls to handle page faults in kernel mode. If set to zero, > > +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo > > +unprivileged_userfaultfd above. If set to one, users without > > +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd > > +to succeed. Prohibiting use of userfaultfd for handling faults from > > +kernel mode may make certain vulnerabilities more difficult > > +to exploit. > > + > > +The default value is 0. > > > > user_reserve_kbytes > > =================== > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > index 21378abe8f7b..85cc1ab74361 100644 > > --- a/fs/userfaultfd.c > > +++ b/fs/userfaultfd.c > > @@ -29,6 +29,7 @@ > > #include <linux/hugetlb.h> > > > > int sysctl_unprivileged_userfaultfd __read_mostly = 1; > > +int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0; > > > > static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly; > > > > @@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) > > static const int uffd_flags = UFFD_USER_MODE_ONLY; > > struct userfaultfd_ctx *ctx; > > int fd; > > + bool need_cap_check = false; > > > > - if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE)) > > + if (!sysctl_unprivileged_userfaultfd) > > + need_cap_check = true; > > + > > + if (sysctl_unprivileged_userfaultfd_user_mode_only && > > + (flags & UFFD_USER_MODE_ONLY) == 0) > > + need_cap_check = true; > > + > > + if (need_cap_check && !capable(CAP_SYS_PTRACE)) > > return -EPERM; > > > > BUG_ON(!current->mm); > > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h > > index a8e5f3ea9bb2..d81e30074bf5 100644 > > --- a/include/linux/userfaultfd_k.h > > +++ b/include/linux/userfaultfd_k.h > > @@ -31,6 +31,7 @@ > > #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS) > > > > extern int sysctl_unprivileged_userfaultfd; > > +extern int sysctl_unprivileged_userfaultfd_user_mode_only; > > > > extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason); > > > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > > index 8a176d8727a3..9cbdf4483961 100644 > > --- a/kernel/sysctl.c > > +++ b/kernel/sysctl.c > > @@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = { > > .extra1 = SYSCTL_ZERO, > > .extra2 = SYSCTL_ONE, > > }, > > + { > > + .procname = "unprivileged_userfaultfd_user_mode_only", > > + .data = &sysctl_unprivileged_userfaultfd_user_mode_only, > > + .maxlen = sizeof(sysctl_unprivileged_userfaultfd_user_mode_only), > > + .mode = 0644, > > + .proc_handler = proc_dointvec_minmax, > > + .extra1 = SYSCTL_ZERO, > > + .extra2 = SYSCTL_ONE, > > + }, > > #endif > > { } > > }; > > -- > > 2.26.2.303.gf8c07b1a785-goog > >
Hello Jonathan and everyone, On Thu, May 07, 2020 at 01:15:03PM -0600, Jonathan Corbet wrote: > On Wed, 6 May 2020 15:38:16 -0400 > Peter Xu <peterx@redhat.com> wrote: > > > If this is going to be added... I am thinking whether it should be easier to > > add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.: > > > > "0": unprivileged userfaultfd forbidden > > "1": unprivileged userfaultfd allowed (both user/kernel faults) > > "2": unprivileged userfaultfd allowed (only user faults) > > > > Because after all unprivileged_userfaultfd_user_mode_only will be meaningless > > (iiuc) if unprivileged_userfaultfd=0. The default value will also be the same > > as before ("1") > It occurs to me to wonder whether this interface should also let an admin > block *privileged* user from handling kernel-space faults? In a > secure-boot/lockdown setting, this could be a hardening measure that keeps > a (somewhat) restricted root user from expanding their privilege...? That's a good question. In my view if as root in lockdown mode you can still run the swapon syscall and setup nfs or other network devices and load userland fuse filesystems or cuse chardev in userland, even if you prevent userfaultfd from blocking kernel faults, kernel faults can still be blocked by other means. That in fact tends to be true also as non root (so regardless of lockdown settings) since luser can generally load fuse filesystems. There is no fundamental integrity breakage or privilege escalation originating in userfaultfd. The only concern here is about this: "after a new use-after-free is discovered in some other part of the kernel (not related to userfaultfd), how easy it is to turn the use-after-free from a mere DoS to a more concerning privilege escalation?". userfaultfd might facilitate the exploitation, but even if you remove userfaultfd from the equation, there's still no guarantee an user-after-free won't materialize as a privilege escalation by other means. So to express it in another way: unless lockdown (no matter in which mode) is a weak probabilistic based feature and in turn it cannot provide any guarantee to begin with, userfaultfd sysctl set to 0|1|2 can't possibly make any difference to it. The best mitigation for those kind of exploits remains to randomize all kernel memory allocations, so even if the attacker can block the fault, when it's unblocked it'll pick another page, not the one that the attacker can predict it will use, so the attacker needs to repeat the race many more times and hopefully it'll DoS and destabilize the kernel before it can reproduce a privilege escalation. We got many of those randomization features in the current kernel and it's probably more important to enable those than to worry about this sysctl value. One way to have a peace of mind against all use-after-free regardless of this sysctl value, is to run each pod in a KVM instance, that's safer than disabling syscalls or kernel features. The default seccomp profiles of podman already block userfaultfd too, so there's no need of virt to get extra safety if you use containers: containers need to explicitly opt-in to enable userfaultfd through the OCI schema seccomp object. If userfaultfd is being explicitly whitelisted in the OCI schema of the container, well then you know there is a good reason for it. As a matter of fact some things are only possible to achieve with userfaultfd fully enabled. The big value uffd brings compared to trapping sigsegv is precisely to be able to handle kernel faults transparently. sigsegv can't do that because every syscall would return 1) an inconsistent retval and 2) no fault address along with the retval. The possible future uffd userland users could be: dropping JVM dirty bit, redis snapshot using pthread_create() instead of fork(), distributed shared memory on pmem, new malloc() implementation never taking mmap_sem for writing in the kernel and never modifying any vma to allocate and free anon memory, etc.. I don't think any of them would work with the sysctl set to "2". The next kernel feature in uffd land that I was discussing with Peter, is an async uffd event model to further optimize the replacement of soft-dirty (which uffd already provides in O(1) instead of O(N)), so the wrprotect fault won't have to block anymore until the uffd async queue overflows. That also is unlikely to work with the sysctl set to "2" without adding extra constraints that soft-dirty doesn't currently have. It would also be possible to implement the value "2" to work like /proc/sys/kernel/unprivileged_bpf_disabled, so when you set it to "1" as root, you can't set it to "2" or "0" and when you set it to "2" you can't set it to "0", but personally I think it's unnecessary. Thanks, Andrea
Hello everyone, On Fri, May 08, 2020 at 12:54:03PM -0400, Michael S. Tsirkin wrote: > On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote: > > On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote: > > > This sysctl can be set to either zero or one. When zero (the default) > > > the system lets all users call userfaultfd with or without > > > UFFD_USER_MODE_ONLY, modulo other access controls. When > > > unprivileged_userfaultfd_user_mode_only is set to one, users without > > > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API > > > will fail with EPERM. This facility allows administrators to reduce > > > the likelihood that an attacker with access to userfaultfd can delay > > > faulting kernel code to widen timing windows for other exploits. > > > > > > Signed-off-by: Daniel Colascione <dancol@google.com> > > > > The approach taken looks like a hard-coded security policy. > > For example, it won't be possible to set the sysctl knob > > in question on any sytem running kvm. So this is > > no good for any general purpose system. > > > > What's wrong with using a security policy for this instead? > > In fact I see the original thread already mentions selinux, > so it's just a question of making this controllable by > selinux. I agree it'd be preferable if it was not hardcoded, but then this patchset is also much simpler than the previous controlling it through selinux.. I was thinking, an alternative policy that could control it without hard-coding it, is a seccomp-bpf filter, then you can drop 2/2 as well, not just 1/6-4/6. If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be always called with flags==0x1 without requiring extra modifications in the kernel? Can't you get the feature party with the CAP_SYS_PTRACE capability too, if you don't wrap those tasks with the ptrace capability under that seccomp filter? As far as I can tell, it's unprecedented to create a flag for a syscall API, with the only purpose of implementing a seccomp-bpf filter verifying such flag is set, but then if you want to control it with LSM it's even more complex than doing it with seccomp-bpf, and it requires more kernel code too. We could always add 2/2 later, such possibility won't disappear, in fact we could also add 1/6-4/6 later too if that is not enough. If we could begin by merging only 1/2 from this new series and be done with the kernel changes, because we offload the rest of the work to the kernel eBPF JIT, I think it'd be ideal. Thanks, Andrea
On Wed, May 20, 2020 at 12:59:38AM -0400, Andrea Arcangeli wrote: > Hello everyone, > > On Fri, May 08, 2020 at 12:54:03PM -0400, Michael S. Tsirkin wrote: > > On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote: > > > On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote: > > > > This sysctl can be set to either zero or one. When zero (the default) > > > > the system lets all users call userfaultfd with or without > > > > UFFD_USER_MODE_ONLY, modulo other access controls. When > > > > unprivileged_userfaultfd_user_mode_only is set to one, users without > > > > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API > > > > will fail with EPERM. This facility allows administrators to reduce > > > > the likelihood that an attacker with access to userfaultfd can delay > > > > faulting kernel code to widen timing windows for other exploits. > > > > > > > > Signed-off-by: Daniel Colascione <dancol@google.com> > > > > > > The approach taken looks like a hard-coded security policy. > > > For example, it won't be possible to set the sysctl knob > > > in question on any sytem running kvm. So this is > > > no good for any general purpose system. Not all systems run unprivileged KVM. :) > > > What's wrong with using a security policy for this instead? > > > > In fact I see the original thread already mentions selinux, > > so it's just a question of making this controllable by > > selinux. > > I agree it'd be preferable if it was not hardcoded, but then this > patchset is also much simpler than the previous controlling it through > selinux.. > > I was thinking, an alternative policy that could control it without > hard-coding it, is a seccomp-bpf filter, then you can drop 2/2 as > well, not just 1/6-4/6. Err, did I miss a separate 6-patch series? I can't find anything on lore. > > If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be > always called with flags==0x1 without requiring extra modifications in > the kernel? Please no. This is way too much overhead for something that a system owner wants to enforce globally. A sysctl is the correct option here, IMO. If it needs to be a per-userns sysctl, that would be fine too. > Can't you get the feature party with the CAP_SYS_PTRACE capability > too, if you don't wrap those tasks with the ptrace capability under > that seccomp filter? > > As far as I can tell, it's unprecedented to create a flag for a > syscall API, with the only purpose of implementing a seccomp-bpf > filter verifying such flag is set, but then if you want to control it > with LSM it's even more complex than doing it with seccomp-bpf, and it > requires more kernel code too. We could always add 2/2 later, such > possibility won't disappear, in fact we could also add 1/6-4/6 later > too if that is not enough. > > If we could begin by merging only 1/2 from this new series and be done > with the kernel changes, because we offload the rest of the work to > the kernel eBPF JIT, I think it'd be ideal. I'd agree that patch 1 should land, as it appears to be required for any further policy considerations. I'm still a big fan of a sysctl since this is the kind of thing I would absolutely turn on globally for all my systems.
Hello Kees, On Wed, May 20, 2020 at 11:03:39AM -0700, Kees Cook wrote: > Err, did I miss a separate 6-patch series? I can't find anything on lore. Daniel included the link of the previous series I referred to is the cover letter 0/2: https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/ > > If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be > > always called with flags==0x1 without requiring extra modifications in > > the kernel? > > Please no. This is way too much overhead for something that a system > owner wants to enforce globally. A sysctl is the correct option here, > IMO. If it needs to be a per-userns sysctl, that would be fine too. The question is who could be this system owner who prefers "2" to "0"? per-ns I don't see the point either when all containers already run with default policies enforcing the same behavior as if the sysctl is set to "0". Why exactly is it preferable to enlarge the surface of attack of the kernel and take the risk there is a real bug in userfaultfd code (not just a facilitation of exploiting some other kernel bug) that leads to a privilege escalation, when you still break 99% of userfaultfd users, if you set with option "2"? Is the system owner really going to purely run on his systems CRIU postcopy live migration (which already runs with CAP_SYS_PTRACE) and nothing else that could break? Option "2" to me looks with a single possible user, and incidentally this single user can already enforce model "2" by only tweaking its seccomp-bpf filters without applying 2/2. It'd be a bug if android apps runs unprotected by seccomp regardless of 2/2. System owners I think would be better of to stick to "0" or "1" a far as I can tell, "2" looks a bad tradeoff as system value, with nearly all cons of "0", but less secure than "0". > I'd agree that patch 1 should land, as it appears to be required for any Agreed about merging 1/2. > further policy considerations. I'm still a big fan of a sysctl since > this is the kind of thing I would absolutely turn on globally for all my > systems. The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there upstream and you should have already set it to "0" in all your systems if you can cope with some app features not working. 2/2 as modified with Peter's suggestion, would add a new value "2" (in addition of "1" and "0"), that still breaks the majority of all possible users, just like value "0", but that gives less security than value "0". It all boils down of how peculiar it is to be able to leverage only the acceleration (reduction in context switches and enter/exit kernel) and the vma fragmentation avoidance (to avoid running of vmas in /proc/sys/vm/max_map_count) provided by userfaultfd (vs sigsegv trapping) but not the transparency in handling faults in kernel (which sigsegv can't possibly achieve). Right now there's a single user that can cope with that limitation, and it's not your running on your servers but on your phone. Even CRIU cannot cope with such limitation, the only reason it would cope with value "2" is that it already runs with CAP_SYS_PTRACE for other reasons. If there will be more users that can cope with handling only user initiated page faults, then yes, I think it'd be fine to add a value "2" later. The other benefit of enforcing the policy with seccomp-bpf if that if you'd run Android userland on a container on a ARM server on top of an enterprise kernel, it'd already run as safe as if the sysctl value was tweaked to "2", but without having to also add extra kernel code for per-ns sysctl. It just looks simpler. The rest of the containers on the same host are already running today as if the sysctl is set to 0 regardless of this patchset. If you want to enforce maximum security and override any possible opt-out of the default podman seccomp profile that blocks userfaultfd, you already can upstream with the global sysctl by setting it to "0". Thanks, Andrea
On Wed, May 20, 2020 at 03:48:04PM -0400, Andrea Arcangeli wrote:
> The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there
Oops I picked the wrong unprivileged_* :) of course I meant:
/proc/sys/vm/unprivileged_userfaultfd
Adding the Android kernel team in the discussion. On Wed, May 20, 2020 at 12:51 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Wed, May 20, 2020 at 03:48:04PM -0400, Andrea Arcangeli wrote: > > The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there > > Oops I picked the wrong unprivileged_* :) of course I meant: > /proc/sys/vm/unprivileged_userfaultfd >
On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote:
> Adding the Android kernel team in the discussion.
Unless I'm mistaken that you can already enforce bit 1 of the second
parameter of the userfaultfd syscall to be set with seccomp-bpf, this
would be more a question to the Android userland team.
The question would be: does it ever happen that a seccomp filter isn't
already applied to unprivileged software running without
SYS_CAP_PTRACE capability?
If answer is "no" the behavior of the new sysctl in patch 2/2 (in
subject) should be enforceable with minor changes to the BPF
assembly. Otherwise it'd require more changes.
Thanks!
Andrea
On Wed, May 20, 2020 at 11:17 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote: > > Adding the Android kernel team in the discussion. > > Unless I'm mistaken that you can already enforce bit 1 of the second > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > would be more a question to the Android userland team. > > The question would be: does it ever happen that a seccomp filter isn't > already applied to unprivileged software running without > SYS_CAP_PTRACE capability? Yes. Android uses selinux as our primary sandboxing mechanism. We do use seccomp on a few processes, but we have found that it has a surprisingly high performance cost [1] on arm64 devices so turning it on system wide is not a good option. [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > subject) should be enforceable with minor changes to the BPF > assembly. Otherwise it'd require more changes. > > Thanks! > Andrea >
Daniel, the original contributor of this patchset, has moved to another company. Adding his personal email, in case he still wants to be involved. From the discussion so far it seems that there is a consensus that patch 1/2 in this series should be upstreamed in any case. Is there anything that is pending on that patch? On Fri, Jul 17, 2020 at 5:57 AM Jeffrey Vander Stoep <jeffv@google.com> wrote: > > On Wed, May 20, 2020 at 11:17 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote: > > > Adding the Android kernel team in the discussion. > > > > Unless I'm mistaken that you can already enforce bit 1 of the second > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > > would be more a question to the Android userland team. > > > > The question would be: does it ever happen that a seccomp filter isn't > > already applied to unprivileged software running without > > SYS_CAP_PTRACE capability? > > Yes. > > Android uses selinux as our primary sandboxing mechanism. We do use > seccomp on a few processes, but we have found that it has a > surprisingly high performance cost [1] on arm64 devices so turning it > on system wide is not a good option. > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > > subject) should be enforceable with minor changes to the BPF > > assembly. Otherwise it'd require more changes. > > Adding Nick (Jeff is already here) to respond to Andrea's concerns about adding option '2' to sysctl knob. > > Thanks! > > Andrea > >
On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > From the discussion so far it seems that there is a consensus that > patch 1/2 in this series should be upstreamed in any case. Is there > anything that is pending on that patch? That's my reading of this thread too. > > > Unless I'm mistaken that you can already enforce bit 1 of the second > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > > > would be more a question to the Android userland team. > > > > > > The question would be: does it ever happen that a seccomp filter isn't > > > already applied to unprivileged software running without > > > SYS_CAP_PTRACE capability? > > > > Yes. > > > > Android uses selinux as our primary sandboxing mechanism. We do use > > seccomp on a few processes, but we have found that it has a > > surprisingly high performance cost [1] on arm64 devices so turning it > > on system wide is not a good option. > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad As Jeff mentioned, seccomp is used strategically on Android, but is not applied to all processes. It's too expensive and impractical when simpler implementations (such as this sysctl) can exist. It's also significantly simpler to test a sysctl value for correctness as opposed to a seccomp filter. > > > > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > > > subject) should be enforceable with minor changes to the BPF > > > assembly. Otherwise it'd require more changes. It would be good to understand what these changes are. > > > Why exactly is it preferable to enlarge the surface of attack of the > > > kernel and take the risk there is a real bug in userfaultfd code (not > > > just a facilitation of exploiting some other kernel bug) that leads to > > > a privilege escalation, when you still break 99% of userfaultfd users, > > > if you set with option "2"? I can see your point if you think about the feature as a whole. However, distributions (such as Android) have specialized knowledge of their security environments, and may not want to support the typical usages of userfaultfd. For such distributions, providing a mechanism to prevent userfaultfd from being useful as an exploit primitive, while still allowing the very limited use of userfaultfd for userspace faults only, is desirable. Distributions shouldn't be forced into supporting 100% of the use cases envisioned by userfaultfd when their needs may be more specialized, and this sysctl knob empowers distributions to make this choice for themselves. > > > Is the system owner really going to purely run on his systems CRIU > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and > > > nothing else that could break? This is a great example of a capability which a distribution may not want to support, due to distribution specific security policies. > > > > > > Option "2" to me looks with a single possible user, and incidentally > > > this single user can already enforce model "2" by only tweaking its > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android > > > apps runs unprotected by seccomp regardless of 2/2. Can you elaborate on what bug is present by processes being unprotected by seccomp? Seccomp cannot be universally applied on Android due to previously mentioned performance concerns. Seccomp is used in Android primarily as a tool to enforce the list of allowed syscalls, so that such syscalls can be audited before being included as part of the Android API. -- Nick
On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote: > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > From the discussion so far it seems that there is a consensus that > > patch 1/2 in this series should be upstreamed in any case. Is there > > anything that is pending on that patch? > > That's my reading of this thread too. > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > > > > would be more a question to the Android userland team. > > > > > > > > The question would be: does it ever happen that a seccomp filter isn't > > > > already applied to unprivileged software running without > > > > SYS_CAP_PTRACE capability? > > > > > > Yes. > > > > > > Android uses selinux as our primary sandboxing mechanism. We do use > > > seccomp on a few processes, but we have found that it has a > > > surprisingly high performance cost [1] on arm64 devices so turning it > > > on system wide is not a good option. > > > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > As Jeff mentioned, seccomp is used strategically on Android, but is > not applied to all processes. It's too expensive and impractical when > simpler implementations (such as this sysctl) can exist. It's also > significantly simpler to test a sysctl value for correctness as > opposed to a seccomp filter. Given that selinux is already used system-wide on Android, what is wrong with using selinux to control userfaultfd as opposed to seccomp? > > > > > > > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > > > > subject) should be enforceable with minor changes to the BPF > > > > assembly. Otherwise it'd require more changes. > > It would be good to understand what these changes are. > > > > > Why exactly is it preferable to enlarge the surface of attack of the > > > > kernel and take the risk there is a real bug in userfaultfd code (not > > > > just a facilitation of exploiting some other kernel bug) that leads to > > > > a privilege escalation, when you still break 99% of userfaultfd users, > > > > if you set with option "2"? > > I can see your point if you think about the feature as a whole. > However, distributions (such as Android) have specialized knowledge of > their security environments, and may not want to support the typical > usages of userfaultfd. For such distributions, providing a mechanism > to prevent userfaultfd from being useful as an exploit primitive, > while still allowing the very limited use of userfaultfd for userspace > faults only, is desirable. Distributions shouldn't be forced into > supporting 100% of the use cases envisioned by userfaultfd when their > needs may be more specialized, and this sysctl knob empowers > distributions to make this choice for themselves. > > > > > Is the system owner really going to purely run on his systems CRIU > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and > > > > nothing else that could break? > > This is a great example of a capability which a distribution may not > want to support, due to distribution specific security policies. > > > > > > > > > Option "2" to me looks with a single possible user, and incidentally > > > > this single user can already enforce model "2" by only tweaking its > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android > > > > apps runs unprotected by seccomp regardless of 2/2. > > Can you elaborate on what bug is present by processes being > unprotected by seccomp? > > Seccomp cannot be universally applied on Android due to previously > mentioned performance concerns. Seccomp is used in Android primarily > as a tool to enforce the list of allowed syscalls, so that such > syscalls can be audited before being included as part of the Android > API. > > -- Nick > > -- > Nick Kralevich | nnk@google.com
On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote: > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > From the discussion so far it seems that there is a consensus that > > > patch 1/2 in this series should be upstreamed in any case. Is there > > > anything that is pending on that patch? > > > > That's my reading of this thread too. > > > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > > > > > would be more a question to the Android userland team. > > > > > > > > > > The question would be: does it ever happen that a seccomp filter isn't > > > > > already applied to unprivileged software running without > > > > > SYS_CAP_PTRACE capability? > > > > > > > > Yes. > > > > > > > > Android uses selinux as our primary sandboxing mechanism. We do use > > > > seccomp on a few processes, but we have found that it has a > > > > surprisingly high performance cost [1] on arm64 devices so turning it > > > > on system wide is not a good option. > > > > > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > > > As Jeff mentioned, seccomp is used strategically on Android, but is > > not applied to all processes. It's too expensive and impractical when > > simpler implementations (such as this sysctl) can exist. It's also > > significantly simpler to test a sysctl value for correctness as > > opposed to a seccomp filter. > > Given that selinux is already used system-wide on Android, what is wrong > with using selinux to control userfaultfd as opposed to seccomp? Userfaultfd file descriptors will be generally controlled by SELinux. You can see the patchset at https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/ (which is also referenced in the original commit message for this patchset). However, the SELinux patchset doesn't include the ability to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly. SELinux already has the ability to control who gets CAP_SYS_PTRACE, which combined with this patch, is largely equivalent to direct UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch above, movement of userfaultfd file descriptors can be mediated by SELinux, preventing one process from acquiring userfaultfd descriptors of other processes unless allowed by security policy. It's an interesting question whether finer-grain SELinux support for controlling UFFD_USER_MODE_ONLY should be added. I can see some advantages to implementing this. However, we don't need to decide that now. Kernel security checks generally break down into DAC (discretionary access control) and MAC (mandatory access control) controls. Most kernel security features check via both of these mechanisms. Security attributes of the system should be settable without necessarily relying on an LSM such as SELinux. This patch follows the same basic model -- system wide control of a hardening feature is provided by the unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed, SELinux support for this can also be implemented on top of the DAC controls. This DAC/MAC split has been successful in several other security features. For example, the ability to map at page zero is controlled in DAC via the mmap_min_addr sysctl [1], and via SELinux via the mmap_zero access vector [2]. Similarly, access to the kernel ring buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict sysctl is very similar to this patch -- it introduces a capability (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource. If we want to ensure that a security feature will be well tested and vetted, it's important to not limit its use to LSMs only. This ensures that kernel and application developers will always be able to test the effects of a security feature, without relying on LSMs like SELinux. It also ensures that all distributions can enable this security mitigation should it be necessary for their unique environments, without introducing an SELinux dependency. And this patch does not preclude an SELinux implementation should it be necessary. Even if we decide to implement fine-grain SELinux controls on UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this an either/or choice between SELinux and this patch. Both are necessary. -- Nick [1] https://wiki.debian.org/mmap_min_addr [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt > > > > > > > > > > > > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > > > > > subject) should be enforceable with minor changes to the BPF > > > > > assembly. Otherwise it'd require more changes. > > > > It would be good to understand what these changes are. > > > > > > > Why exactly is it preferable to enlarge the surface of attack of the > > > > > kernel and take the risk there is a real bug in userfaultfd code (not > > > > > just a facilitation of exploiting some other kernel bug) that leads to > > > > > a privilege escalation, when you still break 99% of userfaultfd users, > > > > > if you set with option "2"? > > > > I can see your point if you think about the feature as a whole. > > However, distributions (such as Android) have specialized knowledge of > > their security environments, and may not want to support the typical > > usages of userfaultfd. For such distributions, providing a mechanism > > to prevent userfaultfd from being useful as an exploit primitive, > > while still allowing the very limited use of userfaultfd for userspace > > faults only, is desirable. Distributions shouldn't be forced into > > supporting 100% of the use cases envisioned by userfaultfd when their > > needs may be more specialized, and this sysctl knob empowers > > distributions to make this choice for themselves. > > > > > > > Is the system owner really going to purely run on his systems CRIU > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and > > > > > nothing else that could break? > > > > This is a great example of a capability which a distribution may not > > want to support, due to distribution specific security policies. > > > > > > > > > > > > Option "2" to me looks with a single possible user, and incidentally > > > > > this single user can already enforce model "2" by only tweaking its > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android > > > > > apps runs unprotected by seccomp regardless of 2/2. > > > > Can you elaborate on what bug is present by processes being > > unprotected by seccomp? > > > > Seccomp cannot be universally applied on Android due to previously > > mentioned performance concerns. Seccomp is used in Android primarily > > as a tool to enforce the list of allowed syscalls, so that such > > syscalls can be audited before being included as part of the Android > > API. > > > > -- Nick > > > > -- > > Nick Kralevich | nnk@google.com >
On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote: > On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote: > > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > From the discussion so far it seems that there is a consensus that > > > > patch 1/2 in this series should be upstreamed in any case. Is there > > > > anything that is pending on that patch? > > > > > > That's my reading of this thread too. > > > > > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second > > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > > > > > > would be more a question to the Android userland team. > > > > > > > > > > > > The question would be: does it ever happen that a seccomp filter isn't > > > > > > already applied to unprivileged software running without > > > > > > SYS_CAP_PTRACE capability? > > > > > > > > > > Yes. > > > > > > > > > > Android uses selinux as our primary sandboxing mechanism. We do use > > > > > seccomp on a few processes, but we have found that it has a > > > > > surprisingly high performance cost [1] on arm64 devices so turning it > > > > > on system wide is not a good option. > > > > > > > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > > > > > As Jeff mentioned, seccomp is used strategically on Android, but is > > > not applied to all processes. It's too expensive and impractical when > > > simpler implementations (such as this sysctl) can exist. It's also > > > significantly simpler to test a sysctl value for correctness as > > > opposed to a seccomp filter. > > > > Given that selinux is already used system-wide on Android, what is wrong > > with using selinux to control userfaultfd as opposed to seccomp? > > Userfaultfd file descriptors will be generally controlled by SELinux. > You can see the patchset at > https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/ > (which is also referenced in the original commit message for this > patchset). However, the SELinux patchset doesn't include the ability > to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly. > > SELinux already has the ability to control who gets CAP_SYS_PTRACE, > which combined with this patch, is largely equivalent to direct > UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch > above, movement of userfaultfd file descriptors can be mediated by > SELinux, preventing one process from acquiring userfaultfd descriptors > of other processes unless allowed by security policy. > > It's an interesting question whether finer-grain SELinux support for > controlling UFFD_USER_MODE_ONLY should be added. I can see some > advantages to implementing this. However, we don't need to decide that > now. > > Kernel security checks generally break down into DAC (discretionary > access control) and MAC (mandatory access control) controls. Most > kernel security features check via both of these mechanisms. Security > attributes of the system should be settable without necessarily > relying on an LSM such as SELinux. This patch follows the same basic > model -- system wide control of a hardening feature is provided by the > unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed, > SELinux support for this can also be implemented on top of the DAC > controls. > > This DAC/MAC split has been successful in several other security > features. For example, the ability to map at page zero is controlled > in DAC via the mmap_min_addr sysctl [1], and via SELinux via the > mmap_zero access vector [2]. Similarly, access to the kernel ring > buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as > well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict > sysctl is very similar to this patch -- it introduces a capability > (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource. > > If we want to ensure that a security feature will be well tested and > vetted, it's important to not limit its use to LSMs only. This ensures > that kernel and application developers will always be able to test the > effects of a security feature, without relying on LSMs like SELinux. > It also ensures that all distributions can enable this security > mitigation should it be necessary for their unique environments, > without introducing an SELinux dependency. And this patch does not > preclude an SELinux implementation should it be necessary. > > Even if we decide to implement fine-grain SELinux controls on > UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this > an either/or choice between SELinux and this patch. Both are > necessary. > > -- Nick > > [1] https://wiki.debian.org/mmap_min_addr > [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions > [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt I am not sure I agree this is similar to dmesg access. The reason I say it is this: it is pretty easy for admins to know whether they run something that needs to access the kernel ring buffer. Or if it's a tool developer poking at dmesg, they can tell admins "we need these permissions". But it seems impossible for either an admin to know that a userfaultfd page e.g. used with shared memory is accessed from the kernel. So I guess the question is: how does anyone not running Android know to set this flag? I got the feeling it's not really possible, and so for a single-user feature like this a single API seems enough. Given a choice between a knob an admin is supposed to set and selinux policy written by presumably knowledgeable OS vendors, I'd opt for a second option. Hope this helps. > > > > > > > > > > > > > > > > > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > > > > > > subject) should be enforceable with minor changes to the BPF > > > > > > assembly. Otherwise it'd require more changes. > > > > > > It would be good to understand what these changes are. > > > > > > > > > Why exactly is it preferable to enlarge the surface of attack of the > > > > > > kernel and take the risk there is a real bug in userfaultfd code (not > > > > > > just a facilitation of exploiting some other kernel bug) that leads to > > > > > > a privilege escalation, when you still break 99% of userfaultfd users, > > > > > > if you set with option "2"? > > > > > > I can see your point if you think about the feature as a whole. > > > However, distributions (such as Android) have specialized knowledge of > > > their security environments, and may not want to support the typical > > > usages of userfaultfd. For such distributions, providing a mechanism > > > to prevent userfaultfd from being useful as an exploit primitive, > > > while still allowing the very limited use of userfaultfd for userspace > > > faults only, is desirable. Distributions shouldn't be forced into > > > supporting 100% of the use cases envisioned by userfaultfd when their > > > needs may be more specialized, and this sysctl knob empowers > > > distributions to make this choice for themselves. > > > > > > > > > Is the system owner really going to purely run on his systems CRIU > > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and > > > > > > nothing else that could break? > > > > > > This is a great example of a capability which a distribution may not > > > want to support, due to distribution specific security policies. > > > > > > > > > > > > > > > Option "2" to me looks with a single possible user, and incidentally > > > > > > this single user can already enforce model "2" by only tweaking its > > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android > > > > > > apps runs unprotected by seccomp regardless of 2/2. > > > > > > Can you elaborate on what bug is present by processes being > > > unprotected by seccomp? > > > > > > Seccomp cannot be universally applied on Android due to previously > > > mentioned performance concerns. Seccomp is used in Android primarily > > > as a tool to enforce the list of allowed syscalls, so that such > > > syscalls can be audited before being included as part of the Android > > > API. > > > > > > -- Nick > > > > > > -- > > > Nick Kralevich | nnk@google.com > > > > > -- > Nick Kralevich | nnk@google.com
On Wed, Aug 5, 2020 at 10:44 PM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote: > > On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote: > > > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > > From the discussion so far it seems that there is a consensus that > > > > > patch 1/2 in this series should be upstreamed in any case. Is there > > > > > anything that is pending on that patch? > > > > > > > > That's my reading of this thread too. > > > > > > > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second > > > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this > > > > > > > would be more a question to the Android userland team. > > > > > > > > > > > > > > The question would be: does it ever happen that a seccomp filter isn't > > > > > > > already applied to unprivileged software running without > > > > > > > SYS_CAP_PTRACE capability? > > > > > > > > > > > > Yes. > > > > > > > > > > > > Android uses selinux as our primary sandboxing mechanism. We do use > > > > > > seccomp on a few processes, but we have found that it has a > > > > > > surprisingly high performance cost [1] on arm64 devices so turning it > > > > > > on system wide is not a good option. > > > > > > > > > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad > > > > > > > > As Jeff mentioned, seccomp is used strategically on Android, but is > > > > not applied to all processes. It's too expensive and impractical when > > > > simpler implementations (such as this sysctl) can exist. It's also > > > > significantly simpler to test a sysctl value for correctness as > > > > opposed to a seccomp filter. > > > > > > Given that selinux is already used system-wide on Android, what is wrong > > > with using selinux to control userfaultfd as opposed to seccomp? > > > > Userfaultfd file descriptors will be generally controlled by SELinux. > > You can see the patchset at > > https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/ > > (which is also referenced in the original commit message for this > > patchset). However, the SELinux patchset doesn't include the ability > > to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly. > > > > SELinux already has the ability to control who gets CAP_SYS_PTRACE, > > which combined with this patch, is largely equivalent to direct > > UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch > > above, movement of userfaultfd file descriptors can be mediated by > > SELinux, preventing one process from acquiring userfaultfd descriptors > > of other processes unless allowed by security policy. > > > > It's an interesting question whether finer-grain SELinux support for > > controlling UFFD_USER_MODE_ONLY should be added. I can see some > > advantages to implementing this. However, we don't need to decide that > > now. > > > > Kernel security checks generally break down into DAC (discretionary > > access control) and MAC (mandatory access control) controls. Most > > kernel security features check via both of these mechanisms. Security > > attributes of the system should be settable without necessarily > > relying on an LSM such as SELinux. This patch follows the same basic > > model -- system wide control of a hardening feature is provided by the > > unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed, > > SELinux support for this can also be implemented on top of the DAC > > controls. > > > > This DAC/MAC split has been successful in several other security > > features. For example, the ability to map at page zero is controlled > > in DAC via the mmap_min_addr sysctl [1], and via SELinux via the > > mmap_zero access vector [2]. Similarly, access to the kernel ring > > buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as > > well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict > > sysctl is very similar to this patch -- it introduces a capability > > (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource. > > > > If we want to ensure that a security feature will be well tested and > > vetted, it's important to not limit its use to LSMs only. This ensures > > that kernel and application developers will always be able to test the > > effects of a security feature, without relying on LSMs like SELinux. > > It also ensures that all distributions can enable this security > > mitigation should it be necessary for their unique environments, > > without introducing an SELinux dependency. And this patch does not > > preclude an SELinux implementation should it be necessary. > > > > Even if we decide to implement fine-grain SELinux controls on > > UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this > > an either/or choice between SELinux and this patch. Both are > > necessary. > > > > -- Nick > > > > [1] https://wiki.debian.org/mmap_min_addr > > [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions > > [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt > > I am not sure I agree this is similar to dmesg access. > > The reason I say it is this: it is pretty easy for admins to know > whether they run something that needs to access the kernel ring buffer. > Or if it's a tool developer poking at dmesg, they can tell admins "we > need these permissions". But it seems impossible for either an admin to > know that a userfaultfd page e.g. used with shared memory is accessed > from the kernel. > > So I guess the question is: how does anyone not running Android > know to set this flag? > > I got the feeling it's not really possible, and so for a single-user > feature like this a single API seems enough. Given a choice between a > knob an admin is supposed to set and selinux policy written by > presumably knowledgeable OS vendors, I'd opt for a second option. > > Hope this helps. > There has been an emphasis that Android is probably the only user for the restriction of userfaults from kernel-space and that it wouldn’t be useful anywhere else. I humbly disagree! There are various areas where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely user-space setting. Basically, any lazy, on-demand, initialization/decompression/loading could be a good candidate for this trick. My project happens to be one of them. In fact, in Android we are also thinking of using it in some other places, all in user-space. And given that userfaultfd is an efficient replacement for this trick [1], there are various scenarios which would benefit from the restriction of userfaults from kernel-space, provided the admins care about security on such devices. IIUC, a security admin would never trust an unprivileged process with userfaults from kernel space. Therefore, a sysctl knob restriction with CAP_SYS_PTRACE for privileged processes seems like the right choice to me. Coming to sysctl vs. SELinux debate, I think wherever the role of OS vendor and admin is played by different people, I doubt a generic SELinux policy set by the former will be blindly acceptable to the latter. Furthermore, I’m not sure if an admin is expected to even know which packages running on their system are using userfaultfd. So they anyway have to rely on developers reaching out to get the required permission. With the new sysctl knob enabled, the number of such requests is only going to decrease. [1] https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt > > > > > > > > > > > > > > > > > > > > > > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in > > > > > > > subject) should be enforceable with minor changes to the BPF > > > > > > > assembly. Otherwise it'd require more changes. > > > > > > > > It would be good to understand what these changes are. > > > > > > > > > > > Why exactly is it preferable to enlarge the surface of attack of the > > > > > > > kernel and take the risk there is a real bug in userfaultfd code (not > > > > > > > just a facilitation of exploiting some other kernel bug) that leads to > > > > > > > a privilege escalation, when you still break 99% of userfaultfd users, > > > > > > > if you set with option "2"? > > > > > > > > I can see your point if you think about the feature as a whole. > > > > However, distributions (such as Android) have specialized knowledge of > > > > their security environments, and may not want to support the typical > > > > usages of userfaultfd. For such distributions, providing a mechanism > > > > to prevent userfaultfd from being useful as an exploit primitive, > > > > while still allowing the very limited use of userfaultfd for userspace > > > > faults only, is desirable. Distributions shouldn't be forced into > > > > supporting 100% of the use cases envisioned by userfaultfd when their > > > > needs may be more specialized, and this sysctl knob empowers > > > > distributions to make this choice for themselves. > > > > > > > > > > > Is the system owner really going to purely run on his systems CRIU > > > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and > > > > > > > nothing else that could break? > > > > > > > > This is a great example of a capability which a distribution may not > > > > want to support, due to distribution specific security policies. > > > > > > > > > > > > > > > > > > Option "2" to me looks with a single possible user, and incidentally > > > > > > > this single user can already enforce model "2" by only tweaking its > > > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android > > > > > > > apps runs unprotected by seccomp regardless of 2/2. > > > > > > > > Can you elaborate on what bug is present by processes being > > > > unprotected by seccomp? > > > > > > > > Seccomp cannot be universally applied on Android due to previously > > > > mentioned performance concerns. Seccomp is used in Android primarily > > > > as a tool to enforce the list of allowed syscalls, so that such > > > > syscalls can be audited before being included as part of the Android > > > > API. > > > > > > > > -- Nick > > > > > > > > -- > > > > Nick Kralevich | nnk@google.com > > > > > > > > > -- > > Nick Kralevich | nnk@google.com >
Hello, On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote: > There has been an emphasis that Android is probably the only user for > the restriction of userfaults from kernel-space and that it wouldn’t > be useful anywhere else. I humbly disagree! There are various areas > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely > user-space setting. Basically, any lazy, on-demand, For the record what I said is quoted below https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com : """It all boils down of how peculiar it is to be able to leverage only the acceleration [..] Right now there's a single user that can cope with that limitation [..] If there will be more users [..] it'd be fine to add a value "2" later.""" Specifically I never said "that it wouldn’t be useful anywhere else.". Also I'm only arguing about the sysctl visible kABI change in patch 2/2: the flag passed as parameter to the syscall in patch 1/2 is all great, because seccomp needs it in the scalar parameter of the syscall to implement a filter equivalent to your sysctl "2" policy with only patch 1/2 applied. I've two more questions now: 1) why don't you enforce the block of kernel initiated faults with seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just an optimization to remove a few instructions per syscall in the bpf execution of Android unprivileged apps? You should block a lot of other syscalls by default to all unprivileged processes, including vmsplice. In other words if it's just for Android, why can't Android solve it with only patch 1/2 by tweaking the seccomp filter? 2) given that Android is secure enough with the sysctl at value 2, why should we even retain the current sysctl 0 semantics? Why can't more secure systems just use seccomp and block userfaultfd, as it is already happens by default in the podman default seccomp whitelist (for those containers that don't define a new json whitelist in the OCI schema)? Shouldn't we focus our energy in making containers more secure by preventing the OCI schema of a random container to re-enable userfaultfd in the container seccomp filter instead of trying to solve this with a global sysctl? What's missing in my view is a kubernetes hard allowlist/denylist that cannot be overridden with the OCI schema in case people has the bad idea of running containers downloaded from a not fully trusted source, without adding virt isolation and that's an userland problem to be solved in the container runtime, not a kernel issue. Then you'd just add userfaultfd to the json of the k8s hard seccomp denylist instead of going around tweaking sysctl. What's your take in changing your 2/2 patch to just replace value "0" and avoid introducing a new value "2"? The value "0" was motivated by the concern that uffd can enlarge the race window for use after free by providing one more additional way to block kernel faults, but value "2" is already enough to solve that concern completely and it'll be the default on all Android. In other words by adding "2" you're effectively doing a more finegrined and more optimal implementation of "0" that remains useful and available to unprivileged apps and it already resolves all "robustness against side effects other kernel bugs" concerns. Clearly "0" is even more secure statistically but that would apply to every other syscall including vmsplice, and there's no /proc/sys/vm/unprivileged_vmsplice sysctl out there. The next issue we have now is with the pipe mutex (which is not a major concern but we need to solve it somehow for correctness). So I wonder if should make the default value to be "0" (or "2" if think we should not replace "0") and to allow only user initiated faults by default. Changing the sysctl default to be 0, will make live migration fail to switch to postcopy which will be (unnoticeable to the guest), instead of risking the VM to be killed because of network latency outlier. Then we wouldn't need to change the pipe code at all. Alternatively we could still fix the pipe code so it runs better (but it'll be more complex) or to disable uffd faults only in the pipe code. One thing to keep in mind is that if we change the default, then hypervisor hosts running QEMU would need to set: vm.userfaultfd = 1 in /etc/sysctl.conf if postcopy live migration is required, that's not particularly concerning constraint for qemu (there are likely other tweaks required and it looks less risky than an arbitrary timeout which could kill the VM: if the above is forgotten the postcopy live migration won't even start and it'll be unnoticeable to the guest). The main concern really are future apps that may want to use uffd for kernel initiated faults won't be allowed to do so by default anymore, those apps will be heavily incentivated to use bounce buffers before passing data to syscalls, similarly to the current use case of patch 2/2. Comments welcome, Andrea PS. Another usage of uffd that remains possible without privilege with the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib will remain possible without requiring extra privileges, by adding a UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be indistinguishable to UFFDIO_ZEROPAGE but it will solve the last performance bottleneck by avoiding a wrprotect fault after the allocation and it will be THP capable too). Memory will be freed with MADV_DONTNEED, without ever having to call mmap/mumap. It could move memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP which already exists.
On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > Hello, > > On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote: > > There has been an emphasis that Android is probably the only user for > > the restriction of userfaults from kernel-space and that it wouldn’t > > be useful anywhere else. I humbly disagree! There are various areas > > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely > > user-space setting. Basically, any lazy, on-demand, > > For the record what I said is quoted below > https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com : > > """It all boils down of how peculiar it is to be able to leverage only > the acceleration [..] Right now there's a single user that can cope > with that limitation [..] If there will be more users [..] it'd be > fine to add a value "2" later.""" > > Specifically I never said "that it wouldn’t be useful anywhere else.". > Thanks a lot for clarifying. > Also I'm only arguing about the sysctl visible kABI change in patch > 2/2: the flag passed as parameter to the syscall in patch 1/2 is all > great, because seccomp needs it in the scalar parameter of the syscall > to implement a filter equivalent to your sysctl "2" policy with only > patch 1/2 applied. > > I've two more questions now: > > 1) why don't you enforce the block of kernel initiated faults with > seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just > an optimization to remove a few instructions per syscall in the bpf > execution of Android unprivileged apps? You should block a lot of > other syscalls by default to all unprivileged processes, including > vmsplice. > > In other words if it's just for Android, why can't Android solve it > with only patch 1/2 by tweaking the seccomp filter? I would let Nick (nnk@) and Jeff (jeffv@) respond to this. The previous responses from both of them on this email thread (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/ and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/) suggest that the performance overhead of seccomp-bpf is too much. Kees also objected to it (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/) I'm not familiar with how seccomp-bpf works. All that I can add here is that userfaultfd syscall is usually not invoked in a performance critical code path. So, if the performance overhead of seccomp-bpf (if enabled) is observed on all syscalls originating from a process, then I'd say patch 2/2 is essential. Otherwise, it should be ok to let seccomp perform the same functionality instead. > > 2) given that Android is secure enough with the sysctl at value 2, why > should we even retain the current sysctl 0 semantics? Why can't > more secure systems just use seccomp and block userfaultfd, as it > is already happens by default in the podman default seccomp > whitelist (for those containers that don't define a new json > whitelist in the OCI schema)? Shouldn't we focus our energy in > making containers more secure by preventing the OCI schema of a > random container to re-enable userfaultfd in the container seccomp > filter instead of trying to solve this with a global sysctl? > > What's missing in my view is a kubernetes hard allowlist/denylist > that cannot be overridden with the OCI schema in case people has > the bad idea of running containers downloaded from a not fully > trusted source, without adding virt isolation and that's an > userland problem to be solved in the container runtime, not a > kernel issue. Then you'd just add userfaultfd to the json of the > k8s hard seccomp denylist instead of going around tweaking sysctl. > > What's your take in changing your 2/2 patch to just replace value "0" > and avoid introducing a new value "2"? SGTM. Disabling uffd completely for unprivileged processes can be achieved either using seccomp-bpf, or via SELinux, once the following patch series is upstreamed https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@google.com/ > > The value "0" was motivated by the concern that uffd can enlarge the > race window for use after free by providing one more additional way to > block kernel faults, but value "2" is already enough to solve that > concern completely and it'll be the default on all Android. > > In other words by adding "2" you're effectively doing a more > finegrined and more optimal implementation of "0" that remains useful > and available to unprivileged apps and it already resolves all > "robustness against side effects other kernel bugs" concerns. Clearly > "0" is even more secure statistically but that would apply to every > other syscall including vmsplice, and there's no > /proc/sys/vm/unprivileged_vmsplice sysctl out there. > > The next issue we have now is with the pipe mutex (which is not a > major concern but we need to solve it somehow for correctness). So I > wonder if should make the default value to be "0" (or "2" if think we > should not replace "0") and to allow only user initiated faults by > default. > > Changing the sysctl default to be 0, will make live migration fail to > switch to postcopy which will be (unnoticeable to the guest), instead > of risking the VM to be killed because of network latency > outlier. Then we wouldn't need to change the pipe code at all. > SGTM. I can change the default value to '0' (or '2') in the next revision of patch 2/2, unless somebody objects to this. > Alternatively we could still fix the pipe code so it runs better (but > it'll be more complex) or to disable uffd faults only in the pipe > code. > > One thing to keep in mind is that if we change the default, then > hypervisor hosts running QEMU would need to set: > > vm.userfaultfd = 1 > > in /etc/sysctl.conf if postcopy live migration is required, that's not > particularly concerning constraint for qemu (there are likely other > tweaks required and it looks less risky than an arbitrary timeout > which could kill the VM: if the above is forgotten the postcopy live > migration won't even start and it'll be unnoticeable to the guest). > > The main concern really are future apps that may want to use uffd for > kernel initiated faults won't be allowed to do so by default anymore, > those apps will be heavily incentivated to use bounce buffers before > passing data to syscalls, similarly to the current use case of patch 2/2. > > Comments welcome, > Andrea > > PS. Another usage of uffd that remains possible without privilege with > the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV > acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib > will remain possible without requiring extra privileges, by adding a > UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS > (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be > indistinguishable to UFFDIO_ZEROPAGE but it will solve the last > performance bottleneck by avoiding a wrprotect fault after the > allocation and it will be THP capable too). Memory will be freed with > MADV_DONTNEED, without ever having to call mmap/mumap. It could move > memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP > which already exists. > UFFDIO_POPULATE sounds like a really useful feature. I don't see it in the kernel yet. Is there a patch under work on this? If so, kindly share.
On Fri, Sep 4, 2020 at 5:36 PM Lokesh Gidra <lokeshgidra@google.com> wrote: > > On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > 1) why don't you enforce the block of kernel initiated faults with > > seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just > > an optimization to remove a few instructions per syscall in the bpf > > execution of Android unprivileged apps? You should block a lot of > > other syscalls by default to all unprivileged processes, including > > vmsplice. > > > > In other words if it's just for Android, why can't Android solve it > > with only patch 1/2 by tweaking the seccomp filter? > > I would let Nick (nnk@) and Jeff (jeffv@) respond to this. > > The previous responses from both of them on this email thread > (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/ > and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/) > suggest that the performance overhead of seccomp-bpf is too much. Kees > also objected to it > (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/) > > I'm not familiar with how seccomp-bpf works. All that I can add here > is that userfaultfd syscall is usually not invoked in a performance > critical code path. So, if the performance overhead of seccomp-bpf (if > enabled) is observed on all syscalls originating from a process, then > I'd say patch 2/2 is essential. Otherwise, it should be ok to let > seccomp perform the same functionality instead. > There are two primary reasons why seccomp isn't viable here. 1) Seccomp was never designed for whole-of-system protections, and is impractical to deploy for anything other than "leaf" processes. 2) Attempts to enable seccomp on Android have run into performance problems, even for trivial seccomp filters. Let's go into each one. Issue #1: Seccomp was never designed for whole-of-system protections, and is impractical to deploy for anything other than "leaf" processes. Andrea suggests deploying a seccomp filter purely focused on Android unprivileged[1] (third party installed) apps. However, the intention is for this security control to be used system-wide[2]. Only processes which have a need for kernel initiated faults should be allowed to use them; all other processes should be denied by default. And when I say "all' processes, I mean "all" processes, even those which run with UID=0. Andrea's proposal is akin to a denylist, where many modern distributions (such as Android) use allowlists. The seemingly obvious solution is to apply a global seccomp filter in init (PID=1), but it falls down in practice. Seccomp is an incredibly useful tool, but it wasn't designed to be applied system-wide. Seccomp is fundamentally hierarchical in nature. A seccomp filter, once applied, cannot be subsequently relaxed or removed in child processes. While this restriction is great for leaf processes, it causes problems for OS designers - a parent process must maintain an unused capability if any process in the parent's process tree uses that capability. This makes applying a userfaultfd seccomp filter in init impossible, since we expect a few of init's children (but not init itself or most of init's children) to use userfaultfd kernel faults. We end up back to a wack-a-mole (denylist) problem of trying to modify each individual process to block userfaultfd kernel faults, defeating the goals of system-wide protection, and introducing significant complexity into the system design. Seccomp should be used in the context where it provides the most value -- process leaf nodes. But trying to apply seccomp as a system-wide control just isn't viable. Lokesh's sysctl proposal doesn't have these problems. When the sysctl is set to 2 by the OS distributor, all processes which don't have CAP_SYS_PTRACE are denied kernel generated faults, making the system safe-by-default. Only processes which are on the OS distributor's CAP_SYS_PTRACE allowlist (see Android's allowlist at [3]) can generate these faults, and capabilities can be managed without regards to process hierarchy. This keeps the system minimally privileged and safe. Seccomp isn't a viable solution here. Issue #2: Attempts to enable seccomp on Android globally have run into performance problems, even for trivial seccomp filters. Android has tried a few times to enable seccomp globally, but even excluding the above-mentioned hierarchical process problems, we've seen performance regressions across the board. Imposing a seccomp filter merely for userfaultfd imposes a tax on every syscall, even if the process never makes use of userfaultfd. Lokesh's sysctl proposal avoids this tax and places the check where it's most effective, with the rest of the userfaultfd functionality. See also the threads that Lokesh mentioned above: * https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/ * https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/ * https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/ Thanks, -- Nick [1] The use of the term "unprivileged" is unfortunate. In Android, there's no coarse-grain privileged vs unprivileged process. Each process, including root processes, have only the privileges they need, and not a bit more. As a concrete example, Android's init process (PID=1) is not allowed to open TCP/UDP sockets, but is allowed to spawn children which can do so. Having each process be differently privileged, and ensuring that functionality is only given out on a need-to-have basis, is an important part of modern OS design. [2] The trend in modern exploits isn't to perform attacks directly from untrusted code to the kernel. A lot of the attack surface needed by an attacker isn't reachable directly from untrusted code, but only indirectly through other processes. The attacker moves laterally through the system, exploiting a process which has the necessary capabilities, then escalating to the kernel. Enforcing security controls system-wide is an important part of denying an attacker the tools for an effective exploit and preventing this kind of lateral movement from being useful. Denying an attacker access to kernel initiated faults in userfaultfd system-wide (except for authorized processes) is doubly important, as these kinds of faults are extremely valuable to an exploit writer (see explanation at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cefdca0a86be517bc390fc4541e3674b8e7803b0 or https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit) [3] https://android.googlesource.com/platform/system/sepolicy/+/7be9e9e372c70a5518f729a0cdcb0d39a28be377/private/domain.te#107 line 107
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 0329a4d3fa9e..4296b508ab74 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability). The default value is 1. +unprivileged_userfaultfd_user_mode_only +======================================== + +This flag controls whether unprivileged users can use the userfaultfd +system calls to handle page faults in kernel mode. If set to zero, +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo +unprivileged_userfaultfd above. If set to one, users without +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd +to succeed. Prohibiting use of userfaultfd for handling faults from +kernel mode may make certain vulnerabilities more difficult +to exploit. + +The default value is 0. user_reserve_kbytes =================== diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index 21378abe8f7b..85cc1ab74361 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -29,6 +29,7 @@ #include <linux/hugetlb.h> int sysctl_unprivileged_userfaultfd __read_mostly = 1; +int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0; static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly; @@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags) static const int uffd_flags = UFFD_USER_MODE_ONLY; struct userfaultfd_ctx *ctx; int fd; + bool need_cap_check = false; - if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE)) + if (!sysctl_unprivileged_userfaultfd) + need_cap_check = true; + + if (sysctl_unprivileged_userfaultfd_user_mode_only && + (flags & UFFD_USER_MODE_ONLY) == 0) + need_cap_check = true; + + if (need_cap_check && !capable(CAP_SYS_PTRACE)) return -EPERM; BUG_ON(!current->mm); diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index a8e5f3ea9bb2..d81e30074bf5 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -31,6 +31,7 @@ #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS) extern int sysctl_unprivileged_userfaultfd; +extern int sysctl_unprivileged_userfaultfd_user_mode_only; extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason); diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 8a176d8727a3..9cbdf4483961 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, + { + .procname = "unprivileged_userfaultfd_user_mode_only", + .data = &sysctl_unprivileged_userfaultfd_user_mode_only, + .maxlen = sizeof(sysctl_unprivileged_userfaultfd_user_mode_only), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_ONE, + }, #endif { } };
This sysctl can be set to either zero or one. When zero (the default) the system lets all users call userfaultfd with or without UFFD_USER_MODE_ONLY, modulo other access controls. When unprivileged_userfaultfd_user_mode_only is set to one, users without CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API will fail with EPERM. This facility allows administrators to reduce the likelihood that an attacker with access to userfaultfd can delay faulting kernel code to widen timing windows for other exploits. Signed-off-by: Daniel Colascione <dancol@google.com> --- Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++ fs/userfaultfd.c | 11 ++++++++++- include/linux/userfaultfd_k.h | 1 + kernel/sysctl.c | 9 +++++++++ 4 files changed, 33 insertions(+), 1 deletion(-)