diff mbox series

[2/2] Add a new sysctl knob: unprivileged_userfaultfd_user_mode_only

Message ID 20200423002632.224776-3-dancol@google.com (mailing list archive)
State New, archived
Headers show
Series Control over userfaultfd kernel-fault handling | expand

Commit Message

Daniel Colascione April 23, 2020, 12:26 a.m. UTC
This sysctl can be set to either zero or one. When zero (the default)
the system lets all users call userfaultfd with or without
UFFD_USER_MODE_ONLY, modulo other access controls. When
unprivileged_userfaultfd_user_mode_only is set to one, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
will fail with EPERM. This facility allows administrators to reduce
the likelihood that an attacker with access to userfaultfd can delay
faulting kernel code to widen timing windows for other exploits.

Signed-off-by: Daniel Colascione <dancol@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
 fs/userfaultfd.c                        | 11 ++++++++++-
 include/linux/userfaultfd_k.h           |  1 +
 kernel/sysctl.c                         |  9 +++++++++
 4 files changed, 33 insertions(+), 1 deletion(-)

Comments

Peter Xu May 6, 2020, 7:38 p.m. UTC | #1
On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> +unprivileged_userfaultfd_user_mode_only
> +========================================
> +
> +This flag controls whether unprivileged users can use the userfaultfd
> +system calls to handle page faults in kernel mode.  If set to zero,
> +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
> +unprivileged_userfaultfd above.  If set to one, users without
> +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
> +to succeed.  Prohibiting use of userfaultfd for handling faults from
> +kernel mode may make certain vulnerabilities more difficult
> +to exploit.
> +
> +The default value is 0.

If this is going to be added... I am thinking whether it should be easier to
add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:

  "0": unprivileged userfaultfd forbidden
  "1": unprivileged userfaultfd allowed (both user/kernel faults)
  "2": unprivileged userfaultfd allowed (only user faults)

Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
(iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
as before ("1") then.

Thanks,
Jonathan Corbet May 7, 2020, 7:15 p.m. UTC | #2
On Wed, 6 May 2020 15:38:16 -0400
Peter Xu <peterx@redhat.com> wrote:

> If this is going to be added... I am thinking whether it should be easier to
> add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:
> 
>   "0": unprivileged userfaultfd forbidden
>   "1": unprivileged userfaultfd allowed (both user/kernel faults)
>   "2": unprivileged userfaultfd allowed (only user faults)
> 
> Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
> (iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
> as before ("1") then.

It occurs to me to wonder whether this interface should also let an admin
block *privileged* user from handling kernel-space faults?  In a
secure-boot/lockdown setting, this could be a hardening measure that keeps
a (somewhat) restricted root user from expanding their privilege...?

jon
Michael S. Tsirkin May 8, 2020, 4:52 p.m. UTC | #3
On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> This sysctl can be set to either zero or one. When zero (the default)
> the system lets all users call userfaultfd with or without
> UFFD_USER_MODE_ONLY, modulo other access controls. When
> unprivileged_userfaultfd_user_mode_only is set to one, users without
> CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> will fail with EPERM. This facility allows administrators to reduce
> the likelihood that an attacker with access to userfaultfd can delay
> faulting kernel code to widen timing windows for other exploits.
> 
> Signed-off-by: Daniel Colascione <dancol@google.com>

The approach taken looks like a hard-coded security policy.
For example, it won't be possible to set the sysctl knob
in question on any sytem running kvm. So this is
no good for any general purpose system.

What's wrong with using a security policy for this instead?



> ---
>  Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
>  fs/userfaultfd.c                        | 11 ++++++++++-
>  include/linux/userfaultfd_k.h           |  1 +
>  kernel/sysctl.c                         |  9 +++++++++
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 0329a4d3fa9e..4296b508ab74 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability).
>  
>  The default value is 1.
>  
> +unprivileged_userfaultfd_user_mode_only
> +========================================
> +
> +This flag controls whether unprivileged users can use the userfaultfd
> +system calls to handle page faults in kernel mode.  If set to zero,
> +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
> +unprivileged_userfaultfd above.  If set to one, users without
> +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
> +to succeed.  Prohibiting use of userfaultfd for handling faults from
> +kernel mode may make certain vulnerabilities more difficult
> +to exploit.
> +
> +The default value is 0.
>  
>  user_reserve_kbytes
>  ===================
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 21378abe8f7b..85cc1ab74361 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -29,6 +29,7 @@
>  #include <linux/hugetlb.h>
>  
>  int sysctl_unprivileged_userfaultfd __read_mostly = 1;
> +int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
>  
>  static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
>  
> @@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
>  	static const int uffd_flags = UFFD_USER_MODE_ONLY;
>  	struct userfaultfd_ctx *ctx;
>  	int fd;
> +	bool need_cap_check = false;
>  
> -	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
> +	if (!sysctl_unprivileged_userfaultfd)
> +		need_cap_check = true;
> +
> +	if (sysctl_unprivileged_userfaultfd_user_mode_only &&
> +	    (flags & UFFD_USER_MODE_ONLY) == 0)
> +		need_cap_check = true;
> +
> +	if (need_cap_check && !capable(CAP_SYS_PTRACE))
>  		return -EPERM;
>  
>  	BUG_ON(!current->mm);
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index a8e5f3ea9bb2..d81e30074bf5 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -31,6 +31,7 @@
>  #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
>  
>  extern int sysctl_unprivileged_userfaultfd;
> +extern int sysctl_unprivileged_userfaultfd_user_mode_only;
>  
>  extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8727a3..9cbdf4483961 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= SYSCTL_ONE,
>  	},
> +	{
> +		.procname	= "unprivileged_userfaultfd_user_mode_only",
> +		.data		= &sysctl_unprivileged_userfaultfd_user_mode_only,
> +		.maxlen		= sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},
>  #endif
>  	{ }
>  };
> -- 
> 2.26.2.303.gf8c07b1a785-goog
>
Michael S. Tsirkin May 8, 2020, 4:54 p.m. UTC | #4
On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote:
> On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> > This sysctl can be set to either zero or one. When zero (the default)
> > the system lets all users call userfaultfd with or without
> > UFFD_USER_MODE_ONLY, modulo other access controls. When
> > unprivileged_userfaultfd_user_mode_only is set to one, users without
> > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> > will fail with EPERM. This facility allows administrators to reduce
> > the likelihood that an attacker with access to userfaultfd can delay
> > faulting kernel code to widen timing windows for other exploits.
> > 
> > Signed-off-by: Daniel Colascione <dancol@google.com>
> 
> The approach taken looks like a hard-coded security policy.
> For example, it won't be possible to set the sysctl knob
> in question on any sytem running kvm. So this is
> no good for any general purpose system.
> 
> What's wrong with using a security policy for this instead?

In fact I see the original thread already mentions selinux,
so it's just a question of making this controllable by
selinux.

> 
> 
> > ---
> >  Documentation/admin-guide/sysctl/vm.rst | 13 +++++++++++++
> >  fs/userfaultfd.c                        | 11 ++++++++++-
> >  include/linux/userfaultfd_k.h           |  1 +
> >  kernel/sysctl.c                         |  9 +++++++++
> >  4 files changed, 33 insertions(+), 1 deletion(-)
> > 
> > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > index 0329a4d3fa9e..4296b508ab74 100644
> > --- a/Documentation/admin-guide/sysctl/vm.rst
> > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > @@ -850,6 +850,19 @@ privileged users (with SYS_CAP_PTRACE capability).
> >  
> >  The default value is 1.
> >  
> > +unprivileged_userfaultfd_user_mode_only
> > +========================================
> > +
> > +This flag controls whether unprivileged users can use the userfaultfd
> > +system calls to handle page faults in kernel mode.  If set to zero,
> > +userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
> > +unprivileged_userfaultfd above.  If set to one, users without
> > +SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
> > +to succeed.  Prohibiting use of userfaultfd for handling faults from
> > +kernel mode may make certain vulnerabilities more difficult
> > +to exploit.
> > +
> > +The default value is 0.
> >  
> >  user_reserve_kbytes
> >  ===================
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 21378abe8f7b..85cc1ab74361 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/hugetlb.h>
> >  
> >  int sysctl_unprivileged_userfaultfd __read_mostly = 1;
> > +int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
> >  
> >  static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
> >  
> > @@ -2009,8 +2010,16 @@ SYSCALL_DEFINE1(userfaultfd, int, flags)
> >  	static const int uffd_flags = UFFD_USER_MODE_ONLY;
> >  	struct userfaultfd_ctx *ctx;
> >  	int fd;
> > +	bool need_cap_check = false;
> >  
> > -	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
> > +	if (!sysctl_unprivileged_userfaultfd)
> > +		need_cap_check = true;
> > +
> > +	if (sysctl_unprivileged_userfaultfd_user_mode_only &&
> > +	    (flags & UFFD_USER_MODE_ONLY) == 0)
> > +		need_cap_check = true;
> > +
> > +	if (need_cap_check && !capable(CAP_SYS_PTRACE))
> >  		return -EPERM;
> >  
> >  	BUG_ON(!current->mm);
> > diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> > index a8e5f3ea9bb2..d81e30074bf5 100644
> > --- a/include/linux/userfaultfd_k.h
> > +++ b/include/linux/userfaultfd_k.h
> > @@ -31,6 +31,7 @@
> >  #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
> >  
> >  extern int sysctl_unprivileged_userfaultfd;
> > +extern int sysctl_unprivileged_userfaultfd_user_mode_only;
> >  
> >  extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
> >  
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a176d8727a3..9cbdf4483961 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1719,6 +1719,15 @@ static struct ctl_table vm_table[] = {
> >  		.extra1		= SYSCTL_ZERO,
> >  		.extra2		= SYSCTL_ONE,
> >  	},
> > +	{
> > +		.procname	= "unprivileged_userfaultfd_user_mode_only",
> > +		.data		= &sysctl_unprivileged_userfaultfd_user_mode_only,
> > +		.maxlen		= sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec_minmax,
> > +		.extra1		= SYSCTL_ZERO,
> > +		.extra2		= SYSCTL_ONE,
> > +	},
> >  #endif
> >  	{ }
> >  };
> > -- 
> > 2.26.2.303.gf8c07b1a785-goog
> >
Andrea Arcangeli May 20, 2020, 4:06 a.m. UTC | #5
Hello Jonathan and everyone,

On Thu, May 07, 2020 at 01:15:03PM -0600, Jonathan Corbet wrote:
> On Wed, 6 May 2020 15:38:16 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > If this is going to be added... I am thinking whether it should be easier to
> > add another value for unprivileged_userfaultfd, rather than a new sysctl. E.g.:
> > 
> >   "0": unprivileged userfaultfd forbidden
> >   "1": unprivileged userfaultfd allowed (both user/kernel faults)
> >   "2": unprivileged userfaultfd allowed (only user faults)
> > 
> > Because after all unprivileged_userfaultfd_user_mode_only will be meaningless
> > (iiuc) if unprivileged_userfaultfd=0.  The default value will also be the same
> > as before ("1") 
> It occurs to me to wonder whether this interface should also let an admin
> block *privileged* user from handling kernel-space faults?  In a
> secure-boot/lockdown setting, this could be a hardening measure that keeps
> a (somewhat) restricted root user from expanding their privilege...?

That's a good question. In my view if as root in lockdown mode you can
still run the swapon syscall and setup nfs or other network devices
and load userland fuse filesystems or cuse chardev in userland, even
if you prevent userfaultfd from blocking kernel faults, kernel faults
can still be blocked by other means.

That in fact tends to be true also as non root (so regardless of
lockdown settings) since luser can generally load fuse filesystems.
There is no fundamental integrity breakage or privilege escalation
originating in userfaultfd.

The only concern here is about this: "after a new use-after-free is
discovered in some other part of the kernel (not related to
userfaultfd), how easy it is to turn the use-after-free from a mere
DoS to a more concerning privilege escalation?". userfaultfd might
facilitate the exploitation, but even if you remove userfaultfd from
the equation, there's still no guarantee an user-after-free won't
materialize as a privilege escalation by other means.

So to express it in another way: unless lockdown (no matter in which
mode) is a weak probabilistic based feature and in turn it cannot
provide any guarantee to begin with, userfaultfd sysctl set to 0|1|2
can't possibly make any difference to it.

The best mitigation for those kind of exploits remains to randomize
all kernel memory allocations, so even if the attacker can block the
fault, when it's unblocked it'll pick another page, not the one that
the attacker can predict it will use, so the attacker needs to repeat
the race many more times and hopefully it'll DoS and destabilize the
kernel before it can reproduce a privilege escalation. We got many of
those randomization features in the current kernel and it's probably
more important to enable those than to worry about this sysctl value.

One way to have a peace of mind against all use-after-free regardless
of this sysctl value, is to run each pod in a KVM instance, that's
safer than disabling syscalls or kernel features.

The default seccomp profiles of podman already block userfaultfd too,
so there's no need of virt to get extra safety if you use containers:
containers need to explicitly opt-in to enable userfaultfd through the
OCI schema seccomp object. If userfaultfd is being explicitly
whitelisted in the OCI schema of the container, well then you know
there is a good reason for it. As a matter of fact some things are
only possible to achieve with userfaultfd fully enabled.

The big value uffd brings compared to trapping sigsegv is precisely to
be able to handle kernel faults transparently. sigsegv can't do that
because every syscall would return 1) an inconsistent retval and 2) no
fault address along with the retval.

The possible future uffd userland users could be: dropping JVM dirty
bit, redis snapshot using pthread_create() instead of fork(),
distributed shared memory on pmem, new malloc() implementation never
taking mmap_sem for writing in the kernel and never modifying any vma
to allocate and free anon memory, etc.. I don't think any of them
would work with the sysctl set to "2".

The next kernel feature in uffd land that I was discussing with Peter,
is an async uffd event model to further optimize the replacement of
soft-dirty (which uffd already provides in O(1) instead of O(N)), so
the wrprotect fault won't have to block anymore until the uffd async
queue overflows. That also is unlikely to work with the sysctl set to
"2" without adding extra constraints that soft-dirty doesn't currently
have.

It would also be possible to implement the value "2" to work like
/proc/sys/kernel/unprivileged_bpf_disabled, so when you set it to "1"
as root, you can't set it to "2" or "0" and when you set it to "2" you
can't set it to "0", but personally I think it's unnecessary.

Thanks,
Andrea
Andrea Arcangeli May 20, 2020, 4:59 a.m. UTC | #6
Hello everyone,

On Fri, May 08, 2020 at 12:54:03PM -0400, Michael S. Tsirkin wrote:
> On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote:
> > On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> > > This sysctl can be set to either zero or one. When zero (the default)
> > > the system lets all users call userfaultfd with or without
> > > UFFD_USER_MODE_ONLY, modulo other access controls. When
> > > unprivileged_userfaultfd_user_mode_only is set to one, users without
> > > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> > > will fail with EPERM. This facility allows administrators to reduce
> > > the likelihood that an attacker with access to userfaultfd can delay
> > > faulting kernel code to widen timing windows for other exploits.
> > > 
> > > Signed-off-by: Daniel Colascione <dancol@google.com>
> > 
> > The approach taken looks like a hard-coded security policy.
> > For example, it won't be possible to set the sysctl knob
> > in question on any sytem running kvm. So this is
> > no good for any general purpose system.
> > 
> > What's wrong with using a security policy for this instead?
> 
> In fact I see the original thread already mentions selinux,
> so it's just a question of making this controllable by
> selinux.

I agree it'd be preferable if it was not hardcoded, but then this
patchset is also much simpler than the previous controlling it through
selinux..

I was thinking, an alternative policy that could control it without
hard-coding it, is a seccomp-bpf filter, then you can drop 2/2 as
well, not just 1/6-4/6.

If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
always called with flags==0x1 without requiring extra modifications in
the kernel?

Can't you get the feature party with the CAP_SYS_PTRACE capability
too, if you don't wrap those tasks with the ptrace capability under
that seccomp filter?

As far as I can tell, it's unprecedented to create a flag for a
syscall API, with the only purpose of implementing a seccomp-bpf
filter verifying such flag is set, but then if you want to control it
with LSM it's even more complex than doing it with seccomp-bpf, and it
requires more kernel code too. We could always add 2/2 later, such
possibility won't disappear, in fact we could also add 1/6-4/6 later
too if that is not enough.

If we could begin by merging only 1/2 from this new series and be done
with the kernel changes, because we offload the rest of the work to
the kernel eBPF JIT, I think it'd be ideal.

Thanks,
Andrea
Kees Cook May 20, 2020, 6:03 p.m. UTC | #7
On Wed, May 20, 2020 at 12:59:38AM -0400, Andrea Arcangeli wrote:
> Hello everyone,
> 
> On Fri, May 08, 2020 at 12:54:03PM -0400, Michael S. Tsirkin wrote:
> > On Fri, May 08, 2020 at 12:52:34PM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Apr 22, 2020 at 05:26:32PM -0700, Daniel Colascione wrote:
> > > > This sysctl can be set to either zero or one. When zero (the default)
> > > > the system lets all users call userfaultfd with or without
> > > > UFFD_USER_MODE_ONLY, modulo other access controls. When
> > > > unprivileged_userfaultfd_user_mode_only is set to one, users without
> > > > CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY to userfaultfd or the API
> > > > will fail with EPERM. This facility allows administrators to reduce
> > > > the likelihood that an attacker with access to userfaultfd can delay
> > > > faulting kernel code to widen timing windows for other exploits.
> > > > 
> > > > Signed-off-by: Daniel Colascione <dancol@google.com>
> > > 
> > > The approach taken looks like a hard-coded security policy.
> > > For example, it won't be possible to set the sysctl knob
> > > in question on any sytem running kvm. So this is
> > > no good for any general purpose system.

Not all systems run unprivileged KVM. :)

> > > What's wrong with using a security policy for this instead?
> > 
> > In fact I see the original thread already mentions selinux,
> > so it's just a question of making this controllable by
> > selinux.
> 
> I agree it'd be preferable if it was not hardcoded, but then this
> patchset is also much simpler than the previous controlling it through
> selinux..
> 
> I was thinking, an alternative policy that could control it without
> hard-coding it, is a seccomp-bpf filter, then you can drop 2/2 as
> well, not just 1/6-4/6.

Err, did I miss a separate 6-patch series? I can't find anything on lore.

> 
> If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
> always called with flags==0x1 without requiring extra modifications in
> the kernel?

Please no. This is way too much overhead for something that a system
owner wants to enforce globally. A sysctl is the correct option here,
IMO. If it needs to be a per-userns sysctl, that would be fine too.

> Can't you get the feature party with the CAP_SYS_PTRACE capability
> too, if you don't wrap those tasks with the ptrace capability under
> that seccomp filter?
> 
> As far as I can tell, it's unprecedented to create a flag for a
> syscall API, with the only purpose of implementing a seccomp-bpf
> filter verifying such flag is set, but then if you want to control it
> with LSM it's even more complex than doing it with seccomp-bpf, and it
> requires more kernel code too. We could always add 2/2 later, such
> possibility won't disappear, in fact we could also add 1/6-4/6 later
> too if that is not enough.
> 
> If we could begin by merging only 1/2 from this new series and be done
> with the kernel changes, because we offload the rest of the work to
> the kernel eBPF JIT, I think it'd be ideal.

I'd agree that patch 1 should land, as it appears to be required for any
further policy considerations. I'm still a big fan of a sysctl since
this is the kind of thing I would absolutely turn on globally for all my
systems.
Andrea Arcangeli May 20, 2020, 7:48 p.m. UTC | #8
Hello Kees,

On Wed, May 20, 2020 at 11:03:39AM -0700, Kees Cook wrote:
> Err, did I miss a separate 6-patch series? I can't find anything on lore.

Daniel included the link of the previous series I referred to is the
cover letter 0/2:

https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/

> > If you keep only 1/2, can't seccomp-bpf enforce userfaultfd to be
> > always called with flags==0x1 without requiring extra modifications in
> > the kernel?
> 
> Please no. This is way too much overhead for something that a system
> owner wants to enforce globally. A sysctl is the correct option here,
> IMO. If it needs to be a per-userns sysctl, that would be fine too.

The question is who could be this system owner who prefers "2" to "0"?

per-ns I don't see the point either when all containers already run
with default policies enforcing the same behavior as if the sysctl is
set to "0".

Why exactly is it preferable to enlarge the surface of attack of the
kernel and take the risk there is a real bug in userfaultfd code (not
just a facilitation of exploiting some other kernel bug) that leads to
a privilege escalation, when you still break 99% of userfaultfd users,
if you set with option "2"?

Is the system owner really going to purely run on his systems CRIU
postcopy live migration (which already runs with CAP_SYS_PTRACE) and
nothing else that could break?

Option "2" to me looks with a single possible user, and incidentally
this single user can already enforce model "2" by only tweaking its
seccomp-bpf filters without applying 2/2. It'd be a bug if android
apps runs unprotected by seccomp regardless of 2/2.

System owners I think would be better of to stick to "0" or "1" a far
as I can tell, "2" looks a bad tradeoff as system value, with nearly
all cons of "0", but less secure than "0".

> I'd agree that patch 1 should land, as it appears to be required for any

Agreed about merging 1/2.

> further policy considerations. I'm still a big fan of a sysctl since
> this is the kind of thing I would absolutely turn on globally for all my
> systems.

The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there
upstream and you should have already set it to "0" in all your systems
if you can cope with some app features not working.

2/2 as modified with Peter's suggestion, would add a new value "2" (in
addition of "1" and "0"), that still breaks the majority of all
possible users, just like value "0", but that gives less security than
value "0".

It all boils down of how peculiar it is to be able to leverage only
the acceleration (reduction in context switches and enter/exit kernel)
and the vma fragmentation avoidance (to avoid running of vmas in
/proc/sys/vm/max_map_count) provided by userfaultfd (vs sigsegv
trapping) but not the transparency in handling faults in kernel (which
sigsegv can't possibly achieve).

Right now there's a single user that can cope with that limitation,
and it's not your running on your servers but on your phone. Even CRIU
cannot cope with such limitation, the only reason it would cope with
value "2" is that it already runs with CAP_SYS_PTRACE for other
reasons.

If there will be more users that can cope with handling only user
initiated page faults, then yes, I think it'd be fine to add a value
"2" later.

The other benefit of enforcing the policy with seccomp-bpf if that if
you'd run Android userland on a container on a ARM server on top of an
enterprise kernel, it'd already run as safe as if the sysctl value was
tweaked to "2", but without having to also add extra kernel code for
per-ns sysctl. It just looks simpler. The rest of the containers on
the same host are already running today as if the sysctl is set to 0
regardless of this patchset.

If you want to enforce maximum security and override any possible
opt-out of the default podman seccomp profile that blocks userfaultfd,
you already can upstream with the global sysctl by setting it to "0".

Thanks,
Andrea
Andrea Arcangeli May 20, 2020, 7:51 p.m. UTC | #9
On Wed, May 20, 2020 at 03:48:04PM -0400, Andrea Arcangeli wrote:
> The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there

Oops I picked the wrong unprivileged_* :) of course I meant:
/proc/sys/vm/unprivileged_userfaultfd
Lokesh Gidra May 20, 2020, 8:17 p.m. UTC | #10
Adding the Android kernel team in the discussion.

On Wed, May 20, 2020 at 12:51 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Wed, May 20, 2020 at 03:48:04PM -0400, Andrea Arcangeli wrote:
> > The sysctl /proc/sys/kernel/unprivileged_bpf_disabled is already there
>
> Oops I picked the wrong unprivileged_* :) of course I meant:
> /proc/sys/vm/unprivileged_userfaultfd
>
Andrea Arcangeli May 20, 2020, 9:16 p.m. UTC | #11
On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote:
> Adding the Android kernel team in the discussion.

Unless I'm mistaken that you can already enforce bit 1 of the second
parameter of the userfaultfd syscall to be set with seccomp-bpf, this
would be more a question to the Android userland team.

The question would be: does it ever happen that a seccomp filter isn't
already applied to unprivileged software running without
SYS_CAP_PTRACE capability?

If answer is "no" the behavior of the new sysctl in patch 2/2 (in
subject) should be enforceable with minor changes to the BPF
assembly. Otherwise it'd require more changes.

Thanks!
Andrea
Jeffrey Vander Stoep July 17, 2020, 12:57 p.m. UTC | #12
On Wed, May 20, 2020 at 11:17 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote:
> > Adding the Android kernel team in the discussion.
>
> Unless I'm mistaken that you can already enforce bit 1 of the second
> parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> would be more a question to the Android userland team.
>
> The question would be: does it ever happen that a seccomp filter isn't
> already applied to unprivileged software running without
> SYS_CAP_PTRACE capability?

Yes.

Android uses selinux as our primary sandboxing mechanism. We do use
seccomp on a few processes, but we have found that it has a
surprisingly high performance cost [1] on arm64 devices so turning it
on system wide is not a good option.

[1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
>
>
> If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> subject) should be enforceable with minor changes to the BPF
> assembly. Otherwise it'd require more changes.
>
> Thanks!
> Andrea
>
Lokesh Gidra July 23, 2020, 5:30 p.m. UTC | #13
Daniel, the original contributor of this patchset, has moved to
another company. Adding his personal email, in case he still wants to
be involved.

From the discussion so far it seems that there is a consensus that
patch 1/2 in this series should be upstreamed in any case. Is there
anything that is pending on that patch?

On Fri, Jul 17, 2020 at 5:57 AM Jeffrey Vander Stoep <jeffv@google.com> wrote:
>
> On Wed, May 20, 2020 at 11:17 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > On Wed, May 20, 2020 at 01:17:20PM -0700, Lokesh Gidra wrote:
> > > Adding the Android kernel team in the discussion.
> >
> > Unless I'm mistaken that you can already enforce bit 1 of the second
> > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > would be more a question to the Android userland team.
> >
> > The question would be: does it ever happen that a seccomp filter isn't
> > already applied to unprivileged software running without
> > SYS_CAP_PTRACE capability?
>
> Yes.
>
> Android uses selinux as our primary sandboxing mechanism. We do use
> seccomp on a few processes, but we have found that it has a
> surprisingly high performance cost [1] on arm64 devices so turning it
> on system wide is not a good option.
>
> [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> >
> >
> > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > subject) should be enforceable with minor changes to the BPF
> > assembly. Otherwise it'd require more changes.
> >
Adding Nick (Jeff is already here) to respond to Andrea's concerns
about adding option '2' to sysctl knob.

> > Thanks!
> > Andrea
> >
Nick Kralevich July 24, 2020, 12:13 a.m. UTC | #14
On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> From the discussion so far it seems that there is a consensus that
> patch 1/2 in this series should be upstreamed in any case. Is there
> anything that is pending on that patch?

That's my reading of this thread too.

> > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > would be more a question to the Android userland team.
> > >
> > > The question would be: does it ever happen that a seccomp filter isn't
> > > already applied to unprivileged software running without
> > > SYS_CAP_PTRACE capability?
> >
> > Yes.
> >
> > Android uses selinux as our primary sandboxing mechanism. We do use
> > seccomp on a few processes, but we have found that it has a
> > surprisingly high performance cost [1] on arm64 devices so turning it
> > on system wide is not a good option.
> >
> > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad

As Jeff mentioned, seccomp is used strategically on Android, but is
not applied to all processes. It's too expensive and impractical when
simpler implementations (such as this sysctl) can exist. It's also
significantly simpler to test a sysctl value for correctness as
opposed to a seccomp filter.

> > >
> > >
> > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > subject) should be enforceable with minor changes to the BPF
> > > assembly. Otherwise it'd require more changes.

It would be good to understand what these changes are.

> > > Why exactly is it preferable to enlarge the surface of attack of the
> > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > just a facilitation of exploiting some other kernel bug) that leads to
> > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > if you set with option "2"?

I can see your point if you think about the feature as a whole.
However, distributions (such as Android) have specialized knowledge of
their security environments, and may not want to support the typical
usages of userfaultfd. For such distributions, providing a mechanism
to prevent userfaultfd from being useful as an exploit primitive,
while still allowing the very limited use of userfaultfd for userspace
faults only, is desirable. Distributions shouldn't be forced into
supporting 100% of the use cases envisioned by userfaultfd when their
needs may be more specialized, and this sysctl knob empowers
distributions to make this choice for themselves.

> > > Is the system owner really going to purely run on his systems CRIU
> > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > nothing else that could break?

This is a great example of a capability which a distribution may not
want to support, due to distribution specific security policies.

> > >
> > > Option "2" to me looks with a single possible user, and incidentally
> > > this single user can already enforce model "2" by only tweaking its
> > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > apps runs unprotected by seccomp regardless of 2/2.

Can you elaborate on what bug is present by processes being
unprotected by seccomp?

Seccomp cannot be universally applied on Android due to previously
mentioned performance concerns. Seccomp is used in Android primarily
as a tool to enforce the list of allowed syscalls, so that such
syscalls can be audited before being included as part of the Android
API.

-- Nick
Michael S. Tsirkin July 24, 2020, 1:40 p.m. UTC | #15
On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > From the discussion so far it seems that there is a consensus that
> > patch 1/2 in this series should be upstreamed in any case. Is there
> > anything that is pending on that patch?
> 
> That's my reading of this thread too.
> 
> > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > would be more a question to the Android userland team.
> > > >
> > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > already applied to unprivileged software running without
> > > > SYS_CAP_PTRACE capability?
> > >
> > > Yes.
> > >
> > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > seccomp on a few processes, but we have found that it has a
> > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > on system wide is not a good option.
> > >
> > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> 
> As Jeff mentioned, seccomp is used strategically on Android, but is
> not applied to all processes. It's too expensive and impractical when
> simpler implementations (such as this sysctl) can exist. It's also
> significantly simpler to test a sysctl value for correctness as
> opposed to a seccomp filter.

Given that selinux is already used system-wide on Android, what is wrong
with using selinux to control userfaultfd as opposed to seccomp?


> > > >
> > > >
> > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > subject) should be enforceable with minor changes to the BPF
> > > > assembly. Otherwise it'd require more changes.
> 
> It would be good to understand what these changes are.
> 
> > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > if you set with option "2"?
> 
> I can see your point if you think about the feature as a whole.
> However, distributions (such as Android) have specialized knowledge of
> their security environments, and may not want to support the typical
> usages of userfaultfd. For such distributions, providing a mechanism
> to prevent userfaultfd from being useful as an exploit primitive,
> while still allowing the very limited use of userfaultfd for userspace
> faults only, is desirable. Distributions shouldn't be forced into
> supporting 100% of the use cases envisioned by userfaultfd when their
> needs may be more specialized, and this sysctl knob empowers
> distributions to make this choice for themselves.
> 
> > > > Is the system owner really going to purely run on his systems CRIU
> > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > nothing else that could break?
> 
> This is a great example of a capability which a distribution may not
> want to support, due to distribution specific security policies.
> 
> > > >
> > > > Option "2" to me looks with a single possible user, and incidentally
> > > > this single user can already enforce model "2" by only tweaking its
> > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > apps runs unprotected by seccomp regardless of 2/2.
> 
> Can you elaborate on what bug is present by processes being
> unprotected by seccomp?
> 
> Seccomp cannot be universally applied on Android due to previously
> mentioned performance concerns. Seccomp is used in Android primarily
> as a tool to enforce the list of allowed syscalls, so that such
> syscalls can be audited before being included as part of the Android
> API.
> 
> -- Nick
> 
> -- 
> Nick Kralevich | nnk@google.com
Nick Kralevich Aug. 6, 2020, 12:43 a.m. UTC | #16
On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > From the discussion so far it seems that there is a consensus that
> > > patch 1/2 in this series should be upstreamed in any case. Is there
> > > anything that is pending on that patch?
> >
> > That's my reading of this thread too.
> >
> > > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > > would be more a question to the Android userland team.
> > > > >
> > > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > > already applied to unprivileged software running without
> > > > > SYS_CAP_PTRACE capability?
> > > >
> > > > Yes.
> > > >
> > > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > > seccomp on a few processes, but we have found that it has a
> > > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > > on system wide is not a good option.
> > > >
> > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> >
> > As Jeff mentioned, seccomp is used strategically on Android, but is
> > not applied to all processes. It's too expensive and impractical when
> > simpler implementations (such as this sysctl) can exist. It's also
> > significantly simpler to test a sysctl value for correctness as
> > opposed to a seccomp filter.
>
> Given that selinux is already used system-wide on Android, what is wrong
> with using selinux to control userfaultfd as opposed to seccomp?

Userfaultfd file descriptors will be generally controlled by SELinux.
You can see the patchset at
https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/
(which is also referenced in the original commit message for this
patchset). However, the SELinux patchset doesn't include the ability
to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.

SELinux already has the ability to control who gets CAP_SYS_PTRACE,
which combined with this patch, is largely equivalent to direct
UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
above, movement of userfaultfd file descriptors can be mediated by
SELinux, preventing one process from acquiring userfaultfd descriptors
of other processes unless allowed by security policy.

It's an interesting question whether finer-grain SELinux support for
controlling UFFD_USER_MODE_ONLY should be added. I can see some
advantages to implementing this. However, we don't need to decide that
now.

Kernel security checks generally break down into DAC (discretionary
access control) and MAC (mandatory access control) controls. Most
kernel security features check via both of these mechanisms. Security
attributes of the system should be settable without necessarily
relying on an LSM such as SELinux. This patch follows the same basic
model -- system wide control of a hardening feature is provided by the
unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
SELinux support for this can also be implemented on top of the DAC
controls.

This DAC/MAC split has been successful in several other security
features. For example, the ability to map at page zero is controlled
in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
mmap_zero access vector [2]. Similarly, access to the kernel ring
buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict
sysctl is very similar to this patch -- it introduces a capability
(CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.

If we want to ensure that a security feature will be well tested and
vetted, it's important to not limit its use to LSMs only. This ensures
that kernel and application developers will always be able to test the
effects of a security feature, without relying on LSMs like SELinux.
It also ensures that all distributions can enable this security
mitigation should it be necessary for their unique environments,
without introducing an SELinux dependency. And this patch does not
preclude an SELinux implementation should it be necessary.

Even if we decide to implement fine-grain SELinux controls on
UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
an either/or choice between SELinux and this patch. Both are
necessary.

-- Nick

[1] https://wiki.debian.org/mmap_min_addr
[2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
[3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

>
>
> > > > >
> > > > >
> > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > > subject) should be enforceable with minor changes to the BPF
> > > > > assembly. Otherwise it'd require more changes.
> >
> > It would be good to understand what these changes are.
> >
> > > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > > if you set with option "2"?
> >
> > I can see your point if you think about the feature as a whole.
> > However, distributions (such as Android) have specialized knowledge of
> > their security environments, and may not want to support the typical
> > usages of userfaultfd. For such distributions, providing a mechanism
> > to prevent userfaultfd from being useful as an exploit primitive,
> > while still allowing the very limited use of userfaultfd for userspace
> > faults only, is desirable. Distributions shouldn't be forced into
> > supporting 100% of the use cases envisioned by userfaultfd when their
> > needs may be more specialized, and this sysctl knob empowers
> > distributions to make this choice for themselves.
> >
> > > > > Is the system owner really going to purely run on his systems CRIU
> > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > > nothing else that could break?
> >
> > This is a great example of a capability which a distribution may not
> > want to support, due to distribution specific security policies.
> >
> > > > >
> > > > > Option "2" to me looks with a single possible user, and incidentally
> > > > > this single user can already enforce model "2" by only tweaking its
> > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > > apps runs unprotected by seccomp regardless of 2/2.
> >
> > Can you elaborate on what bug is present by processes being
> > unprotected by seccomp?
> >
> > Seccomp cannot be universally applied on Android due to previously
> > mentioned performance concerns. Seccomp is used in Android primarily
> > as a tool to enforce the list of allowed syscalls, so that such
> > syscalls can be audited before being included as part of the Android
> > API.
> >
> > -- Nick
> >
> > --
> > Nick Kralevich | nnk@google.com
>
Michael S. Tsirkin Aug. 6, 2020, 5:44 a.m. UTC | #17
On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote:
> On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > From the discussion so far it seems that there is a consensus that
> > > > patch 1/2 in this series should be upstreamed in any case. Is there
> > > > anything that is pending on that patch?
> > >
> > > That's my reading of this thread too.
> > >
> > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > > > would be more a question to the Android userland team.
> > > > > >
> > > > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > > > already applied to unprivileged software running without
> > > > > > SYS_CAP_PTRACE capability?
> > > > >
> > > > > Yes.
> > > > >
> > > > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > > > seccomp on a few processes, but we have found that it has a
> > > > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > > > on system wide is not a good option.
> > > > >
> > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> > >
> > > As Jeff mentioned, seccomp is used strategically on Android, but is
> > > not applied to all processes. It's too expensive and impractical when
> > > simpler implementations (such as this sysctl) can exist. It's also
> > > significantly simpler to test a sysctl value for correctness as
> > > opposed to a seccomp filter.
> >
> > Given that selinux is already used system-wide on Android, what is wrong
> > with using selinux to control userfaultfd as opposed to seccomp?
> 
> Userfaultfd file descriptors will be generally controlled by SELinux.
> You can see the patchset at
> https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/
> (which is also referenced in the original commit message for this
> patchset). However, the SELinux patchset doesn't include the ability
> to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.
> 
> SELinux already has the ability to control who gets CAP_SYS_PTRACE,
> which combined with this patch, is largely equivalent to direct
> UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
> above, movement of userfaultfd file descriptors can be mediated by
> SELinux, preventing one process from acquiring userfaultfd descriptors
> of other processes unless allowed by security policy.
> 
> It's an interesting question whether finer-grain SELinux support for
> controlling UFFD_USER_MODE_ONLY should be added. I can see some
> advantages to implementing this. However, we don't need to decide that
> now.
>
> Kernel security checks generally break down into DAC (discretionary
> access control) and MAC (mandatory access control) controls. Most
> kernel security features check via both of these mechanisms. Security
> attributes of the system should be settable without necessarily
> relying on an LSM such as SELinux. This patch follows the same basic
> model -- system wide control of a hardening feature is provided by the
> unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
> SELinux support for this can also be implemented on top of the DAC
> controls.
> 
> This DAC/MAC split has been successful in several other security
> features. For example, the ability to map at page zero is controlled
> in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
> mmap_zero access vector [2]. Similarly, access to the kernel ring
> buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
> well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict
> sysctl is very similar to this patch -- it introduces a capability
> (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.
> 
> If we want to ensure that a security feature will be well tested and
> vetted, it's important to not limit its use to LSMs only. This ensures
> that kernel and application developers will always be able to test the
> effects of a security feature, without relying on LSMs like SELinux.
> It also ensures that all distributions can enable this security
> mitigation should it be necessary for their unique environments,
> without introducing an SELinux dependency. And this patch does not
> preclude an SELinux implementation should it be necessary.
> 
> Even if we decide to implement fine-grain SELinux controls on
> UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
> an either/or choice between SELinux and this patch. Both are
> necessary.
> 
> -- Nick
> 
> [1] https://wiki.debian.org/mmap_min_addr
> [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
> [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

I am not sure I agree this is similar to dmesg access.

The reason I say it is this: it is pretty easy for admins to know
whether they run something that needs to access the kernel ring buffer.
Or if it's a tool developer poking at dmesg, they can tell admins "we
need these permissions".  But it seems impossible for either an admin to
know that a userfaultfd page e.g. used with shared memory is accessed
from the kernel.

So I guess the question is: how does anyone not running Android
know to set this flag?

I got the feeling it's not really possible, and so for a single-user
feature like this a single API seems enough.  Given a choice between a
knob an admin is supposed to set and selinux policy written by
presumably knowledgeable OS vendors, I'd opt for a second option.

Hope this helps.

> >
> >
> > > > > >
> > > > > >
> > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > > > subject) should be enforceable with minor changes to the BPF
> > > > > > assembly. Otherwise it'd require more changes.
> > >
> > > It would be good to understand what these changes are.
> > >
> > > > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > > > if you set with option "2"?
> > >
> > > I can see your point if you think about the feature as a whole.
> > > However, distributions (such as Android) have specialized knowledge of
> > > their security environments, and may not want to support the typical
> > > usages of userfaultfd. For such distributions, providing a mechanism
> > > to prevent userfaultfd from being useful as an exploit primitive,
> > > while still allowing the very limited use of userfaultfd for userspace
> > > faults only, is desirable. Distributions shouldn't be forced into
> > > supporting 100% of the use cases envisioned by userfaultfd when their
> > > needs may be more specialized, and this sysctl knob empowers
> > > distributions to make this choice for themselves.
> > >
> > > > > > Is the system owner really going to purely run on his systems CRIU
> > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > > > nothing else that could break?
> > >
> > > This is a great example of a capability which a distribution may not
> > > want to support, due to distribution specific security policies.
> > >
> > > > > >
> > > > > > Option "2" to me looks with a single possible user, and incidentally
> > > > > > this single user can already enforce model "2" by only tweaking its
> > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > > > apps runs unprotected by seccomp regardless of 2/2.
> > >
> > > Can you elaborate on what bug is present by processes being
> > > unprotected by seccomp?
> > >
> > > Seccomp cannot be universally applied on Android due to previously
> > > mentioned performance concerns. Seccomp is used in Android primarily
> > > as a tool to enforce the list of allowed syscalls, so that such
> > > syscalls can be audited before being included as part of the Android
> > > API.
> > >
> > > -- Nick
> > >
> > > --
> > > Nick Kralevich | nnk@google.com
> >
> 
> 
> -- 
> Nick Kralevich | nnk@google.com
Lokesh Gidra Aug. 17, 2020, 10:11 p.m. UTC | #18
On Wed, Aug 5, 2020 at 10:44 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Aug 05, 2020 at 05:43:02PM -0700, Nick Kralevich wrote:
> > On Fri, Jul 24, 2020 at 6:40 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Jul 23, 2020 at 05:13:28PM -0700, Nick Kralevich wrote:
> > > > On Thu, Jul 23, 2020 at 10:30 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > > > From the discussion so far it seems that there is a consensus that
> > > > > patch 1/2 in this series should be upstreamed in any case. Is there
> > > > > anything that is pending on that patch?
> > > >
> > > > That's my reading of this thread too.
> > > >
> > > > > > > Unless I'm mistaken that you can already enforce bit 1 of the second
> > > > > > > parameter of the userfaultfd syscall to be set with seccomp-bpf, this
> > > > > > > would be more a question to the Android userland team.
> > > > > > >
> > > > > > > The question would be: does it ever happen that a seccomp filter isn't
> > > > > > > already applied to unprivileged software running without
> > > > > > > SYS_CAP_PTRACE capability?
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > Android uses selinux as our primary sandboxing mechanism. We do use
> > > > > > seccomp on a few processes, but we have found that it has a
> > > > > > surprisingly high performance cost [1] on arm64 devices so turning it
> > > > > > on system wide is not a good option.
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-security-module/202006011116.3F7109A@keescook/T/#m82ace19539ac595682affabdf652c0ffa5d27dad
> > > >
> > > > As Jeff mentioned, seccomp is used strategically on Android, but is
> > > > not applied to all processes. It's too expensive and impractical when
> > > > simpler implementations (such as this sysctl) can exist. It's also
> > > > significantly simpler to test a sysctl value for correctness as
> > > > opposed to a seccomp filter.
> > >
> > > Given that selinux is already used system-wide on Android, what is wrong
> > > with using selinux to control userfaultfd as opposed to seccomp?
> >
> > Userfaultfd file descriptors will be generally controlled by SELinux.
> > You can see the patchset at
> > https://lore.kernel.org/lkml/20200401213903.182112-3-dancol@google.com/
> > (which is also referenced in the original commit message for this
> > patchset). However, the SELinux patchset doesn't include the ability
> > to control FAULT_FLAG_USER / UFFD_USER_MODE_ONLY directly.
> >
> > SELinux already has the ability to control who gets CAP_SYS_PTRACE,
> > which combined with this patch, is largely equivalent to direct
> > UFFD_USER_MODE_ONLY checks. Additionally, with the SELinux patch
> > above, movement of userfaultfd file descriptors can be mediated by
> > SELinux, preventing one process from acquiring userfaultfd descriptors
> > of other processes unless allowed by security policy.
> >
> > It's an interesting question whether finer-grain SELinux support for
> > controlling UFFD_USER_MODE_ONLY should be added. I can see some
> > advantages to implementing this. However, we don't need to decide that
> > now.
> >
> > Kernel security checks generally break down into DAC (discretionary
> > access control) and MAC (mandatory access control) controls. Most
> > kernel security features check via both of these mechanisms. Security
> > attributes of the system should be settable without necessarily
> > relying on an LSM such as SELinux. This patch follows the same basic
> > model -- system wide control of a hardening feature is provided by the
> > unprivileged_userfaultfd_user_mode_only sysctl (DAC), and if needed,
> > SELinux support for this can also be implemented on top of the DAC
> > controls.
> >
> > This DAC/MAC split has been successful in several other security
> > features. For example, the ability to map at page zero is controlled
> > in DAC via the mmap_min_addr sysctl [1], and via SELinux via the
> > mmap_zero access vector [2]. Similarly, access to the kernel ring
> > buffer is controlled both via DAC as the dmesg_restrict sysctl [3], as
> > well as the SELinux syslog_read [2] check. Indeed, the dmesg_restrict
> > sysctl is very similar to this patch -- it introduces a capability
> > (CAP_SYSLOG, CAP_SYS_PTRACE) check on access to a sensitive resource.
> >
> > If we want to ensure that a security feature will be well tested and
> > vetted, it's important to not limit its use to LSMs only. This ensures
> > that kernel and application developers will always be able to test the
> > effects of a security feature, without relying on LSMs like SELinux.
> > It also ensures that all distributions can enable this security
> > mitigation should it be necessary for their unique environments,
> > without introducing an SELinux dependency. And this patch does not
> > preclude an SELinux implementation should it be necessary.
> >
> > Even if we decide to implement fine-grain SELinux controls on
> > UFFD_USER_MODE_ONLY, we still need this patch. We shouldn't make this
> > an either/or choice between SELinux and this patch. Both are
> > necessary.
> >
> > -- Nick
> >
> > [1] https://wiki.debian.org/mmap_min_addr
> > [2] https://selinuxproject.org/page/NB_ObjectClassesPermissions
> > [3] https://www.kernel.org/doc/Documentation/sysctl/kernel.txt
>
> I am not sure I agree this is similar to dmesg access.
>
> The reason I say it is this: it is pretty easy for admins to know
> whether they run something that needs to access the kernel ring buffer.
> Or if it's a tool developer poking at dmesg, they can tell admins "we
> need these permissions".  But it seems impossible for either an admin to
> know that a userfaultfd page e.g. used with shared memory is accessed
> from the kernel.
>
> So I guess the question is: how does anyone not running Android
> know to set this flag?
>
> I got the feeling it's not really possible, and so for a single-user
> feature like this a single API seems enough.  Given a choice between a
> knob an admin is supposed to set and selinux policy written by
> presumably knowledgeable OS vendors, I'd opt for a second option.
>
> Hope this helps.
>
There has been an emphasis that Android is probably the only user for
the restriction of userfaults from kernel-space and that it wouldn’t
be useful anywhere else. I humbly disagree! There are various areas
where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
user-space setting. Basically, any lazy, on-demand,
initialization/decompression/loading could be a good candidate for
this trick. My project happens to be one of them. In fact, in Android
we are also thinking of using it in some other places, all in
user-space. And given that userfaultfd is an efficient replacement for
this trick [1], there are various scenarios which would benefit from
the restriction of userfaults from kernel-space, provided the admins
care about security on such devices. IIUC, a security admin would
never trust an unprivileged process with userfaults from kernel space.
Therefore, a sysctl knob restriction with CAP_SYS_PTRACE for
privileged processes seems like the right choice to me.

Coming to sysctl vs. SELinux debate, I think wherever the role of OS
vendor and admin is played by different people, I doubt a generic
SELinux policy set by the former will be blindly acceptable to the
latter. Furthermore, I’m not sure if an admin is expected to even know
which packages running on their system are using userfaultfd. So they
anyway have to rely on developers reaching out to get the required
permission. With the new sysctl knob enabled, the number of such
requests is only going to decrease.

[1] https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt

> > >
> > >
> > > > > > >
> > > > > > >
> > > > > > > If answer is "no" the behavior of the new sysctl in patch 2/2 (in
> > > > > > > subject) should be enforceable with minor changes to the BPF
> > > > > > > assembly. Otherwise it'd require more changes.
> > > >
> > > > It would be good to understand what these changes are.
> > > >
> > > > > > > Why exactly is it preferable to enlarge the surface of attack of the
> > > > > > > kernel and take the risk there is a real bug in userfaultfd code (not
> > > > > > > just a facilitation of exploiting some other kernel bug) that leads to
> > > > > > > a privilege escalation, when you still break 99% of userfaultfd users,
> > > > > > > if you set with option "2"?
> > > >
> > > > I can see your point if you think about the feature as a whole.
> > > > However, distributions (such as Android) have specialized knowledge of
> > > > their security environments, and may not want to support the typical
> > > > usages of userfaultfd. For such distributions, providing a mechanism
> > > > to prevent userfaultfd from being useful as an exploit primitive,
> > > > while still allowing the very limited use of userfaultfd for userspace
> > > > faults only, is desirable. Distributions shouldn't be forced into
> > > > supporting 100% of the use cases envisioned by userfaultfd when their
> > > > needs may be more specialized, and this sysctl knob empowers
> > > > distributions to make this choice for themselves.
> > > >
> > > > > > > Is the system owner really going to purely run on his systems CRIU
> > > > > > > postcopy live migration (which already runs with CAP_SYS_PTRACE) and
> > > > > > > nothing else that could break?
> > > >
> > > > This is a great example of a capability which a distribution may not
> > > > want to support, due to distribution specific security policies.
> > > >
> > > > > > >
> > > > > > > Option "2" to me looks with a single possible user, and incidentally
> > > > > > > this single user can already enforce model "2" by only tweaking its
> > > > > > > seccomp-bpf filters without applying 2/2. It'd be a bug if android
> > > > > > > apps runs unprotected by seccomp regardless of 2/2.
> > > >
> > > > Can you elaborate on what bug is present by processes being
> > > > unprotected by seccomp?
> > > >
> > > > Seccomp cannot be universally applied on Android due to previously
> > > > mentioned performance concerns. Seccomp is used in Android primarily
> > > > as a tool to enforce the list of allowed syscalls, so that such
> > > > syscalls can be audited before being included as part of the Android
> > > > API.
> > > >
> > > > -- Nick
> > > >
> > > > --
> > > > Nick Kralevich | nnk@google.com
> > >
> >
> >
> > --
> > Nick Kralevich | nnk@google.com
>
Andrea Arcangeli Sept. 4, 2020, 3:34 a.m. UTC | #19
Hello,

On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote:
> There has been an emphasis that Android is probably the only user for
> the restriction of userfaults from kernel-space and that it wouldn’t
> be useful anywhere else. I humbly disagree! There are various areas
> where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
> user-space setting. Basically, any lazy, on-demand,

For the record what I said is quoted below
https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com :

"""It all boils down of how peculiar it is to be able to leverage only
the acceleration [..] Right now there's a single user that can cope
with that limitation [..] If there will be more users [..]  it'd be
fine to add a value "2" later."""

Specifically I never said "that it wouldn’t be useful anywhere else.".

Also I'm only arguing about the sysctl visible kABI change in patch
2/2: the flag passed as parameter to the syscall in patch 1/2 is all
great, because seccomp needs it in the scalar parameter of the syscall
to implement a filter equivalent to your sysctl "2" policy with only
patch 1/2 applied.

I've two more questions now:

1) why don't you enforce the block of kernel initiated faults with
   seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
   an optimization to remove a few instructions per syscall in the bpf
   execution of Android unprivileged apps? You should block a lot of
   other syscalls by default to all unprivileged processes, including
   vmsplice.

   In other words if it's just for Android, why can't Android solve it
   with only patch 1/2 by tweaking the seccomp filter?

2) given that Android is secure enough with the sysctl at value 2, why
   should we even retain the current sysctl 0 semantics? Why can't
   more secure systems just use seccomp and block userfaultfd, as it
   is already happens by default in the podman default seccomp
   whitelist (for those containers that don't define a new json
   whitelist in the OCI schema)? Shouldn't we focus our energy in
   making containers more secure by preventing the OCI schema of a
   random container to re-enable userfaultfd in the container seccomp
   filter instead of trying to solve this with a global sysctl?

   What's missing in my view is a kubernetes hard allowlist/denylist
   that cannot be overridden with the OCI schema in case people has
   the bad idea of running containers downloaded from a not fully
   trusted source, without adding virt isolation and that's an
   userland problem to be solved in the container runtime, not a
   kernel issue. Then you'd just add userfaultfd to the json of the
   k8s hard seccomp denylist instead of going around tweaking sysctl.

What's your take in changing your 2/2 patch to just replace value "0"
and avoid introducing a new value "2"?

The value "0" was motivated by the concern that uffd can enlarge the
race window for use after free by providing one more additional way to
block kernel faults, but value "2" is already enough to solve that
concern completely and it'll be the default on all Android.

In other words by adding "2" you're effectively doing a more
finegrined and more optimal implementation of "0" that remains useful
and available to unprivileged apps and it already resolves all
"robustness against side effects other kernel bugs" concerns. Clearly
"0" is even more secure statistically but that would apply to every
other syscall including vmsplice, and there's no
/proc/sys/vm/unprivileged_vmsplice sysctl out there.

The next issue we have now is with the pipe mutex (which is not a
major concern but we need to solve it somehow for correctness). So I
wonder if should make the default value to be "0" (or "2" if think we
should not replace "0") and to allow only user initiated faults by
default.

Changing the sysctl default to be 0, will make live migration fail to
switch to postcopy which will be (unnoticeable to the guest), instead
of risking the VM to be killed because of network latency
outlier. Then we wouldn't need to change the pipe code at all.

Alternatively we could still fix the pipe code so it runs better (but
it'll be more complex) or to disable uffd faults only in the pipe
code.

One thing to keep in mind is that if we change the default, then
hypervisor hosts running QEMU would need to set:

vm.userfaultfd = 1

in /etc/sysctl.conf if postcopy live migration is required, that's not
particularly concerning constraint for qemu (there are likely other
tweaks required and it looks less risky than an arbitrary timeout
which could kill the VM: if the above is forgotten the postcopy live
migration won't even start and it'll be unnoticeable to the guest).

The main concern really are future apps that may want to use uffd for
kernel initiated faults won't be allowed to do so by default anymore,
those apps will be heavily incentivated to use bounce buffers before
passing data to syscalls, similarly to the current use case of patch 2/2.

Comments welcome,
Andrea

PS. Another usage of uffd that remains possible without privilege with
the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV
acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib
will remain possible without requiring extra privileges, by adding a
UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS
(UFFDIO_POPULATE just needs to zero out a page and map it, it'll be
indistinguishable to UFFDIO_ZEROPAGE but it will solve the last
performance bottleneck by avoiding a wrprotect fault after the
allocation and it will be THP capable too). Memory will be freed with
MADV_DONTNEED, without ever having to call mmap/mumap. It could move
memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP
which already exists.
Lokesh Gidra Sept. 5, 2020, 12:36 a.m. UTC | #20
On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello,
>
> On Mon, Aug 17, 2020 at 03:11:16PM -0700, Lokesh Gidra wrote:
> > There has been an emphasis that Android is probably the only user for
> > the restriction of userfaults from kernel-space and that it wouldn’t
> > be useful anywhere else. I humbly disagree! There are various areas
> > where the PROT_NONE+SIGSEGV trick is (and can be) used in a purely
> > user-space setting. Basically, any lazy, on-demand,
>
> For the record what I said is quoted below
> https://lkml.kernel.org/r/20200520194804.GJ26186@redhat.com :
>
> """It all boils down of how peculiar it is to be able to leverage only
> the acceleration [..] Right now there's a single user that can cope
> with that limitation [..] If there will be more users [..]  it'd be
> fine to add a value "2" later."""
>
> Specifically I never said "that it wouldn’t be useful anywhere else.".
>
Thanks a lot for clarifying.

> Also I'm only arguing about the sysctl visible kABI change in patch
> 2/2: the flag passed as parameter to the syscall in patch 1/2 is all
> great, because seccomp needs it in the scalar parameter of the syscall
> to implement a filter equivalent to your sysctl "2" policy with only
> patch 1/2 applied.
>
> I've two more questions now:
>
> 1) why don't you enforce the block of kernel initiated faults with
>    seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
>    an optimization to remove a few instructions per syscall in the bpf
>    execution of Android unprivileged apps? You should block a lot of
>    other syscalls by default to all unprivileged processes, including
>    vmsplice.
>
>    In other words if it's just for Android, why can't Android solve it
>    with only patch 1/2 by tweaking the seccomp filter?

I would let Nick (nnk@) and Jeff (jeffv@) respond to this.

The previous responses from both of them on this email thread
(https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/
and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/)
suggest that the performance overhead of seccomp-bpf is too much. Kees
also objected to it
(https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/)

I'm not familiar with how seccomp-bpf works. All that I can add here
is that userfaultfd syscall is usually not invoked in a performance
critical code path. So, if the performance overhead of seccomp-bpf (if
enabled) is observed on all syscalls originating from a process, then
I'd say patch 2/2 is essential. Otherwise, it should be ok to let
seccomp perform the same functionality instead.

>
> 2) given that Android is secure enough with the sysctl at value 2, why
>    should we even retain the current sysctl 0 semantics? Why can't
>    more secure systems just use seccomp and block userfaultfd, as it
>    is already happens by default in the podman default seccomp
>    whitelist (for those containers that don't define a new json
>    whitelist in the OCI schema)? Shouldn't we focus our energy in
>    making containers more secure by preventing the OCI schema of a
>    random container to re-enable userfaultfd in the container seccomp
>    filter instead of trying to solve this with a global sysctl?
>
>    What's missing in my view is a kubernetes hard allowlist/denylist
>    that cannot be overridden with the OCI schema in case people has
>    the bad idea of running containers downloaded from a not fully
>    trusted source, without adding virt isolation and that's an
>    userland problem to be solved in the container runtime, not a
>    kernel issue. Then you'd just add userfaultfd to the json of the
>    k8s hard seccomp denylist instead of going around tweaking sysctl.
>
> What's your take in changing your 2/2 patch to just replace value "0"
> and avoid introducing a new value "2"?

SGTM. Disabling uffd completely for unprivileged processes can be
achieved either using seccomp-bpf, or via SELinux, once the following
patch series is upstreamed
https://lore.kernel.org/lkml/20200827063522.2563293-1-lokeshgidra@google.com/

>
> The value "0" was motivated by the concern that uffd can enlarge the
> race window for use after free by providing one more additional way to
> block kernel faults, but value "2" is already enough to solve that
> concern completely and it'll be the default on all Android.
>
> In other words by adding "2" you're effectively doing a more
> finegrined and more optimal implementation of "0" that remains useful
> and available to unprivileged apps and it already resolves all
> "robustness against side effects other kernel bugs" concerns. Clearly
> "0" is even more secure statistically but that would apply to every
> other syscall including vmsplice, and there's no
> /proc/sys/vm/unprivileged_vmsplice sysctl out there.
>
> The next issue we have now is with the pipe mutex (which is not a
> major concern but we need to solve it somehow for correctness). So I
> wonder if should make the default value to be "0" (or "2" if think we
> should not replace "0") and to allow only user initiated faults by
> default.
>
> Changing the sysctl default to be 0, will make live migration fail to
> switch to postcopy which will be (unnoticeable to the guest), instead
> of risking the VM to be killed because of network latency
> outlier. Then we wouldn't need to change the pipe code at all.
>
SGTM. I can change the default value to '0' (or '2') in the next
revision of patch 2/2, unless somebody objects to this.

> Alternatively we could still fix the pipe code so it runs better (but
> it'll be more complex) or to disable uffd faults only in the pipe
> code.
>
> One thing to keep in mind is that if we change the default, then
> hypervisor hosts running QEMU would need to set:
>
> vm.userfaultfd = 1
>
> in /etc/sysctl.conf if postcopy live migration is required, that's not
> particularly concerning constraint for qemu (there are likely other
> tweaks required and it looks less risky than an arbitrary timeout
> which could kill the VM: if the above is forgotten the postcopy live
> migration won't even start and it'll be unnoticeable to the guest).
>
> The main concern really are future apps that may want to use uffd for
> kernel initiated faults won't be allowed to do so by default anymore,
> those apps will be heavily incentivated to use bounce buffers before
> passing data to syscalls, similarly to the current use case of patch 2/2.
>
> Comments welcome,
> Andrea
>
> PS. Another usage of uffd that remains possible without privilege with
> the 2/2 patch sysctl "2" behavior (besides the strict SIGSEGV
> acceleration) is the UFFD_FEATURE_SIGBUS. That's good so a malloc lib
> will remain possible without requiring extra privileges, by adding a
> UFFDIO_POPULATE to use in combination with UFFD_FEATURE_SIGBUS
> (UFFDIO_POPULATE just needs to zero out a page and map it, it'll be
> indistinguishable to UFFDIO_ZEROPAGE but it will solve the last
> performance bottleneck by avoiding a wrprotect fault after the
> allocation and it will be THP capable too). Memory will be freed with
> MADV_DONTNEED, without ever having to call mmap/mumap. It could move
> memory around with UFFDIO_COPY+MADV_DONTNEED or by adding UFFDIO_REMAP
> which already exists.
>
UFFDIO_POPULATE sounds like a really useful feature. I don't see it in
the kernel yet. Is there a patch under work on this? If so, kindly
share.
Nick Kralevich Sept. 19, 2020, 6:14 p.m. UTC | #21
On Fri, Sep 4, 2020 at 5:36 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Thu, Sep 3, 2020 at 8:34 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > 1) why don't you enforce the block of kernel initiated faults with
> >    seccomp-bpf instead of adding a sysctl value 2? Is the sysctl just
> >    an optimization to remove a few instructions per syscall in the bpf
> >    execution of Android unprivileged apps? You should block a lot of
> >    other syscalls by default to all unprivileged processes, including
> >    vmsplice.
> >
> >    In other words if it's just for Android, why can't Android solve it
> >    with only patch 1/2 by tweaking the seccomp filter?
>
> I would let Nick (nnk@) and Jeff (jeffv@) respond to this.
>
> The previous responses from both of them on this email thread
> (https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/
> and https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/)
> suggest that the performance overhead of seccomp-bpf is too much. Kees
> also objected to it
> (https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/)
>
> I'm not familiar with how seccomp-bpf works. All that I can add here
> is that userfaultfd syscall is usually not invoked in a performance
> critical code path. So, if the performance overhead of seccomp-bpf (if
> enabled) is observed on all syscalls originating from a process, then
> I'd say patch 2/2 is essential. Otherwise, it should be ok to let
> seccomp perform the same functionality instead.
>

There are two primary reasons why seccomp isn't viable here.

1) Seccomp was never designed for whole-of-system protections, and is
impractical to deploy for anything other than "leaf" processes.
2) Attempts to enable seccomp on Android have run into performance
problems, even for trivial seccomp filters.

Let's go into each one.

Issue #1: Seccomp was never designed for whole-of-system protections,
and is impractical to deploy for anything other than "leaf" processes.

Andrea suggests deploying a seccomp filter purely focused on Android
unprivileged[1] (third party installed) apps. However, the intention
is for this security control to be used system-wide[2]. Only processes
which have a need for kernel initiated faults should be allowed to use
them; all other processes should be denied by default. And when I say
"all' processes, I mean "all" processes, even those which run with
UID=0. Andrea's proposal is akin to a denylist, where many modern
distributions (such as Android) use allowlists.

The seemingly obvious solution is to apply a global seccomp filter in
init (PID=1), but it falls down in practice. Seccomp is an incredibly
useful tool, but it wasn't designed to be applied system-wide. Seccomp
is fundamentally hierarchical in nature. A seccomp filter, once
applied, cannot be subsequently relaxed or removed in child processes.
While this restriction is great for leaf processes, it causes problems
for OS designers - a parent process must maintain an unused capability
if any process in the parent's process tree uses that capability. This
makes applying a userfaultfd seccomp filter in init impossible, since
we expect a few of init's children (but not init itself or most of
init's children) to use userfaultfd kernel faults. We end up back to a
wack-a-mole (denylist) problem of trying to modify each individual
process to block userfaultfd kernel faults, defeating the goals of
system-wide protection, and introducing significant complexity into
the system design.

Seccomp should be used in the context where it provides the most value
-- process leaf nodes. But trying to apply seccomp as a system-wide
control just isn't viable.

Lokesh's sysctl proposal doesn't have these problems. When the sysctl
is set to 2 by the OS distributor, all processes which don't have
CAP_SYS_PTRACE are denied kernel generated faults, making the system
safe-by-default. Only processes which are on the OS distributor's
CAP_SYS_PTRACE allowlist (see Android's allowlist at [3]) can generate
these faults, and capabilities can be managed without regards to
process hierarchy. This keeps the system minimally privileged and
safe.

Seccomp isn't a viable solution here.

Issue #2: Attempts to enable seccomp on Android globally have run into
performance problems, even for trivial seccomp filters.

Android has tried a few times to enable seccomp globally, but even
excluding the above-mentioned hierarchical process problems, we've
seen performance regressions across the board. Imposing a seccomp
filter merely for userfaultfd imposes a tax on every syscall, even if
the process never makes use of userfaultfd. Lokesh's sysctl proposal
avoids this tax and places the check where it's most effective, with
the rest of the userfaultfd functionality.

See also the threads that Lokesh mentioned above:

* https://lore.kernel.org/lkml/CABXk95A-E4NYqA5qVrPgDF18YW-z4_udzLwa0cdo2OfqVsy=SQ@mail.gmail.com/
* https://lore.kernel.org/lkml/CAFJ0LnGfrzvVgtyZQ+UqRM6F3M7iXOhTkUBTc+9sV+=RrFntyQ@mail.gmail.com/
* https://lore.kernel.org/lkml/202005200921.2BD5A0ADD@keescook/

Thanks,
-- Nick

[1] The use of the term "unprivileged" is unfortunate. In Android,
there's no coarse-grain privileged vs unprivileged process. Each
process, including root processes, have only the privileges they need,
and not a bit more. As a concrete example, Android's init process
(PID=1) is not allowed to open TCP/UDP sockets, but is allowed to
spawn children which can do so. Having each process be differently
privileged, and ensuring that functionality is only given out on a
need-to-have basis, is an important part of modern OS design.

[2] The trend in modern exploits isn't to perform attacks directly
from untrusted code to the kernel. A lot of the attack surface needed
by an attacker isn't reachable directly from untrusted code, but only
indirectly through other processes. The attacker moves laterally
through the system, exploiting a process which has the necessary
capabilities, then escalating to the kernel. Enforcing security
controls system-wide is an important part of denying an attacker the
tools for an effective exploit and preventing this kind of lateral
movement from being useful. Denying an attacker access to kernel
initiated faults in userfaultfd system-wide (except for authorized
processes) is doubly important, as these kinds of faults are extremely
valuable to an exploit writer (see explanation at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cefdca0a86be517bc390fc4541e3674b8e7803b0
or https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit)

[3] https://android.googlesource.com/platform/system/sepolicy/+/7be9e9e372c70a5518f729a0cdcb0d39a28be377/private/domain.te#107
line 107
diff mbox series

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 0329a4d3fa9e..4296b508ab74 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -850,6 +850,19 @@  privileged users (with SYS_CAP_PTRACE capability).
 
 The default value is 1.
 
+unprivileged_userfaultfd_user_mode_only
+========================================
+
+This flag controls whether unprivileged users can use the userfaultfd
+system calls to handle page faults in kernel mode.  If set to zero,
+userfaultfd works with or without UFFD_USER_MODE_ONLY, modulo
+unprivileged_userfaultfd above.  If set to one, users without
+SYS_CAP_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd
+to succeed.  Prohibiting use of userfaultfd for handling faults from
+kernel mode may make certain vulnerabilities more difficult
+to exploit.
+
+The default value is 0.
 
 user_reserve_kbytes
 ===================
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 21378abe8f7b..85cc1ab74361 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -29,6 +29,7 @@ 
 #include <linux/hugetlb.h>
 
 int sysctl_unprivileged_userfaultfd __read_mostly = 1;
+int sysctl_unprivileged_userfaultfd_user_mode_only __read_mostly = 0;
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
@@ -2009,8 +2010,16 @@  SYSCALL_DEFINE1(userfaultfd, int, flags)
 	static const int uffd_flags = UFFD_USER_MODE_ONLY;
 	struct userfaultfd_ctx *ctx;
 	int fd;
+	bool need_cap_check = false;
 
-	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
+	if (!sysctl_unprivileged_userfaultfd)
+		need_cap_check = true;
+
+	if (sysctl_unprivileged_userfaultfd_user_mode_only &&
+	    (flags & UFFD_USER_MODE_ONLY) == 0)
+		need_cap_check = true;
+
+	if (need_cap_check && !capable(CAP_SYS_PTRACE))
 		return -EPERM;
 
 	BUG_ON(!current->mm);
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index a8e5f3ea9bb2..d81e30074bf5 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -31,6 +31,7 @@ 
 #define UFFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS)
 
 extern int sysctl_unprivileged_userfaultfd;
+extern int sysctl_unprivileged_userfaultfd_user_mode_only;
 
 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason);
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8727a3..9cbdf4483961 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1719,6 +1719,15 @@  static struct ctl_table vm_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "unprivileged_userfaultfd_user_mode_only",
+		.data		= &sysctl_unprivileged_userfaultfd_user_mode_only,
+		.maxlen		= sizeof(sysctl_unprivileged_userfaultfd_user_mode_only),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 	{ }
 };