[v2,0/5] mm/memfd: MFD_NOEXEC for memfd_create

Message ID 20220805222126.142525-1-jeffxu@google.com (mailing list archive)

Message

Jeff Xu Aug. 5, 2022, 10:21 p.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

Hi,

This is v2 of the MFD_NOEXEC series. It:
1> addresses the comments on v1
2> adds a sysctl (vm.mfd_noexec) to change the default file permissions
    of memfd_create to be non-executable.

Below is the cover letter for v1:

The default file permissions on a memfd include execute bits, which
means that such a memfd can be filled with an executable and passed to
the exec() family of functions. This is undesirable on systems where all
code is verified and all filesystems are intended to be mounted noexec,
since an attacker may be able to use a memfd to load unverified code and
execute it.
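
As an illustration only (this is not part of the posted patches), the
default mode bits can be inspected from userspace. The sketch below creates
a memfd with no flags and reports whether any execute bit is set. The helper
name memfd_default_is_executable() is hypothetical, and the raw syscall is
used only so the sketch does not depend on a particular libc wrapper.

```c
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the series.
 * Returns 1 if a freshly created memfd has any execute bit set,
 * 0 if it has none, and -1 if the memfd could not be created or inspected.
 */
int memfd_default_is_executable(void)
{
    struct stat st;
    /* No MFD_* flags are passed, so the kernel's default mode applies. */
    int fd = (int)syscall(SYS_memfd_create, "exec-probe", 0u);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (st.st_mode & (S_IXUSR | S_IXGRP | S_IXOTH)) ? 1 : 0;
}
```

On a kernel where the default described above is unchanged, the helper
would be expected to return 1.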

Additionally, execution via memfd is a common way to avoid scrutiny for
malicious code, since it allows execution of a program without a file
ever appearing on disk. This attack vector is not totally mitigated with
this new flag, since the default memfd file permissions must remain
executable to avoid breaking existing legitimate uses, but it should be
possible to use other security mechanisms to prevent memfd_create calls
without MFD_NOEXEC on systems where it is known that executable memfds
are not necessary.

This patch series adds a new MFD_NOEXEC flag for memfd_create(), which
allows creation of non-executable memfds, and as part of the
implementation of this new flag, it also adds a new F_SEAL_EXEC seal,
which will prevent modification of any of the execute bits of a sealed
memfd.

I am not sure if this is the best way to implement the desired behavior
(for example, the F_SEAL_EXEC seal is really more of an implementation
detail and feels a bit clunky to expose), so suggestions are welcome
for alternate approaches.

v1: https://lwn.net/Articles/890096/

Daniel Verkamp (4):
  mm/memfd: add F_SEAL_EXEC
  mm/memfd: add MFD_NOEXEC flag to memfd_create
  selftests/memfd: add tests for F_SEAL_EXEC
  selftests/memfd: add tests for MFD_NOEXEC

Jeff Xu (1):
  sysctl: add support for mfd_noexec

 include/linux/mm.h                         |   4 +
 include/uapi/linux/fcntl.h                 |   1 +
 include/uapi/linux/memfd.h                 |   1 +
 kernel/sysctl.c                            |   9 ++
 mm/memfd.c                                 |  39 ++++-
 mm/shmem.c                                 |   6 +
 tools/testing/selftests/memfd/memfd_test.c | 163 ++++++++++++++++++++-
 7 files changed, 221 insertions(+), 2 deletions(-)


base-commit: 9e2f40233670c70c25e0681cb66d50d1e2742829

Comments

Kees Cook Aug. 8, 2022, 5:46 p.m. UTC | #1
On Fri, Aug 05, 2022 at 10:21:21PM +0000, jeffxu@google.com wrote:
> This v2 series MFD_NOEXEC, this series includes:
> 1> address comments in V1
> 2> add sysctl (vm.mfd_noexec) to change the default file permissions
>     of memfd_create to be non-executable.
> 
> Below are cover-level for v1:
> 
> The default file permissions on a memfd include execute bits, which
> means that such a memfd can be filled with a executable and passed to
> the exec() family of functions. This is undesirable on systems where all
> code is verified and all filesystems are intended to be mounted noexec,
> since an attacker may be able to use a memfd to load unverified code and
> execute it.

I would absolutely like to see some kind of protection here. However,
I'd like a more specific threat model. What are the cases where the X
bit has been abused (e.g.[1])? What are the cases where the X bit is
needed (e.g.[2])? With those in mind, it should be possible to draw
a clear line between the two cases. (e.g. we need to avoid a confused
deputy attack where an "unprivileged" user can pass an executable memfd
to a "privileged" user. How those privileges are defined may matter a
lot based on how memfds are being used. For example, can runc's use of
executable memfds be distinguished from an attacker's?)

> Additionally, execution via memfd is a common way to avoid scrutiny for
> malicious code, since it allows execution of a program without a file
> ever appearing on disk. This attack vector is not totally mitigated with
> this new flag, since the default memfd file permissions must remain
> executable to avoid breaking existing legitimate uses, but it should be
> possible to use other security mechanisms to prevent memfd_create calls
> without MFD_NOEXEC on systems where it is known that executable memfds
> are not necessary.

This reminds me of dealing with non-executable stacks. There ended up
being three states:

- requested to be executable (PT_GNU_STACK X)
- requested to be non-executable (PT_GNU_STACK NX)
- undefined (no PT_GNU_STACK)

The first two are clearly defined, but the third needed a lot of special
handling. For a "safe by default" world, the third should be "NX", but
old stuff depended on it being "X".

Here, we have a bit being present or not, so we only have a binary
state. I'd much rather the default be NX (no bit set) instead of making
every future (safe) user of memfd have to specify MFD_NOEXEC.

It's also easier on a filtering side to say "disallow memfd_create with
MFD_EXEC", but how do we deal with the older software?

If the default perms of memfd_create()'s exec bit is controlled by a
sysctl and the sysctl is set to "leave it executable", how does a user
create an NX memfd? (i.e. setting MFD_EXEC means "exec" and not setting
it means "exec" also.) Are two bits needed? Seems wasteful.
MFD_I_KNOW_HOW_TO_SET_EXEC | MFD_EXEC, etc...

For F_SEAL_EXEC, it seems this should imply F_SEAL_WRITE if forced
executable to avoid WX mappings (i.e. provide W^X from the start).
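
A rough sketch of that rule, for illustration only: when the exec seal is
applied to a memfd that is (and stays) executable, the write seal is added
in the same step. The SKETCH_* bit values and the helper name are
placeholders chosen for this example, not definitions from the series.

```c
/* Placeholder bit values for illustration; real code would use the uapi headers. */
#define SKETCH_F_SEAL_WRITE 0x0008u
#define SKETCH_F_SEAL_EXEC  0x0020u

/*
 * Hypothetical helper: returns the set of seals to apply when the caller
 * asks for the exec seal. If the memfd is executable, the write seal is
 * implied so the file can never be both writable and executable.
 */
unsigned int seals_for_exec_seal(int is_executable)
{
    unsigned int seals = SKETCH_F_SEAL_EXEC;

    if (is_executable)
        seals |= SKETCH_F_SEAL_WRITE;
    return seals;
}
```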

-Kees

[1] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20memfd%20escalation&can=1
[2] https://lwn.net/Articles/781013/
Jeff Xu Nov. 1, 2022, 11:14 p.m. UTC | #2
Hi Kees

Sorry for the long overdue reply.

Those questions are really helpful for understanding how memfd_create is used.
I will try to answer them inline below.

On Mon, Aug 8, 2022 at 10:46 AM Kees Cook <keescook@chromium.org> wrote:
>
> On Fri, Aug 05, 2022 at 10:21:21PM +0000, jeffxu@google.com wrote:
> > This v2 series MFD_NOEXEC, this series includes:
> > 1> address comments in V1
> > 2> add sysctl (vm.mfd_noexec) to change the default file permissions
> >     of memfd_create to be non-executable.
> >
> > Below are cover-level for v1:
> >
> > The default file permissions on a memfd include execute bits, which
> > means that such a memfd can be filled with a executable and passed to
> > the exec() family of functions. This is undesirable on systems where all
> > code is verified and all filesystems are intended to be mounted noexec,
> > since an attacker may be able to use a memfd to load unverified code and
> > execute it.
>
> I would absolutely like to see some kind of protection here. However,
> I'd like a more specific threat model. What are the cases where the X
> bit has been abused (e.g.[1])? What are the cases where the X bit is
> needed (e.g.[2])? With those in mind, it should be possible to draw
> a clear line between the two cases. (e.g. we need to avoid a confused
> deputy attack where an "unprivileged" user can pass an executable memfd
> to a "privileged" user. How those privileges are defined may matter a
> lot based on how memfds are being used. For example, can runc's use of
> executable memfds be distinguished from an attacker's?)
>
runc needs its memfd to be executable, so a host running runc needs to be able
to create both non-executable and executable memfds.
The memfd_create API by itself can't enforce how the resulting memfd is used.

> > Additionally, execution via memfd is a common way to avoid scrutiny for
> > malicious code, since it allows execution of a program without a file
> > ever appearing on disk. This attack vector is not totally mitigated with
> > this new flag, since the default memfd file permissions must remain
> > executable to avoid breaking existing legitimate uses, but it should be
> > possible to use other security mechanisms to prevent memfd_create calls
> > without MFD_NOEXEC on systems where it is known that executable memfds
> > are not necessary.
>
> This reminds me of dealing with non-executable stacks. There ended up
> being three states:
>
> - requested to be executable (PT_GNU_STACK X)
> - requested to be non-executable (PT_GNU_STACK NX)
> - undefined (no PT_GNU_STACK)
>
> The first two are clearly defined, but the third needed a lot of special
> handling. For a "safe by default" world, the third should be "NX", but
> old stuff depended on it being "X".
>
> Here, we have a bit being present or not, so we only have a binary
> state. I'd much rather the default be NX (no bit set) instead of making
> every future (safe) user of memfd have to specify MFD_NOEXEC.
>
> It's also easier on a filtering side to say "disallow memfd_create with
> MFD_EXEC", but how do we deal with the older software?
>
> If the default perms of memfd_create()'s exec bit is controlled by a
> sysctl and the sysctl is set to "leave it executable", how does a user
> create an NX memfd? (i.e. setting MFD_EXEC means "exec" and not setting
> it means "exec" also.) Are two bits needed? Seems wasteful.
> MFD_I_KNOW_HOW_TO_SET_EXEC | MFD_EXEC, etc...
>
Great points. With those questions and use cases in mind, I'm thinking of the following:

1> memfd_create:
Add two flags:
#define MFD_EXEC                      0x0008
#define MFD_NOEXEC_SEAL    0x0010
This lets an application set the executable bit explicitly.
(If an application sets both, the call is rejected.)

2> For old applications that don't set the executable bit:
Add a pid-namespaced sysctl, kernel.pid_mfd_noexec, with:
value = 0: Default_EXEC
     Honor MFD_EXEC and MFD_NOEXEC_SEAL
     When neither is set, fall back to the original behavior (EXEC)
value = 1: Default_NOEXEC_SEAL
     Honor MFD_EXEC and MFD_NOEXEC_SEAL
     When neither is set, default to MFD_NOEXEC_SEAL

3> Add a pid-namespaced sysctl, kernel.pid_mfd_noexec_enforced, with:
value = 0: default, not enforced.
value = 1: enforce NOEXEC_SEAL (override everything)
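
To make item 1> concrete, here is a minimal sketch of the flag check,
assuming the bit values proposed above. The PROPOSED_* names, the helper
name, and the choice of -EINVAL as the error are assumptions for
illustration, not a final uapi definition.

```c
#include <errno.h>

/* Values as proposed in this message; not a final uapi definition. */
#define PROPOSED_MFD_EXEC        0x0008u
#define PROPOSED_MFD_NOEXEC_SEAL 0x0010u

/*
 * Hypothetical helper: returns 0 if the exec-related flags are acceptable,
 * or -EINVAL (an assumed error code) if both are set at once.
 */
int check_exec_flags(unsigned int flags)
{
    const unsigned int both = PROPOSED_MFD_EXEC | PROPOSED_MFD_NOEXEC_SEAL;

    if ((flags & both) == both)
        return -EINVAL;
    return 0;
}
```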

Then memfd can be used and secured at the host and container level as follows:
At host level:
Case A> On a secure-by-default system that doesn't allow executable memfds:
sysctl kernel.pid_mfd_noexec_enforced = 1
Use an LSM to block creation of executable memfds system-wide.
This requires a new hook: secure_memfd_create

Case B> On a system that needs both (the runc case):
Use sysctl kernel.pid_mfd_noexec = 0/1 while applications are being converted
to the new API.
Use SELinux or Landlock to sandbox the process (requires work).

At container level:
It would be nice for a container to control creation of executable memfds too.
This is done through sysctl kernel.pid_mfd_noexec_enforced.
This lets runc create two types of containers:
one with the ability to create executable memfds, and one without.

The sysctl kernel.pid_mfd_noexec sets the default value, which is helpful
while applications are being migrated to set the executable bit explicitly.
Alternatively, we can have a new syscall, memfd_create2, where it is mandatory
to set the executable bit (or which defaults to NOEXEC_SEAL); then
sysctl kernel.pid_mfd_noexec is not needed.

> For F_SEAL_EXEC, it seems this should imply F_SEAL_WRITE if forced
> executable to avoid WX mappings (i.e. provide W^X from the start).
>
Yes. I agree.

Thanks!
Best regards,
Jeff Xu


> -Kees
>
> [1] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20memfd%20escalation&can=1
> [2] https://lwn.net/Articles/781013/
>
> --
> Kees Cook
Kees Cook Nov. 2, 2022, 2:45 a.m. UTC | #3
On Tue, Nov 01, 2022 at 04:14:39PM -0700, Jeff Xu wrote:
> Sorry for the long overdue reply.

No worries! I am a fan of thread necromancy. :)

> [...]
> 1> memfd_create:
> Add two flags:
> #define MFD_EXEC                      0x0008
> #define MFD_NOEXEC_SEAL    0x0010
> This lets application to set executable bit explicitly.
> (If application set both, it will be rejected)

So no MFD_NOEXEC without seal? (I'm fine with that.)

> 2> For old application that doesn't set executable bit:
> Add a pid name-spaced sysctl.kernel.pid_mfd_noexec, with:

bikeshed: vm.memfd_noexec
(doesn't belong in "kernel", and seems better suited to "vm" than "fs")

> value = 0: Default_EXEC
>      Honor MFD_EXEC and MFD_NOEXEC_SEAL
>      When none is set, will fall back to original behavior (EXEC)

Yeah. Rephrasing for myself to understand more clearly:

"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like MFD_EXEC
was set."

> value = 1: Default_NOEXEC_SEAL
>       Honor MFD_EXEC and MFD_NOEXEC_SEAL
>       When none is set, will default to MFD_NOEXEC_SEAL

"memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like
MFD_NOEXEC_SEAL was set."

Also, I think there should be a pr_warn_ratelimited() when
memfd_create() is used without either bit, so that there is some
pressure on callers to adjust their API calls to explicitly set a bit.

> 3> Add a pid name-spaced sysctl kernel.pid_mfd_noexec_enforced: with:
> value = 0: default, not enforced.
> value = 1: enforce NOEXEC_SEAL (overwrite everything)

How about making this just mode "value 2" for the first sysctl?
"memfd_create() without MFD_NOEXEC_SEAL will be rejected."

-Kees
Jeff Xu Nov. 2, 2022, 5:18 p.m. UTC | #4
On Tue, Nov 1, 2022 at 7:45 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Tue, Nov 01, 2022 at 04:14:39PM -0700, Jeff Xu wrote:
> > Sorry for the long overdue reply.
>
> No worries! I am a fan of thread necromancy. :)
>
> > [...]
> > 1> memfd_create:
> > Add two flags:
> > #define MFD_EXEC                      0x0008
> > #define MFD_NOEXEC_SEAL    0x0010
> > This lets application to set executable bit explicitly.
> > (If application set both, it will be rejected)
>
> So no MFD_NOEXEC without seal? (I'm fine with that.)
>
No MFD_NOEXEC, because a memfd can be chmod-ed to add x after creation,
so it is not secure.

No MFD_EXEC_SEAL, because it is better to apply both the w and x seals
within the same function call, and the w seal can't be applied at creation time.

> > 2> For old application that doesn't set executable bit:
> > Add a pid name-spaced sysctl.kernel.pid_mfd_noexec, with:
>
> bikeshed: vm.memfd_noexec
> (doesn't belong in "kernel", and seems better suited to "vm" than "fs")
>
SG, will use vm.memfd_noexec

> > value = 0: Default_EXEC
> >      Honor MFD_EXEC and MFD_NOEXEC_SEAL
> >      When none is set, will fall back to original behavior (EXEC)
>
> Yeah. Rephrasing for myself to understand more clearly:
>
> "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like MFD_EXEC
> was set."
>
> > value = 1: Default_NOEXEC_SEAL
> >       Honor MFD_EXEC and MFD_NOEXEC_SEAL
> >       When none is set, will default to MFD_NOEXEC_SEAL
>
> "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL act like
> MFD_NOEXEC_SEAL was set."
>
Copy, this is clearer. Thanks.

> Also, I think there should be a pr_warn_ratelimited() when
> memfd_create() is used without either bit, so that there is some
> pressure to please adjust their API calls to explicitly set a bit.
>
Sure

> > 3> Add a pid name-spaced sysctl kernel.pid_mfd_noexec_enforced: with:
> > value = 0: default, not enforced.
> > value = 1: enforce NOEXEC_SEAL (overwrite everything)
>
> How about making this just mode "value 2" for the first sysctl?
> "memfd_create() without MFD_NOEXEC_SEAL will be rejected."
>
Good point. Having the kernel override the caller's flags might not be good
practice. I will add this to vm.memfd_noexec:
value = 2: "memfd_create() without MFD_NOEXEC_SEAL will be rejected."
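
Putting the three values together, the behavior agreed on in this exchange
could be sketched roughly as below. This is only an illustration of the
semantics discussed here: the PROPOSED_* values are the ones from earlier in
the thread, the enum and function names are hypothetical, and the error
codes (-EINVAL, -EACCES) are assumptions, since the thread does not name any.

```c
#include <errno.h>

/* Flag values as proposed earlier in the thread; not a final uapi definition. */
#define PROPOSED_MFD_EXEC        0x0008u
#define PROPOSED_MFD_NOEXEC_SEAL 0x0010u

enum memfd_exec_mode {
    MEMFD_MODE_EXEC,
    MEMFD_MODE_NOEXEC_SEAL,
};

/*
 * Hypothetical helper: resolve the effective mode for one memfd_create()
 * call given the caller's flags and the sysctl value (0, 1 or 2).
 * Returns 0 and fills *out on success, or a negative errno on rejection.
 */
int resolve_memfd_exec_mode(unsigned int flags, int sysctl_val,
                            enum memfd_exec_mode *out)
{
    int want_exec = (flags & PROPOSED_MFD_EXEC) != 0;
    int want_noexec = (flags & PROPOSED_MFD_NOEXEC_SEAL) != 0;

    /* Setting both flags is contradictory and is rejected. */
    if (want_exec && want_noexec)
        return -EINVAL;

    /* value = 2: any call without MFD_NOEXEC_SEAL is rejected. */
    if (sysctl_val == 2 && !want_noexec)
        return -EACCES;

    /* An explicit flag is always honored for values 0 and 1. */
    if (want_exec) {
        *out = MEMFD_MODE_EXEC;
        return 0;
    }
    if (want_noexec) {
        *out = MEMFD_MODE_NOEXEC_SEAL;
        return 0;
    }

    /* Neither flag set: value 0 keeps the old default, value 1 flips it. */
    *out = (sysctl_val == 1) ? MEMFD_MODE_NOEXEC_SEAL : MEMFD_MODE_EXEC;
    return 0;
}
```

With value 2 the call fails instead of being silently rewritten, which is
the difference from the earlier "enforced" sysctl idea.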

Thanks!
Jeff

> -Kees
>
> --
> Kees Cook