mbox series

[RFC,v1,0/8] Introduce mseal() syscall

Message ID 20231016143828.647848-1-jeffxu@chromium.org (mailing list archive)
Headers show
Series Introduce mseal() syscall | expand

Message

Jeff Xu Oct. 16, 2023, 2:38 p.m. UTC
From: Jeff Xu <jeffxu@google.com>

This patchset proposes a new mseal() syscall for the Linux kernel. 

Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and point the code to it, the memory has to be marked with X bit, or
else an exception will happen.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.

The new mseal() is an architecture independent syscall, and with
following signature:

mseal(void addr, size_t len, unsigned int types, unsigned int flags)

addr/len: memory range.  Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.

types: bit mask to specify which syscall to seal, currently they are:
MM_SEAL_MSEAL 0x1
MM_SEAL_MPROTECT 0x2
MM_SEAL_MUNMAP 0x4
MM_SEAL_MMAP 0x8
MM_SEAL_MREMAP 0x10

Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
is that the API is extendable, i.e. when needed, the sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.

The kernel will remember which seal types are applied, and the application
doesn’t need to repeat all existing seal types in the next mseal().  Once
a seal type is applied, it can’t be unsealed. Call mseal() on an existing
seal type is a no-action, not a failure.

MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.

Internally, vm_area_struct adds a new field vm_seals, to store the bit
masks. 

For the affected syscalls, such as mprotect, a check(can_modify_mm) for
sealing is added, this usually happens at the early point of the syscall,
before any update is made to VMAs. The effect of that is: if any of the
VMAs in the given address range fails the sealing check, none of the VMA
will be updated. It might be worth noting that this is different from the
rest of mprotect(), where some updates can happen even when mprotect
returns fail. Consider can_modify_mm only checks vm_seals in
vm_area_struct, and it is not going deeper in the page table or updating
any HW, success or none behavior might fit better here. I would like to
listen to the community's feedback on this.

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.

In addition, Stephen is working on glibc change to add sealing support
into the dynamic linker to seal all non-writable segments at startup. When
that work is completed, all applications can automatically benefit from
these new protections.

[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc

Jeff Xu (8):
  Add mseal syscall
  Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: seal mprotect
  mseal munmap
  mseal mremap
  mseal mmap
  selftest mm/mseal mprotect/munmap/mremap/mmap

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 fs/aio.c                                    |    5 +-
 include/linux/mm.h                          |   55 +-
 include/linux/mm_types.h                    |    7 +
 include/linux/syscalls.h                    |    2 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 include/uapi/linux/mman.h                   |    6 +
 ipc/shm.c                                   |    3 +-
 kernel/sys_ni.c                             |    1 +
 mm/Kconfig                                  |    8 +
 mm/Makefile                                 |    1 +
 mm/internal.h                               |    4 +-
 mm/mmap.c                                   |   49 +-
 mm/mprotect.c                               |    6 +
 mm/mremap.c                                 |   19 +-
 mm/mseal.c                                  |  328 +++++
 mm/nommu.c                                  |    6 +-
 mm/util.c                                   |    8 +-
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
 37 files changed, 1934 insertions(+), 28 deletions(-)
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

Comments

Matthew Wilcox (Oracle) Oct. 16, 2023, 3:18 p.m. UTC | #1
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> Modern CPUs support memory permissions such as RW and NX bits. Linux has
> supported NX since the release of kernel version 2.6.8 in August 2004 [1].

This seems like a confusing way to introduce the subject.  Here, you're
talking about page permissions, whereas (as far as I can tell), mseal() is
about making _virtual_ addresses immutable, for some value of immutable.

> Memory sealing additionally protects the mapping itself against
> modifications. This is useful to mitigate memory corruption issues where
> a corrupted pointer is passed to a memory management syscall. For example,
> such an attacker primitive can break control-flow integrity guarantees
> since read-only memory that is supposed to be trusted can become writable
> or .text pages can get remapped. Memory sealing can automatically be
> applied by the runtime loader to seal .text and .rodata pages and
> applications can additionally seal security critical data at runtime.
> A similar feature already exists in the XNU kernel with the
> VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.

This [2] seems very generic and wide-ranging, not helpful.  [5] was more
useful to understand what you're trying to do.

> The new mseal() is an architecture independent syscall, and with
> following signature:
> 
> mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> 
> addr/len: memory range.  Must be continuous/allocated memory, or else
> mseal() will fail and no VMA is updated. For details on acceptable
> arguments, please refer to comments in mseal.c. Those are also fully
> covered by the selftest.

Mmm.  So when you say "continuous/allocated" what you really mean is
"Must have contiguous VMAs" rather than "All pages in this range must
be populated", yes?

> types: bit mask to specify which syscall to seal, currently they are:
> MM_SEAL_MSEAL 0x1
> MM_SEAL_MPROTECT 0x2
> MM_SEAL_MUNMAP 0x4
> MM_SEAL_MMAP 0x8
> MM_SEAL_MREMAP 0x10

I don't understand why we want this level of granularity.  The OpenBSD
and XNU examples just say "This must be immutable*".  For values of
immutable that allow downgrading access (eg RW to RO or RX to RO),
but not upgrading access (RW->RX, RO->*, RX->RW).

> Each bit represents sealing for one specific syscall type, e.g.
> MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> is that the API is extendable, i.e. when needed, the sealing can be
> extended to madvise, mlock, etc. Backward compatibility is also easy.

Honestly, it feels too flexible.  Why not just two flags to mprotect()
-- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
maybe for some things we want to be able to downgrade and for other
things, we don't.

I'd like to see some discussion of how this interacts with mprotect().
As far as I can tell, the intent is to lock the protections/existance
of the mapping, and not to force memory to stay in core.  So it's fine
for the kernel to swap out the page and set up a PTE as a swap entry.
It's also fine for the kernel to mark PTEs as RO to catch page faults;
we're concerned with the LOGICAL permissions, and not the page tables.
Linus Torvalds Oct. 16, 2023, 5:23 p.m. UTC | #2
On Mon, 16 Oct 2023 at 07:38, <jeffxu@chromium.org> wrote:
>
> This patchset proposes a new mseal() syscall for the Linux kernel.

So I have no objections to adding some kind of "lock down memory
mappings" model, but this isn't it.

First off, the simple stuff: the commit messages are worthless. Having

   check seal for mmap(2)

as the commit message is not even remotely acceptable, to pick one
random example from the series (7/8).

But that doesn't matter much, since I think the more fundamental
problems are much worse:

 - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is
just complete noise and totally illogical. The whole concept needs to
be redone.

Example of complete nonsense (again, picking 7/8, but that's again
just a random example):

> @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages,
>                 flags |= MAP_LOCKED;
>
>         file = get_file(vma->vm_file);
> -       ret = do_mmap(vma->vm_file, start, size,
> -                       prot, flags, pgoff, &populate, NULL);
> +       ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
> +                       &populate, NULL, ON_BEHALF_OF_KERNEL);
>         fput(file);
>  out:
>         mmap_write_unlock(mm);

Christ. That's *literally* the remap_file_pages() system call
definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense
in this context.

It's not the only situation. "mremap() as the same thing. vm_munmap()
has the same thing. vm_brk_flags() has the same thing. None of them
make any sense.

But even if those obvious kinds of complete mis-designs were to be
individually fixed, the whole naming and concept is bogus. The
"ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check
sealing". Naming should describe what the thing *means*, not some
random policy thing that may or may not be at all relevant.

 - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away.

Not only is it unacceptable to pointlessly create two different name
spaces for no obvious reason, code like this (from 1/8) should not
exist:

> +       if (types & MM_SEAL_MSEAL)
> +               newtypes |= VM_SEAL_MSEAL;
> +
> +       if (types & MM_SEAL_MPROTECT)
> +               newtypes |= VM_SEAL_MPROTECT;
> +
> +       if (types & MM_SEAL_MUNMAP)
> +               newtypes |= VM_SEAL_MUNMAP;
> +
> +       if (types & MM_SEAL_MMAP)
> +               newtypes |= VM_SEAL_MMAP;
> +
> +       if (types & MM_SEAL_MREMAP)
> +               newtypes |= VM_SEAL_MREMAP;

Sure, we have that in some cases when there was a *reason* for
namespacing imposed on us from an API standpoint (ie the "open()"
flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags
being turned into MNT_xyz flags), but those tend to be signs of
problems in the system call API where some mixup made it impossible to
use the flags directly (most commonly because there is some extended
internal representation of said things).

For example, the MS_xyz namespace is a combination of "these are flags
for the new mount" (like MS_RDONLY) and "this is how you should mount
it" (like MS_REMOUNT), and so the user interface has that odd mixing
of different things, and the MNT_xyz namespace is a distillation of
the former.

But we certainly should not strive to introduce *new* interfaces that
start out with this kind of mixup and pointless "translate from one
bit to another" code.

 - And finally (for now), I hate that MM_ACTION_xyz thing too!

Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8):

> +       switch (action) {
> +       case MM_ACTION_MPROTECT:
> +               if (vma->vm_seals & VM_SEAL_MPROTECT)
> +                       return false;
> +               break;

when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and
sealing bitmask and just say

        if (vma->vm_seal & vm_action)
                return -EPERM;

IOW, you have pointlessly introduced not *two* different namespaces,
but *three*. All doing the exact same thing, and all just causing
pointless and ugly code to "translate" the actions of one into the
model of another.

So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz"
thing. Use *one* single "these are the sealing bits".

And get rid of "enum caller_origin" entirely. I don't know what the
right model for that thing is, but that isn't it.

*Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space
cannot set, but that the kernel can use internally - and if that is
the right model, then dammit, the *uses* should be very very obvious
why the override is fine, because that remap_file_pages() use sure as
hell was *not* obvious.

In fact, it's not at all obvious why anything should override the
sealing bits - EVER. So I'm not convinced that the right model is
"replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we
*ever* want to override sealing? It sounds like complete garbage. Most
of the users seem to be things like "execve()", which is nonsensical,
since the VM wouldn't have been sealed at that point _anyway_, so
having a "don't bother checking sealing bits" flag seems entirely
useless.

Anyway, this is all a resounding NAK on this series in this form. My
complaints are not some kind of small "fix this up". These are
fundamental issues.

            Linus
Jann Horn Oct. 16, 2023, 5:34 p.m. UTC | #3
On Mon, Oct 16, 2023 at 4:38 PM <jeffxu@chromium.org> wrote:
>
> From: Jeff Xu <jeffxu@google.com>
>
> This patchset proposes a new mseal() syscall for the Linux kernel.
>
> Modern CPUs support memory permissions such as RW and NX bits. Linux has
> supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> The memory permission feature improves security stance on memory
> corruption bugs, i.e. the attacker can’t just write to arbitrary memory
> and point the code to it, the memory has to be marked with X bit, or
> else an exception will happen.
>
> Memory sealing additionally protects the mapping itself against
> modifications. This is useful to mitigate memory corruption issues where
> a corrupted pointer is passed to a memory management syscall. For example,
> such an attacker primitive can break control-flow integrity guarantees
> since read-only memory that is supposed to be trusted can become writable
> or .text pages can get remapped. Memory sealing can automatically be
> applied by the runtime loader to seal .text and .rodata pages and
> applications can additionally seal security critical data at runtime.
> A similar feature already exists in the XNU kernel with the
> VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.
>
> The new mseal() is an architecture independent syscall, and with
> following signature:
>
> mseal(void addr, size_t len, unsigned int types, unsigned int flags)

Is the plan that the VMAs you need to protect would be created and
mseal()'ed while you expect that attacker code can not (yet) be
running concurrently?

Do you expect to be using sealed memory for shadow stacks (in x86 CET
/ arm64 GCS) to prevent an attacker from mixing those up by moving
pages inside a shadow stack or between different shadow stacks or
such? (If that's even possible, I think it is but I haven't tried.)

> addr/len: memory range.  Must be continuous/allocated memory, or else
> mseal() will fail and no VMA is updated. For details on acceptable
> arguments, please refer to comments in mseal.c. Those are also fully
> covered by the selftest.
> types: bit mask to specify which syscall to seal, currently they are:
> MM_SEAL_MSEAL 0x1
> MM_SEAL_MPROTECT 0x2
> MM_SEAL_MUNMAP 0x4
> MM_SEAL_MMAP 0x8
> MM_SEAL_MREMAP 0x10

You'd probably also want to block destructive madvise() operations
that can effectively alter region contents by discarding pages and
such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED;
probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd
want to just block all madvise() for sealed VMAs? Or rename
process_madvise_behavior_valid() to something like
"madvise_is_nondestructive()" and use that.

The following comments probably mostly don't matter in practice if
this feature is used in a context that is heavily seccomp-sandboxed
(like Desktop Linux Chrome), but should maybe be addressed to make
this feature more usable for other users. (Including Android Chrome,
which has a weaker sandbox...)

FWIW, it is also possible to write to read-only memory through the
/proc/self/mem interface (or through ptrace commands like
PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel
configuration, seccomp policy, and what the LSMs do. (I think Android
Chrome would allow /proc/self/mem writes, but would block
PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)

I had a related ancient patch series in 2016 with an attempt to allow
SELinux to prevent bypassing W^X protections with this, but I never
followed through with getting that landed...
(https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh.net/)

I guess the question there is what the right semantics for this kind
of protected memory are when a debugger is active. The simple solution
that might break some debugging would be "just deny all FOLL_FORCE
write access for this memory" (which would prevent debuggers from
being able to place breakpoints, which would maybe not be great). But
maybe it makes more sense to consider this to be an independent
concern and solve it with a new SELinux feature or something like that
instead, and then document that mseal() requires some complement to
prevent forced writes to read-only private memory? (For which the
simplest solution would be "don't grant filesystem access or ptrace()
access to the sandboxed code".)


What is the intended interaction with userfaultfd, which I believe by
design permits arbitrary data into unpopulated areas of anonymous
VMAs? If the intention is that the process should otherwise be
sandboxed to not have access to userfaultfd, that should maybe be
documented. (Alternatively I guess you could theoretically remove the
VM_MAYWRITE bit from marked VMAs, but that might be more strict than
we want, since it'd also block all FOLL_FORCE writes.)


There are also some interfaces like AIO or the X86 Shadow Stacks
interface that indirectly unmap memory through the kernel and look
like they could perhaps be tricked into racily unmapping a
just-created sealed VMA. You'd probably have to make sure that they
can't do that and essentially treat their unmap operations as if they
came from userspace, I guess? What Linus just wrote.


I think either way this feature needs some documentation on what kind
of context it's supposed to run in.


> Each bit represents sealing for one specific syscall type, e.g.
> MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> is that the API is extendable, i.e. when needed, the sealing can be
> extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> The kernel will remember which seal types are applied, and the application
> doesn’t need to repeat all existing seal types in the next mseal().  Once
> a seal type is applied, it can’t be unsealed. Call mseal() on an existing
> seal type is a no-action, not a failure.
>
> MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
>
> Internally, vm_area_struct adds a new field vm_seals, to store the bit
> masks.
>
> For the affected syscalls, such as mprotect, a check(can_modify_mm) for
> sealing is added, this usually happens at the early point of the syscall,
> before any update is made to VMAs. The effect of that is: if any of the
> VMAs in the given address range fails the sealing check, none of the VMA
> will be updated. It might be worth noting that this is different from the
> rest of mprotect(), where some updates can happen even when mprotect
> returns fail. Consider can_modify_mm only checks vm_seals in
> vm_area_struct, and it is not going deeper in the page table or updating
> any HW, success or none behavior might fit better here. I would like to
> listen to the community's feedback on this.
>
> The idea that inspired this patch comes from Stephen Röttger’s work in
> V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.
>
> In addition, Stephen is working on glibc change to add sealing support
> into the dynamic linker to seal all non-writable segments at startup. When
> that work is completed, all applications can automatically benefit from
> these new protections.
>
> [1] https://kernelnewbies.org/Linux_2_6_8
> [2] https://v8.dev/blog/control-flow-integrity
> [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> [4] https://man.openbsd.org/mimmutable.2
> [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
>
> Jeff Xu (8):
>   Add mseal syscall
>   Wire up mseal syscall
>   mseal: add can_modify_mm and can_modify_vma
>   mseal: seal mprotect
>   mseal munmap
>   mseal mremap
>   mseal mmap
>   selftest mm/mseal mprotect/munmap/mremap/mmap
>
>  arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
>  arch/arm/tools/syscall.tbl                  |    1 +
>  arch/arm64/include/asm/unistd.h             |    2 +-
>  arch/arm64/include/asm/unistd32.h           |    2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |    1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |    1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
>  fs/aio.c                                    |    5 +-
>  include/linux/mm.h                          |   55 +-
>  include/linux/mm_types.h                    |    7 +
>  include/linux/syscalls.h                    |    2 +
>  include/uapi/asm-generic/unistd.h           |    5 +-
>  include/uapi/linux/mman.h                   |    6 +
>  ipc/shm.c                                   |    3 +-
>  kernel/sys_ni.c                             |    1 +
>  mm/Kconfig                                  |    8 +
>  mm/Makefile                                 |    1 +
>  mm/internal.h                               |    4 +-
>  mm/mmap.c                                   |   49 +-
>  mm/mprotect.c                               |    6 +
>  mm/mremap.c                                 |   19 +-
>  mm/mseal.c                                  |  328 +++++
>  mm/nommu.c                                  |    6 +-
>  mm/util.c                                   |    8 +-
>  tools/testing/selftests/mm/Makefile         |    1 +
>  tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
>  37 files changed, 1934 insertions(+), 28 deletions(-)
>  create mode 100644 mm/mseal.c
>  create mode 100644 tools/testing/selftests/mm/mseal_test.c
>
> --
> 2.42.0.609.gbb76f46606-goog
>
Jeff Xu Oct. 17, 2023, 8:34 a.m. UTC | #4
Hi Matthew.

Thanks for your comments and time to review the patchset.

On Mon, Oct 16, 2023 at 8:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
>
> This seems like a confusing way to introduce the subject.  Here, you're
> talking about page permissions, whereas (as far as I can tell), mseal() is
> about making _virtual_ addresses immutable, for some value of immutable.
>
> > Memory sealing additionally protects the mapping itself against
> > modifications. This is useful to mitigate memory corruption issues where
> > a corrupted pointer is passed to a memory management syscall. For example,
> > such an attacker primitive can break control-flow integrity guarantees
> > since read-only memory that is supposed to be trusted can become writable
> > or .text pages can get remapped. Memory sealing can automatically be
> > applied by the runtime loader to seal .text and .rodata pages and
> > applications can additionally seal security critical data at runtime.
> > A similar feature already exists in the XNU kernel with the
> > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> useful to understand what you're trying to do.
>
> > The new mseal() is an architecture independent syscall, and with
> > following signature:
> >
> > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> >
> > addr/len: memory range.  Must be continuous/allocated memory, or else
> > mseal() will fail and no VMA is updated. For details on acceptable
> > arguments, please refer to comments in mseal.c. Those are also fully
> > covered by the selftest.
>
> Mmm.  So when you say "continuous/allocated" what you really mean is
> "Must have contiguous VMAs" rather than "All pages in this range must
> be populated", yes?
>
There can't be a gap (unallocated memory) in the given range.
Those are covered in selftest:
test_seal_unmapped_start()
test_seal_unmapped_middle()
test_seal_unmapped_end()
The comments in check_mm_seal() also mentioned that.

> > types: bit mask to specify which syscall to seal, currently they are:
> > MM_SEAL_MSEAL 0x1
> > MM_SEAL_MPROTECT 0x2
> > MM_SEAL_MUNMAP 0x4
> > MM_SEAL_MMAP 0x8
> > MM_SEAL_MREMAP 0x10
>
> I don't understand why we want this level of granularity.  The OpenBSD
> and XNU examples just say "This must be immutable*".  For values of
> immutable that allow downgrading access (eg RW to RO or RX to RO),
> but not upgrading access (RW->RX, RO->*, RX->RW).
>
> > Each bit represents sealing for one specific syscall type, e.g.
> > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > is that the API is extendable, i.e. when needed, the sealing can be
> > extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> Honestly, it feels too flexible.  Why not just two flags to mprotect()
> -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> maybe for some things we want to be able to downgrade and for other
> things, we don't.
>
Having a seal type per syscall type helps to add the feature incrementally.
Applications also know exactly what is sealed.

I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we
can define what it seals precisely. As Jann pointed out, there have
other scenarios that potentially affect IMMUTABLE. Implementing all thoses
will take time. And if we missed a case, we could introduce backward
compatibility issues to the application. Bitmask will solve this nicely, i.e.
application will need to apply the newly added sealing type explicitly.

> I'd like to see some discussion of how this interacts with mprotect().
> As far as I can tell, the intent is to lock the protections/existance
> of the mapping, and not to force memory to stay in core.  So it's fine
> for the kernel to swap out the page and set up a PTE as a swap entry.
> It's also fine for the kernel to mark PTEs as RO to catch page faults;
> we're concerned with the LOGICAL permissions, and not the page tables.

Yes. That is correct.

-Jeff Xu
Jeff Xu Oct. 17, 2023, 8:42 a.m. UTC | #5
Hi Jann,

Thank you for reviewing the patch and comments.

On Mon, Oct 16, 2023 at 10:35 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Oct 16, 2023 at 4:38 PM <jeffxu@chromium.org> wrote:
> >
> > From: Jeff Xu <jeffxu@google.com>
> >
> > This patchset proposes a new mseal() syscall for the Linux kernel.
> >
> > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> > The memory permission feature improves security stance on memory
> > corruption bugs, i.e. the attacker can’t just write to arbitrary memory
> > and point the code to it, the memory has to be marked with X bit, or
> > else an exception will happen.
> >
> > Memory sealing additionally protects the mapping itself against
> > modifications. This is useful to mitigate memory corruption issues where
> > a corrupted pointer is passed to a memory management syscall. For example,
> > such an attacker primitive can break control-flow integrity guarantees
> > since read-only memory that is supposed to be trusted can become writable
> > or .text pages can get remapped. Memory sealing can automatically be
> > applied by the runtime loader to seal .text and .rodata pages and
> > applications can additionally seal security critical data at runtime.
> > A similar feature already exists in the XNU kernel with the
> > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
> >
> > The new mseal() is an architecture independent syscall, and with
> > following signature:
> >
> > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
>
> Is the plan that the VMAs you need to protect would be created and
> mseal()'ed while you expect that attacker code can not (yet) be
> running concurrently?
>
Yes.

> Do you expect to be using sealed memory for shadow stacks (in x86 CET
> / arm64 GCS) to prevent an attacker from mixing those up by moving
> pages inside a shadow stack or between different shadow stacks or
> such? (If that's even possible, I think it is but I haven't tried.)
>
I will defer the question to Stephen.
(I also think that is possible, if the application know where the
shadow stack is at runtime)

> > addr/len: memory range.  Must be continuous/allocated memory, or else
> > mseal() will fail and no VMA is updated. For details on acceptable
> > arguments, please refer to comments in mseal.c. Those are also fully
> > covered by the selftest.
> > types: bit mask to specify which syscall to seal, currently they are:
> > MM_SEAL_MSEAL 0x1
> > MM_SEAL_MPROTECT 0x2
> > MM_SEAL_MUNMAP 0x4
> > MM_SEAL_MMAP 0x8
> > MM_SEAL_MREMAP 0x10
>
> You'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED;
> probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd
> want to just block all madvise() for sealed VMAs? Or rename
> process_madvise_behavior_valid() to something like
> "madvise_is_nondestructive()" and use that.
>
Thanks for the suggestion. I can add madvise() later, maybe in another series.
For now, the goal of this patchset covers 4 syscalls, as wished for by
Chrome initially.

> The following comments probably mostly don't matter in practice if
> this feature is used in a context that is heavily seccomp-sandboxed
> (like Desktop Linux Chrome), but should maybe be addressed to make
> this feature more usable for other users. (Including Android Chrome,
> which has a weaker sandbox...)
>
> FWIW, it is also possible to write to read-only memory through the
> /proc/self/mem interface (or through ptrace commands like
> PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel
> configuration, seccomp policy, and what the LSMs do. (I think Android
> Chrome would allow /proc/self/mem writes, but would block
> PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)
>
> I had a related ancient patch series in 2016 with an attempt to allow
> SELinux to prevent bypassing W^X protections with this, but I never
> followed through with getting that landed...
> (https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh.net/)
>
> I guess the question there is what the right semantics for this kind
> of protected memory are when a debugger is active. The simple solution
> that might break some debugging would be "just deny all FOLL_FORCE
> write access for this memory" (which would prevent debuggers from
> being able to place breakpoints, which would maybe not be great). But
> maybe it makes more sense to consider this to be an independent
> concern and solve it with a new SELinux feature or something like that
> instead, and then document that mseal() requires some complement to
> prevent forced writes to read-only private memory? (For which the
> simplest solution would be "don't grant filesystem access or ptrace()
> access to the sandboxed code".)
>
> What is the intended interaction with userfaultfd, which I believe by
> design permits arbitrary data into unpopulated areas of anonymous
> VMAs? If the intention is that the process should otherwise be
> sandboxed to not have access to userfaultfd, that should maybe be
> documented. (Alternatively I guess you could theoretically remove the
> VM_MAYWRITE bit from marked VMAs, but that might be more strict than
> we want, since it'd also block all FOLL_FORCE writes.)
>
>
> There are also some interfaces like AIO or the X86 Shadow Stacks
> interface that indirectly unmap memory through the kernel and look
> like they could perhaps be tricked into racily unmapping a
> just-created sealed VMA. You'd probably have to make sure that they
> can't do that and essentially treat their unmap operations as if they
> came from userspace, I guess? What Linus just wrote.
>
>
> I think either way this feature needs some documentation on what kind
> of context it's supposed to run in.
>
Thanks for listing all the cases. As you pointed out, Chrome is
heavily sandboxed, with syscalls and file access. I will work with Stephan
to check if some of these applied to Chrome.

It is probably worth to mention, with the current approach of mseal(),
i.e. by bitmask,  it is possible to implement those above incrementally.

Thanks
-Jeff



-Jeff


>
> > Each bit represents sealing for one specific syscall type, e.g.
> > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > is that the API is extendable, i.e. when needed, the sealing can be
> > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > The kernel will remember which seal types are applied, and the application
> > doesn’t need to repeat all existing seal types in the next mseal().  Once
> > a seal type is applied, it can’t be unsealed. Call mseal() on an existing
> > seal type is a no-action, not a failure.
> >
> > MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
> >
> > Internally, vm_area_struct adds a new field vm_seals, to store the bit
> > masks.
> >
> > For the affected syscalls, such as mprotect, a check(can_modify_mm) for
> > sealing is added, this usually happens at the early point of the syscall,
> > before any update is made to VMAs. The effect of that is: if any of the
> > VMAs in the given address range fails the sealing check, none of the VMA
> > will be updated. It might be worth noting that this is different from the
> > rest of mprotect(), where some updates can happen even when mprotect
> > returns fail. Consider can_modify_mm only checks vm_seals in
> > vm_area_struct, and it is not going deeper in the page table or updating
> > any HW, success or none behavior might fit better here. I would like to
> > listen to the community's feedback on this.
> >
> > The idea that inspired this patch comes from Stephen Röttger’s work in
> > V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.
> >
> > In addition, Stephen is working on glibc change to add sealing support
> > into the dynamic linker to seal all non-writable segments at startup. When
> > that work is completed, all applications can automatically benefit from
> > these new protections.
> >
> > [1] https://kernelnewbies.org/Linux_2_6_8
> > [2] https://v8.dev/blog/control-flow-integrity
> > [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> > [4] https://man.openbsd.org/mimmutable.2
> > [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
> >
> > Jeff Xu (8):
> >   Add mseal syscall
> >   Wire up mseal syscall
> >   mseal: add can_modify_mm and can_modify_vma
> >   mseal: seal mprotect
> >   mseal munmap
> >   mseal mremap
> >   mseal mmap
> >   selftest mm/mseal mprotect/munmap/mremap/mmap
> >
> >  arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
> >  arch/arm/tools/syscall.tbl                  |    1 +
> >  arch/arm64/include/asm/unistd.h             |    2 +-
> >  arch/arm64/include/asm/unistd32.h           |    2 +
> >  arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
> >  arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
> >  arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
> >  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
> >  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
> >  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
> >  arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
> >  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
> >  arch/s390/kernel/syscalls/syscall.tbl       |    1 +
> >  arch/sh/kernel/syscalls/syscall.tbl         |    1 +
> >  arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
> >  arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
> >  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
> >  fs/aio.c                                    |    5 +-
> >  include/linux/mm.h                          |   55 +-
> >  include/linux/mm_types.h                    |    7 +
> >  include/linux/syscalls.h                    |    2 +
> >  include/uapi/asm-generic/unistd.h           |    5 +-
> >  include/uapi/linux/mman.h                   |    6 +
> >  ipc/shm.c                                   |    3 +-
> >  kernel/sys_ni.c                             |    1 +
> >  mm/Kconfig                                  |    8 +
> >  mm/Makefile                                 |    1 +
> >  mm/internal.h                               |    4 +-
> >  mm/mmap.c                                   |   49 +-
> >  mm/mprotect.c                               |    6 +
> >  mm/mremap.c                                 |   19 +-
> >  mm/mseal.c                                  |  328 +++++
> >  mm/nommu.c                                  |    6 +-
> >  mm/util.c                                   |    8 +-
> >  tools/testing/selftests/mm/Makefile         |    1 +
> >  tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
> >  37 files changed, 1934 insertions(+), 28 deletions(-)
> >  create mode 100644 mm/mseal.c
> >  create mode 100644 tools/testing/selftests/mm/mseal_test.c
> >
> > --
> > 2.42.0.609.gbb76f46606-goog
> >
Jeff Xu Oct. 17, 2023, 9:07 a.m. UTC | #6
Hello Linus,

Thank you for the time reviewing this patch set.

On Mon, Oct 16, 2023 at 10:23 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 16 Oct 2023 at 07:38, <jeffxu@chromium.org> wrote:
> >
> > This patchset proposes a new mseal() syscall for the Linux kernel.
>
> So I have no objections to adding some kind of "lock down memory
> mappings" model, but this isn't it.
>
> First off, the simple stuff: the commit messages are worthless. Having
>
>    check seal for mmap(2)
>
> as the commit message is not even remotely acceptable, to pick one
> random example from the series (7/8).
>
> But that doesn't matter much, since I think the more fundamental
> problems are much worse:
>
>  - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is
> just complete noise and totally illogical. The whole concept needs to
> be redone.
>
> Example of complete nonsense (again, picking 7/8, but that's again
> just a random example):
>
> > @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages,
> >                 flags |= MAP_LOCKED;
> >
> >         file = get_file(vma->vm_file);
> > -       ret = do_mmap(vma->vm_file, start, size,
> > -                       prot, flags, pgoff, &populate, NULL);
> > +       ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
> > +                       &populate, NULL, ON_BEHALF_OF_KERNEL);
> >         fput(file);
> >  out:
> >         mmap_write_unlock(mm);
>
> Christ. That's *literally* the remap_file_pages() system call
> definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense
> in this context.
>
> It's not the only situation. "mremap() as the same thing. vm_munmap()
> has the same thing. vm_brk_flags() has the same thing. None of them
> make any sense.
>
> But even if those obvious kinds of complete mis-designs were to be
> individually fixed, the whole naming and concept is bogus. The
> "ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check
> sealing". Naming should describe what the thing *means*, not some
> random policy thing that may or may not be at all relevant.
>

I apologize that I didn't think of a better name for ON_BEHALF_OF_XX
and I should have written a more clear commit message.

I prepared a V2 patchset with a more detailed commit message, hopefully
that will help to explain the design.

Indeed, the ON_BEHALF_OF_XX is confusing, especially with
remap_file_pathes(). remap_file_pathes(2) is not supported
in this patch set, this covers mprotect(2), mmap(2), munmap(2),
mremap(2) as the first feature set. I could extend the sealing
to more syscalls, if it is determined necessary from the outcome of
this discussion. The initial set of 4 syscalls was chosen based
on Chrome's initial wish list.

Regarding the ON_BEHALF_OF flag, my intention is to have a flag,
set at syscall entry points of mprotect(2), munmap(2), mremap(2),
mmap(2),  pass the flag along the call stack, till reaching
can_modify_mm(), can_modify_mm() does the actual check for
the sealing type.

It is probably worth noting that I choose to check one and only
one sealing type per syscall. i.e. munmap(2) checks
MM_SEAL_MUNMAP only. With this approach, sealing can be
implemented incrementally. For example, When implementing
sealing for munmap(2), I don't need to care that mremap(2)
can also call internal functions to unmap the VMAs.
The mremap(2) will be sealed by MM_SEAL_MREMAP(), and
be dealt with separately.

This approach also allows dev to expand the sealing to madvice(),
mlock(), or whatever syscalls or cases that modify VMA's meta data.
As Yann points out, There is a list of cases that we might care about.
Having all of those implemented will take time. Using bitmasks will
help to add those incrementally. An application will  be backward
compatible when a new sealing type is added, i.e. It has to set the
new sealing type explicitly.

>  - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away.
>
> Not only is it unacceptable to pointlessly create two different name
> spaces for no obvious reason, code like this (from 1/8) should not
> exist:
>
> > +       if (types & MM_SEAL_MSEAL)
> > +               newtypes |= VM_SEAL_MSEAL;
> > +
> > +       if (types & MM_SEAL_MPROTECT)
> > +               newtypes |= VM_SEAL_MPROTECT;
> > +
> > +       if (types & MM_SEAL_MUNMAP)
> > +               newtypes |= VM_SEAL_MUNMAP;
> > +
> > +       if (types & MM_SEAL_MMAP)
> > +               newtypes |= VM_SEAL_MMAP;
> > +
> > +       if (types & MM_SEAL_MREMAP)
> > +               newtypes |= VM_SEAL_MREMAP;
>
> Sure, we have that in some cases when there was a *reason* for
> namespacing imposed on us from an API standpoint (ie the "open()"
> flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags
> being turned into MNT_xyz flags), but those tend to be signs of
> problems in the system call API where some mixup made it impossible to
> use the flags directly (most commonly because there is some extended
> internal representation of said things).
>
> For example, the MS_xyz namespace is a combination of "these are flags
> for the new mount" (like MS_RDONLY) and "this is how you should mount
> it" (like MS_REMOUNT), and so the user interface has that odd mixing
> of different things, and the MNT_xyz namespace is a distillation of
> the former.
>
> But we certainly should not strive to introduce *new* interfaces that
> start out with this kind of mixup and pointless "translate from one
> bit to another" code.
>
The two namespaces can go away, that means the bitmap will be stored
as is by vm_seals in VMA struct. (1/8) Copied below.

+++ b/include/linux/mm_types.h
@@ -660,6 +660,13 @@ struct vm_area_struct {
+       unsigned long vm_seals;         /* seal flags, see mm.h. */

My original considerations are:
1. vm_seals is a new field, and mseal() currently uses 5 bits.
vm_seals can be repurposed to store other VMA flags in future,
having two namespaces (API and internal one) can be useful.
2. vm_flags is full and it seems to me there is pending work on expanding
vm_flags. [1]  if that happens, we will need translation logic.
[1] https://lore.kernel.org/linux-mm/4F6CA298.4000301@jp.fujitsu.com/

I might have over-engineered this, so I removed the VM_SEAL_XX in V2.

>  - And finally (for now), I hate that MM_ACTION_xyz thing too!
>
> Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8):
>
> > +       switch (action) {
> > +       case MM_ACTION_MPROTECT:
> > +               if (vma->vm_seals & VM_SEAL_MPROTECT)
> > +                       return false;
> > +               break;
>
> when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and
> sealing bitmask and just say
>
>         if (vma->vm_seal & vm_action)
>                 return -EPERM;
>

Make sense. My original thought is that can_modify_vma() will check one
and only one seal type, and having an enum type will enforce that.  This
restriction feels unnecessary. I removed the action type in V2.

> IOW, you have pointlessly introduced not *two* different namespaces,
> but *three*. All doing the exact same thing, and all just causing
> pointless and ugly code to "translate" the actions of one into the
> model of another.
>
> So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz"
> thing. Use *one* single "these are the sealing bits".
>
> And get rid of "enum caller_origin" entirely. I don't know what the
> right model for that thing is, but that isn't it.
>
> *Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space
> cannot set, but that the kernel can use internally - and if that is
> the right model, then dammit, the *uses* should be very very obvious
> why the override is fine, because that remap_file_pages() use sure as
> hell was *not* obvious.
>
> In fact, it's not at all obvious why anything should override the
> sealing bits - EVER. So I'm not convinced that the right model is
> "replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we
> *ever* want to override sealing? It sounds like complete garbage. Most
> of the users seem to be things like "execve()", which is nonsensical,
> since the VM wouldn't have been sealed at that point _anyway_, so
> having a "don't bother checking sealing bits" flag seems entirely
> useless.
>
Would the new commit message and comments in V2 help to
explain the design better ? (will send shortly)

Another code change I can make to help the readability (not in v2),
is to set and pass checkSeals flag from syscall entry point, all the
way to can_modify_vma(). Currently, I didn't do that if I checked the
internal function is only used by syscal entry point, e.g. in
do_mprotect_pkey(), mremap_to(), ksys_mmap_pgoff() cases.
Doing that does increase the size of the patch set though.

Thanks.
-Jeff


-Jeff

> Anyway, this is all a resounding NAK on this series in this form. My
> complaints are not some kind of small "fix this up". These are
> fundamental issues.
>
>             Linus
Matthew Wilcox (Oracle) Oct. 17, 2023, 12:56 p.m. UTC | #7
On Tue, Oct 17, 2023 at 01:34:35AM -0700, Jeff Xu wrote:
> > > types: bit mask to specify which syscall to seal, currently they are:
> > > MM_SEAL_MSEAL 0x1
> > > MM_SEAL_MPROTECT 0x2
> > > MM_SEAL_MUNMAP 0x4
> > > MM_SEAL_MMAP 0x8
> > > MM_SEAL_MREMAP 0x10
> >
> > I don't understand why we want this level of granularity.  The OpenBSD
> > and XNU examples just say "This must be immutable*".  For values of
> > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > but not upgrading access (RW->RX, RO->*, RX->RW).
> >
> > > Each bit represents sealing for one specific syscall type, e.g.
> > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > is that the API is extendable, i.e. when needed, the sealing can be
> > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > maybe for some things we want to be able to downgrade and for other
> > things, we don't.
> >
> Having a seal type per syscall type helps to add the feature incrementally.
> Applications also know exactly what is sealed.

You're trying to portray a disadvantage as an advantage,  This is the
seccomp disease, only worse because you're trying to deny individual
syscalls instead of building up a list of permitted syscalls.  If we
introduce a new syscall tomorrow which can affect VMAs, we'll then
make it the application's fault for not disabling that new syscall.
That's terrible design!

> I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we
> can define what it seals precisely. As Jann pointed out, there have
> other scenarios that potentially affect IMMUTABLE. Implementing all thoses
> will take time. And if we missed a case, we could introduce backward
> compatibility issues to the application. Bitmask will solve this nicely, i.e.
> application will need to apply the newly added sealing type explicitly.

It is your job to do this.  You seem to have taken the attitude that you
will give the Chrome team exactly what they asked for instead of trying
to understand their requirements and give them what they need.

And don't send a v2 before discussion of v1 has come to an end.  It
uselessly fragments the discussion.
Pedro Falcato Oct. 17, 2023, 3:29 p.m. UTC | #8
On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
>
> This seems like a confusing way to introduce the subject.  Here, you're
> talking about page permissions, whereas (as far as I can tell), mseal() is
> about making _virtual_ addresses immutable, for some value of immutable.
>
> > Memory sealing additionally protects the mapping itself against
> > modifications. This is useful to mitigate memory corruption issues where
> > a corrupted pointer is passed to a memory management syscall. For example,
> > such an attacker primitive can break control-flow integrity guarantees
> > since read-only memory that is supposed to be trusted can become writable
> > or .text pages can get remapped. Memory sealing can automatically be
> > applied by the runtime loader to seal .text and .rodata pages and
> > applications can additionally seal security critical data at runtime.
> > A similar feature already exists in the XNU kernel with the
> > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> useful to understand what you're trying to do.
>
> > The new mseal() is an architecture independent syscall, and with
> > following signature:
> >
> > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> >
> > addr/len: memory range.  Must be continuous/allocated memory, or else
> > mseal() will fail and no VMA is updated. For details on acceptable
> > arguments, please refer to comments in mseal.c. Those are also fully
> > covered by the selftest.
>
> Mmm.  So when you say "continuous/allocated" what you really mean is
> "Must have contiguous VMAs" rather than "All pages in this range must
> be populated", yes?
>
> > types: bit mask to specify which syscall to seal, currently they are:
> > MM_SEAL_MSEAL 0x1
> > MM_SEAL_MPROTECT 0x2
> > MM_SEAL_MUNMAP 0x4
> > MM_SEAL_MMAP 0x8
> > MM_SEAL_MREMAP 0x10
>
> I don't understand why we want this level of granularity.  The OpenBSD
> and XNU examples just say "This must be immutable*".  For values of
> immutable that allow downgrading access (eg RW to RO or RX to RO),
> but not upgrading access (RW->RX, RO->*, RX->RW).
>
> > Each bit represents sealing for one specific syscall type, e.g.
> > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > is that the API is extendable, i.e. when needed, the sealing can be
> > extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> Honestly, it feels too flexible.  Why not just two flags to mprotect()
> -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> maybe for some things we want to be able to downgrade and for other
> things, we don't.

I think it's worth pointing out that this suggestion (with PROT_*)
could easily integrate with mmap() and as such allow for one-shot
mmap() + mseal().
If we consider the common case as 'addr = mmap(...); mseal(addr);', it
definitely sounds like a performance win as we halve the number of
syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
like common patterns.
Linus Torvalds Oct. 17, 2023, 5:22 p.m. UTC | #9
On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
>
> It is probably worth noting that I choose to check one and only
> one sealing type per syscall. i.e. munmap(2) checks
> MM_SEAL_MUNMAP only.

Yeah, this is wrong.

It's wrong exactly because other system calls will unmap things too.

Using mmap() to over-map something will unmap the old one.

Same goes for mremap() to move over an existing mapping.

So the whole "do things by the name of the system call" is not workable.

All that matters is what the system calls *do*, not what their name is.

And mmap() will fundamentally munmap() as part of the action.

This is why I absolutely hated the old "ON_BEHALF_OF_xyz" flag, and
why I still absolutely hate the "randomly pass different sealing flags
fto do_munmap()".

You should *not* be passing any flags at all to do_munmap(). Because
*regardless* of who calls it, and regardless of which system call
started the action, do_munmap() unmaps a virtual memory area.

See what I'm saying?

                 Linus
Theo de Raadt Oct. 17, 2023, 6:20 p.m. UTC | #10
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
> >
> > It is probably worth noting that I choose to check one and only
> > one sealing type per syscall. i.e. munmap(2) checks
> > MM_SEAL_MUNMAP only.
> 
> Yeah, this is wrong.
> 
> It's wrong exactly because other system calls will unmap things too.
> 
> Using mmap() to over-map something will unmap the old one.
> 
> Same goes for mremap() to move over an existing mapping.
> 
> So the whole "do things by the name of the system call" is not workable.
> 
> All that matters is what the system calls *do*, not what their name is.

I agree completely...

mseal() is a clone of mimmutable(2), but with an extremely
over-complicated API based upon dubious arguments.

I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all
the components working correctly.  There were many intermediate API
during development, but in the end the API is simply:

     int mimmutable(void *addr, size_t len);

The kernel code for mimmutable() traverses the specified VA range.  In
that range, it will find unmapped sub-regions (which are are ignored)
and mapped sub-regions. For these mapped regions, it does not care what
the permissions are, it just marks each sub-region as immutable.

Later on, when any VM operation request upon a VA range attempts to
      (1) change the permissions
      (2) to re-map on top
      (3) or dispose of the mapping,
that operation is refused with errno EPERM.  We don't care where the
request comes from (ie. what system call).  It is a behaviour of the
VM system, when asked to act upon a VA sub-range mapping.

Very simple semantics.

The only case where the immutable marker is ignored is during address space
teardown as a result of process termination.


In his submission of this API, Jeff Xu makes three claims I find dubious;

> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.

I specifically designed mimmutable(2) with chrome in mind, and the
chrome binary running on OpenBSD is full of immutable mappings.  All the
library regions automatically become immutable because ld.so can infer
it and do the mimmutable calls for the right subregions.

So this chrome work has already been done by OpenBSD, and it is dead
simple.  During early development I thought mimmutable(2) would be
called by applications or libraries, but I was dead wrong: 99.9% of
calls are from ld.so, and no applications need to call it, these are the
two exceptions:

In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data
structures at initialization time, so they canoot be attacked to create
an invariant for use in ROP return-to-libc style methods.

In Chrome, there is a v8_flags variable rounded out to a full page, and
placed in .data.  Chrome initialized this variable, and wants to mprotect
PROT_READ, but .data has been made immutable by ld.so.  So we force this
page into a new ELF section called "openbsd.mutable" which also behaves RW
like .data.  Where chrome does the mprotect  PROT_READ, it now also performs
mimmutable() on that page.

> Having a seal type per syscall type helps to add the feature incrementally.

Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our
entire base operating system and 10,000+ applications automatically receive
the benefits.  In one year's effort.  The only application which cared about
it was chrome, described in the previous paragraph.

I think Jeff's idea here is super dangerous.  What will actually happen
is people will add a few mseal() sub-operations and think the job is done.
It isn't done.  They need all the mseal() requests, or the mapping are
not safe.

It is very counterproductive to provide developers a complex API that has
insecure suboperations.

> Applications also know exactly what is sealed.

Actually applicatins won't know because there is no tooling to inspect this --
but I will argue further that applications don't need to know.  Immutable
marking is a system decision, not a program decision.


I'll close by asking for a new look at the mimmutable(2) API we settled
on for OpenBSD.  I think there is nothing wrong with it.  I'm willing to
help guide glibc / ld.so / musl teams through the problems they may find
along the way, I know where the skeletons are buried.  Two in
particular: -znow RELRO already today, and xonly-text in the future.


[1] https://man.openbsd.org/mimmutable.2
Linus Torvalds Oct. 17, 2023, 6:38 p.m. UTC | #11
On Tue, 17 Oct 2023 at 11:20, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> The only case where the immutable marker is ignored is during address space
> teardown as a result of process termination.

.. and presumably also execve()?

I do like us starting with just "mimmutable()", since it already
exists. Particularly if chrome already knows how to use it.

Maybe add a flag field (require it to be zero initially) just to allow
any future expansion. Maybe the chrome team has *wanted* to have some
finer granularity thing and currently doesn't use mimmutable() in some
case?

                  Linus
Theo de Raadt Oct. 17, 2023, 6:55 p.m. UTC | #12
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Oct 2023 at 11:20, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > The only case where the immutable marker is ignored is during address space
> > teardown as a result of process termination.
> 
> .. and presumably also execve()?

Ah yes of course that also.

> I do like us starting with just "mimmutable()", since it already
> exists. Particularly if chrome already knows how to use it.

Well, our chrome fork knows how to use it.  Robert Nagy in our group maintains
1280 patches to make chrome work on OpenBSD.  Google ignores them and will not
upstream them.  Some of these changes are security related, and they ignore
them.  Other changes are to cope with security work we've done on our own,
for example: JIT changes from Stephen@google for mandatory IBT which google
hasn't upstreamed yet, impacts due to PROT_EXEC-only mappings, etc.

But the only chrome diff required for mimmutable is for that v8_flags thing
I described.   And the same issue would need handling for mseal().  Introducing
the new "mutable" ELF section is probably going to be a bigger fuss than the
system call after mprotect(PROT_READ)....

> Maybe add a flag field (require it to be zero initially) just to allow
> any future expansion. Maybe the chrome team has *wanted* to have some
> finer granularity thing and currently doesn't use mimmutable() in some
> case?

There's only one feature I can think of, and we already do it right now,
documented in our manual page:

CAVEATS
     At present, mprotect(2) may reduce permissions on immutable pages marked
     PROT_READ | PROT_WRITE to the less permissive PROT_READ.  This one-way
     operation is permitted for an introductory period to observe how software
     uses this mechanism.  It may change to require explicit mutable region
     annotation with __attribute__((section(".openbsd.mutable"))) and explicit
     calls to mimmutable().

We had something which needed this behaviour during the development
transition.  It is exlusively mprotect RW -> R, no other transitions
allowed.

But once the transition was done, we don't need it anymore.  I want to
delete it, because it is a bit of a trap.  It still fails closed from an
attack perspective, but...

What worries me is a piece of code reached by mistake can do a
mprotect(lowering), not receive -1 EPERM, and carry on running..  I'd
prefer the first time you touch a mapping in the wrong way, you receive
indication of error.  This only applies applies to mprotect() acting up
on a region so the argument is a bit weak, due to mprotect() return
value checking being about as rare as unicorns.

Also, it would be a pain for OpenBSD to transition to adding a 0 flag.
I would need to see real cause not theory.
Jeff Xu Oct. 17, 2023, 9:33 p.m. UTC | #13
On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> >
> > This seems like a confusing way to introduce the subject.  Here, you're
> > talking about page permissions, whereas (as far as I can tell), mseal() is
> > about making _virtual_ addresses immutable, for some value of immutable.
> >
> > > Memory sealing additionally protects the mapping itself against
> > > modifications. This is useful to mitigate memory corruption issues where
> > > a corrupted pointer is passed to a memory management syscall. For example,
> > > such an attacker primitive can break control-flow integrity guarantees
> > > since read-only memory that is supposed to be trusted can become writable
> > > or .text pages can get remapped. Memory sealing can automatically be
> > > applied by the runtime loader to seal .text and .rodata pages and
> > > applications can additionally seal security critical data at runtime.
> > > A similar feature already exists in the XNU kernel with the
> > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > > patchset has been designed to be compatible with the Chrome use case.
> >
> > This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> > useful to understand what you're trying to do.
> >
> > > The new mseal() is an architecture independent syscall, and with
> > > following signature:
> > >
> > > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> > >
> > > addr/len: memory range.  Must be continuous/allocated memory, or else
> > > mseal() will fail and no VMA is updated. For details on acceptable
> > > arguments, please refer to comments in mseal.c. Those are also fully
> > > covered by the selftest.
> >
> > Mmm.  So when you say "continuous/allocated" what you really mean is
> > "Must have contiguous VMAs" rather than "All pages in this range must
> > be populated", yes?
> >
> > > types: bit mask to specify which syscall to seal, currently they are:
> > > MM_SEAL_MSEAL 0x1
> > > MM_SEAL_MPROTECT 0x2
> > > MM_SEAL_MUNMAP 0x4
> > > MM_SEAL_MMAP 0x8
> > > MM_SEAL_MREMAP 0x10
> >
> > I don't understand why we want this level of granularity.  The OpenBSD
> > and XNU examples just say "This must be immutable*".  For values of
> > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > but not upgrading access (RW->RX, RO->*, RX->RW).
> >
> > > Each bit represents sealing for one specific syscall type, e.g.
> > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > is that the API is extendable, i.e. when needed, the sealing can be
> > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > maybe for some things we want to be able to downgrade and for other
> > things, we don't.
>
> I think it's worth pointing out that this suggestion (with PROT_*)
> could easily integrate with mmap() and as such allow for one-shot
> mmap() + mseal().
> If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> definitely sounds like a performance win as we halve the number of
> syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> like common patterns.
>
Yes. mmap() can support sealing as well, and memory is allocated as
immutable from begining.
This is orthogonal to mseal() though.
In case of ld.so, iiuc, memory can be first allocated as W, then later
changed to RO, for example, during symbol resolution.
The important point is that the application can decide what type of
sealing it wants, and when to apply it.  There needs to be an api(),
that can be mseal() or mprotect2() or mimmutable(), the naming is not
important to me.

mprotect() in linux have the following signature:
int mprotect(void addr[.len], size_t len, int prot);
the prot bitmasks are all taken here.
I have not checked the prot field in mmap(), there might be bits left,
even not, we could have mmap2(), so that is not an issue.

Thanks
-Jeff

> --
> Pedro
Pedro Falcato Oct. 17, 2023, 10:35 p.m. UTC | #14
On Tue, Oct 17, 2023 at 10:34 PM Jeff Xu <jeffxu@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > > > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > > > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> > >
> > > This seems like a confusing way to introduce the subject.  Here, you're
> > > talking about page permissions, whereas (as far as I can tell), mseal() is
> > > about making _virtual_ addresses immutable, for some value of immutable.
> > >
> > > > Memory sealing additionally protects the mapping itself against
> > > > modifications. This is useful to mitigate memory corruption issues where
> > > > a corrupted pointer is passed to a memory management syscall. For example,
> > > > such an attacker primitive can break control-flow integrity guarantees
> > > > since read-only memory that is supposed to be trusted can become writable
> > > > or .text pages can get remapped. Memory sealing can automatically be
> > > > applied by the runtime loader to seal .text and .rodata pages and
> > > > applications can additionally seal security critical data at runtime.
> > > > A similar feature already exists in the XNU kernel with the
> > > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > > > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > > > patchset has been designed to be compatible with the Chrome use case.
> > >
> > > This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> > > useful to understand what you're trying to do.
> > >
> > > > The new mseal() is an architecture independent syscall, and with
> > > > following signature:
> > > >
> > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> > > >
> > > > addr/len: memory range.  Must be continuous/allocated memory, or else
> > > > mseal() will fail and no VMA is updated. For details on acceptable
> > > > arguments, please refer to comments in mseal.c. Those are also fully
> > > > covered by the selftest.
> > >
> > > Mmm.  So when you say "continuous/allocated" what you really mean is
> > > "Must have contiguous VMAs" rather than "All pages in this range must
> > > be populated", yes?
> > >
> > > > types: bit mask to specify which syscall to seal, currently they are:
> > > > MM_SEAL_MSEAL 0x1
> > > > MM_SEAL_MPROTECT 0x2
> > > > MM_SEAL_MUNMAP 0x4
> > > > MM_SEAL_MMAP 0x8
> > > > MM_SEAL_MREMAP 0x10
> > >
> > > I don't understand why we want this level of granularity.  The OpenBSD
> > > and XNU examples just say "This must be immutable*".  For values of
> > > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > > but not upgrading access (RW->RX, RO->*, RX->RW).
> > >
> > > > Each bit represents sealing for one specific syscall type, e.g.
> > > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > > is that the API is extendable, i.e. when needed, the sealing can be
> > > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> > >
> > > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > > maybe for some things we want to be able to downgrade and for other
> > > things, we don't.
> >
> > I think it's worth pointing out that this suggestion (with PROT_*)
> > could easily integrate with mmap() and as such allow for one-shot
> > mmap() + mseal().
> > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > definitely sounds like a performance win as we halve the number of
> > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > like common patterns.
> >
> Yes. mmap() can support sealing as well, and memory is allocated as
> immutable from begining.
> This is orthogonal to mseal() though.

I don't see how this can be orthogonal to mseal().
In the case we opt for adding PROT_ bits, we should more or less only
need to adapt calc_vm_prot_bits(), and the rest should work without
issues.
vma merging won't merge vmas with different prots. The current
interfaces (mmap and mprotect) would work just fine.
In this case, mseal() or mimmutable() would only be needed if you need
to set immutability over a range of VMAs with different permissions.

Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
The only annoying wrench in my plans here is that we have effectively
run out of vm_flags bits in 32-bit architectures, so this approach as
I described is not compatible with 32-bit.

> In case of ld.so, iiuc, memory can be first allocated as W, then later
> changed to RO, for example, during symbol resolution.
> The important point is that the application can decide what type of
> sealing it wants, and when to apply it.  There needs to be an api(),
> that can be mseal() or mprotect2() or mimmutable(), the naming is not
> important to me.
>
> mprotect() in linux have the following signature:
> int mprotect(void addr[.len], size_t len, int prot);
> the prot bitmasks are all taken here.
> I have not checked the prot field in mmap(), there might be bits left,
> even not, we could have mmap2(), so that is not an issue.

I don't see what you mean. We have plenty of prot bits left (32-bits,
and we seem to have around 8 different bits used).
And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)

The only issue seems to be that 32-bit ran out of vm_flags, but that
can probably be worked around if need be.
Jeff Xu Oct. 17, 2023, 11:01 p.m. UTC | #15
On Tue, Oct 17, 2023 at 11:20 AM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
> > >
> > > It is probably worth noting that I choose to check one and only
> > > one sealing type per syscall. i.e. munmap(2) checks
> > > MM_SEAL_MUNMAP only.
> >
> > Yeah, this is wrong.
> >
> > It's wrong exactly because other system calls will unmap things too.
> >
> > Using mmap() to over-map something will unmap the old one.
> >
> > Same goes for mremap() to move over an existing mapping.
> >
> > So the whole "do things by the name of the system call" is not workable.
> >
> > All that matters is what the system calls *do*, not what their name is.
>
> I agree completely...
>
> mseal() is a clone of mimmutable(2), but with an extremely
> over-complicated API based upon dubious arguments.
>
> I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all
> the components working correctly.  There were many intermediate API
> during development, but in the end the API is simply:
>
>      int mimmutable(void *addr, size_t len);
>
> The kernel code for mimmutable() traverses the specified VA range.  In
> that range, it will find unmapped sub-regions (which are are ignored)
> and mapped sub-regions. For these mapped regions, it does not care what
> the permissions are, it just marks each sub-region as immutable.
>
> Later on, when any VM operation request upon a VA range attempts to
>       (1) change the permissions
>       (2) to re-map on top
>       (3) or dispose of the mapping,
> that operation is refused with errno EPERM.  We don't care where the
> request comes from (ie. what system call).  It is a behaviour of the
> VM system, when asked to act upon a VA sub-range mapping.
>
> Very simple semantics.
>
> The only case where the immutable marker is ignored is during address space
> teardown as a result of process termination.
>
May I ask, for BSD's implementation of immutable(), do you cover
things such as mlock(),
madvice() ? or just the protection bit (WRX) + remap() + unmap().

In other words:
Is BSD's definition of immutable equivalent to
MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?

I hesitate to introduce the concept of immutable into linux because I don't know
all the scenarios present in linux where VMAs's metadata can be
modified. As Jann's email pointed out,
There could be quite a few things we still need to deal with, to
completely block the possibility,
e.g. malicious code attempting to write to a RO memory or change RW
memory to RWX.

If, as part of immutable, I also block madvice(), mlock(), which also updates
VMA's metadata, so by definition, I could.  What if the user wants the
features in
madvice() and at the same time, also wants their .text protected ?

Also, if linux introduces a new syscall that depends on a new metadata of VMA,
say msecret(), (for discussion purpose), should immutable
automatically support that ?

Without those questions answered, I couldn't choose the route of
immutable() yet.

-Jeff



>
> In his submission of this API, Jeff Xu makes three claims I find dubious;
>
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> I specifically designed mimmutable(2) with chrome in mind, and the
> chrome binary running on OpenBSD is full of immutable mappings.  All the
> library regions automatically become immutable because ld.so can infer
> it and do the mimmutable calls for the right subregions.
>
> So this chrome work has already been done by OpenBSD, and it is dead
> simple.  During early development I thought mimmutable(2) would be
> called by applications or libraries, but I was dead wrong: 99.9% of
> calls are from ld.so, and no applications need to call it, these are the
> two exceptions:
>
> In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data
> structures at initialization time, so they canoot be attacked to create
> an invariant for use in ROP return-to-libc style methods.
>
> In Chrome, there is a v8_flags variable rounded out to a full page, and
> placed in .data.  Chrome initialized this variable, and wants to mprotect
> PROT_READ, but .data has been made immutable by ld.so.  So we force this
> page into a new ELF section called "openbsd.mutable" which also behaves RW
> like .data.  Where chrome does the mprotect  PROT_READ, it now also performs
> mimmutable() on that page.
>
> > Having a seal type per syscall type helps to add the feature incrementally.
>
> Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our
> entire base operating system and 10,000+ applications automatically receive
> the benefits.  In one year's effort.  The only application which cared about
> it was chrome, described in the previous paragraph.
>
> I think Jeff's idea here is super dangerous.  What will actually happen
> is people will add a few mseal() sub-operations and think the job is done.
> It isn't done.  They need all the mseal() requests, or the mapping are
> not safe.
>
> It is very counterproductive to provide developers a complex API that has
> insecure suboperations.
>
> > Applications also know exactly what is sealed.
>
> Actually applicatins won't know because there is no tooling to inspect this --
> but I will argue further that applications don't need to know.  Immutable
> marking is a system decision, not a program decision.
>
>
> I'll close by asking for a new look at the mimmutable(2) API we settled
> on for OpenBSD.  I think there is nothing wrong with it.  I'm willing to
> help guide glibc / ld.so / musl teams through the problems they may find
> along the way, I know where the skeletons are buried.  Two in
> particular: -znow RELRO already today, and xonly-text in the future.
>
>
> [1] https://man.openbsd.org/mimmutable.2
>
Theo de Raadt Oct. 17, 2023, 11:56 p.m. UTC | #16
Jeff Xu <jeffxu@google.com> wrote:

> May I ask, for BSD's implementation of immutable(), do you cover
> things such as mlock(),
> madvice() ? or just the protection bit (WRX) + remap() + unmap().

It only prevents removal of the mapping, placement of a replacement
mapping, or changing the existing permissions.  If one page in the
existing sub-region is marked immutable, the whole operation fails with
EPERM.

Those are the only user-visible aspects that an attacker cares about to
utilize in this area.

mlock() and madvise() deal with the physical memory handling underneath
the VA.  They have nothing to do with how attack code might manipulate
the VA address space inside a program to convert a series of dead-end
approaches into a succesfull escalation strategy.

[It would be very long conversation to explain where and how this has
been utilized to make an attack succesfull]

> In other words:
> Is BSD's definition of immutable equivalent to
> MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?

I can't compare it to your subsystem, because I completely fail to
understand the cause or benefit of all the complexity.

I think I've explained what mimmutable() is in extremely simple terms.

And I don't understand else you are trying to do anything beyond what
mimmutable() offers.  It seems like this is inventing additional
solutions without proof that any of them are necessary to solve the
specific problem that is known.

> I hesitate to introduce the concept of immutable into linux because I don't know
> all the scenarios present in linux where VMAs's metadata can be
> modified.

Good grief.  It seems obvious if you want to lock the change-behaviour
of an object (the object in this case being a VA sub-region, there is a
datastructure for that, in OpenBSD it is called an "entry"), then you
put a flag in that object's data-structure and you simply check the flag
everytime a change-operation is attempted.  It is a flag which gets set,
and checked.  Nothing ever clears it (except address space teardown).

This flag must be put on the data structure that manages VA sub-ranges.

In our case when a prot/mapping operation reaches low-level code that
will want to change an "entry", we notice it is not allowed and simply
percolate EPERM up through the layers.

> There could be quite a few things we still need to deal with, to
> completely block the possibility,
> e.g. malicious code attempting to write to a RO memory

What?!  writes to RO memory are blocked by the permission bits.

> or change RW memory to RWX.

In our case that is blocked by W^X policy.

But if the region is marked mimmutable, then that's another reason you cannot
change RW to RWX.  It seems so off-topic, to talk about writes to RO memory.
I get a feeling you are a bit lost.

mimmutable() is not about permissions, but about locking permissions.
- You can't change the permissions of the address space region.
- You cannot map a replacement object at the location instead (especially
  with different permission).
- You cannot unmap at that location (which you would do if you wanted to
  map a new object, with a different permission).

All 3 of these scenarios are identical.  No regular code performs these 3
operations on regions of the address space which we mark immutable.

There is nothing more to mimmutable in the VM layer.  The hard work is
writing code in execve() and ld.so which will decide which objects can
be marked immutable automatically, so that programs don't do this to
themselves.

I'm aware of where this simple piece fits in.  It does not solve all
problems, it is a very narrow change to impact a problem which only
high-value targets will ever face (like chrome).

But I think you don't understand the purpose of this mechanism.

> If, as part of immutable, I also block madvice(), mlock(), which also updates
> VMA's metadata, so by definition, I could.  What if the user wants the
> features in
> madvice() and at the same time, also wants their .text protected ?

I have no idea what you are talking about.  None of those things relate
to the access permission of the memory the user sees, and therefore none
of them are in the attack surface profile which is being prevented.

Meaning, we allow madvise() and mlock() and mphysicalquantummemory() because
those relate to the physical storage and not the VA permission model.

> Also, if linux introduces a new syscall that depends on a new metadata of VMA,
> say msecret(), (for discussion purpose), should immutable
> automatically support that ?

How about the future makingexcuses() system call?

I don't think you understand the problem space well enough to come up with
your own solution for it.  I spent a year on this, and ship a complete system
using it.  You are asking such simplistic questions above it shocks me.

Maybe read the LWN article;

    https://lwn.net/Articles/915640/

> Without those questions answered, I couldn't choose the route of
> immutable() yet.

"... so I can clearly not choose the wine in front of you."

If you don't understand what this thing is for, and cannot minimize the
complexity of this thing, then Linux doesn't need it at all.

I should warn everyone the hard work is not in the VM layer, but in
ld.so -- deciding which parts of the image to make immutable, and when.
It is also possible to make some segments immutable directly in execve()
-- but in both cases you better have a really good grasp on RELRO
executable layout or will make too many pieces immutable...

I am pretty sure Linux will never get as far as we got. Even our main
stacks are marked immutable, but in Linux that would conflict with glibc
ld.so mprotecting RWX the stack if you dlopen() a shared library with
GNUSTACK, a very bad idea which needs a different fight...
Jeff Xu Oct. 18, 2023, 3:18 a.m. UTC | #17
On Tue, Oct 17, 2023 at 4:57 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@google.com> wrote:
>
> > May I ask, for BSD's implementation of immutable(), do you cover
> > things such as mlock(),
> > madvice() ? or just the protection bit (WRX) + remap() + unmap().
>
> It only prevents removal of the mapping, placement of a replacement
> mapping, or changing the existing permissions.  If one page in the
> existing sub-region is marked immutable, the whole operation fails with
> EPERM.
>
> Those are the only user-visible aspects that an attacker cares about to
> utilize in this area.
>
> mlock() and madvise() deal with the physical memory handling underneath
> the VA.  They have nothing to do with how attack code might manipulate
> the VA address space inside a program to convert a series of dead-end
> approaches into a succesfull escalation strategy.
>
> [It would be very long conversation to explain where and how this has
> been utilized to make an attack succesfull]
>
> > In other words:
> > Is BSD's definition of immutable equivalent to
> > MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?
>
> I can't compare it to your subsystem, because I completely fail to
> understand the cause or benefit of all the complexity.
>
> I think I've explained what mimmutable() is in extremely simple terms.
>

Thanks for the explanation, based on those, this is exactly what the
current set of patch does.
In practice: libc could do below:
#define MM_IMMUTABLE
(MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP)
mseal(add,len, MM_IMMUTABLE)
it will be equivalent to BSD's immutable().

> And I don't understand else you are trying to do anything beyond what

> mimmutable() offers.  It seems like this is inventing additional
> solutions without proof that any of them are necessary to solve the
> specific problem that is known.
>
> > I hesitate to introduce the concept of immutable into linux because I don't know
> > all the scenarios present in linux where VMAs's metadata can be
> > modified.
>
> Good grief.  It seems obvious if you want to lock the change-behaviour
> of an object (the object in this case being a VA sub-region, there is a
> datastructure for that, in OpenBSD it is called an "entry"), then you
> put a flag in that object's data-structure and you simply check the flag
> everytime a change-operation is attempted.  It is a flag which gets set,
> and checked.  Nothing ever clears it (except address space teardown).
>
> This flag must be put on the data structure that manages VA sub-ranges.
>
> In our case when a prot/mapping operation reaches low-level code that
> will want to change an "entry", we notice it is not allowed and simply
> percolate EPERM up through the layers.
>
> > There could be quite a few things we still need to deal with, to
> > completely block the possibility,
> > e.g. malicious code attempting to write to a RO memory
>
> What?!  writes to RO memory are blocked by the permission bits.
>
> > or change RW memory to RWX.
>
> In our case that is blocked by W^X policy.
>
> But if the region is marked mimmutable, then that's another reason you cannot
> change RW to RWX.  It seems so off-topic, to talk about writes to RO memory.
> I get a feeling you are a bit lost.
>
> immutable() is not about permissions, but about locking permissions.
> - You can't change the permissions of the address space region.
> - You cannot map a replacement object at the location instead (especially
>   with different permission).
> - You cannot unmap at that location (which you would do if you wanted to
>   map a new object, with a different permission).
>
> All 3 of these scenarios are identical.  No regular code performs these 3
> operations on regions of the address space which we mark immutable.
>
> There is nothing more to mimmutable in the VM layer.  The hard work is
> writing code in execve() and ld.so which will decide which objects can
> be marked immutable automatically, so that programs don't do this to
> themselves.
>
> I'm aware of where this simple piece fits in.  It does not solve all
> problems, it is a very narrow change to impact a problem which only
> high-value targets will ever face (like chrome).
>
> But I think you don't understand the purpose of this mechanism.
>

In linux cases, I think, eventually, mseal() will have a bigger scope than
BSD's mimmutable().  VMA's metadata(vm_area_struct) contains a lot
of control info, depending on application's needs, mseal() can be
expanded to seal individual control info.

For example, in madvice(2) case:
As Jann point out in [1] and I quote:
"you'd probably also want to block destructive madvise() operations
that can effectively alter region contents by discarding pages and
such, ..."

Another example: if an application wants to keep a memory always
present in RAM, for whatever the reason, it can call seal the mlock().

To handle those two new cases. mseal() could add two more bits:
MM_SEAL_MADVICE, MM_SEAL_MLOCK.

It is practical to keep syscall extentable, when the business logic is the same.

I think I  explained the logic of using bitmasks in the mseal()
interface clearly with the example of madvice() and mlock().

-Jeff


[1] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/


> > If, as part of immutable, I also block madvice(), mlock(), which also updates
> > VMA's metadata, so by definition, I could.  What if the user wants the
> > features in
> > madvice() and at the same time, also wants their .text protected ?
>
> I have no idea what you are talking about.  None of those things relate
> to the access permission of the memory the user sees, and therefore none
> of them are in the attack surface profile which is being prevented.
>
> Meaning, we allow madvise() and mlock() and mphysicalquantummemory() because
> those relate to the physical storage and not the VA permission model.
>
> > Also, if linux introduces a new syscall that depends on a new metadata of VMA,
> > say msecret(), (for discussion purpose), should immutable
> > automatically support that ?
>
> How about the future makingexcuses() system call?
>
> I don't think you understand the problem space well enough to come up with
> your own solution for it.  I spent a year on this, and ship a complete system
> using it.  You are asking such simplistic questions above it shocks me.
>
> Maybe read the LWN article;
>
>     https://lwn.net/Articles/915640/
>
> > Without those questions answered, I couldn't choose the route of
> > immutable() yet.
>
> "... so I can clearly not choose the wine in front of you."
>
> If you don't understand what this thing is for, and cannot minimize the
> complexity of this thing, then Linux doesn't need it at all.
>
> I should warn everyone the hard work is not in the VM layer, but in
> ld.so -- deciding which parts of the image to make immutable, and when.
> It is also possible to make some segments immutable directly in execve()
> -- but in both cases you better have a really good grasp on RELRO
> executable layout or will make too many pieces immutable...
>
> I am pretty sure Linux will never get as far as we got. Even our main
> stacks are marked immutable, but in Linux that would conflict with glibc
> ld.so mprotecting RWX the stack if you dlopen() a shared library with
> GNUSTACK, a very bad idea which needs a different fight...
Theo de Raadt Oct. 18, 2023, 3:37 a.m. UTC | #18
Jeff Xu <jeffxu@google.com> wrote:

> In linux cases, I think, eventually, mseal() will have a bigger scope than
> BSD's mimmutable().

I don't believe that, considering noone needed this behaviour from the VM
system in the last 4 decades.

> VMA's metadata(vm_area_struct) contains a lot
> of control info, depending on application's needs, mseal() can be
> expanded to seal individual control info.

> For example, in madvice(2) case:
> As Jann point out in [1] and I quote:
> "you'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, ..."

Then prohibit madvise(MADV_FREE) on all non-writeable mappings that are
immutable.  Just include this in the set of behaviours. Or make it the
default.

Don't make it an option that a program needs to set on pages!  Noone
is going to call it.  Most programs don't know the addresses of the
*REGIONS* they would want to do this for.

Does your program know where libc's text segment starts and ends?
No your program does not know these addresses, so the parts of the
'system' which do know this needs to do it (which would be ld.so or
the libc init constructors).

If madvise(MADV_FREE) is so dangerous..  say you have a program that
would call through abort(), but you know a zero there can make the
abort not abort but return, then is it bad to let the attacker do:

   madvise(&abort, pagesize, MADV_FREE)

If that is bad, then block it in a smart way!  Don't make a programmer
of an application figure out how to do this.  That results in a defense
methodology where a handful of programs self-protect, but everything
else is unprotected or unprotectable.  That is shortsighted.

> Another example: if an application wants to keep a memory always
> present in RAM, for whatever the reason, it can call seal the mlock().

Please explain the attack surface.

> I think I  explained the logic of using bitmasks in the mseal()
> interface clearly with the example of madvice() and mlock().

It is clear as mud.
Matthew Wilcox (Oracle) Oct. 18, 2023, 3:17 p.m. UTC | #19
On Tue, Oct 17, 2023 at 08:18:47PM -0700, Jeff Xu wrote:
> In practice: libc could do below:
> #define MM_IMMUTABLE
> (MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP)
> mseal(add,len, MM_IMMUTABLE)
> it will be equivalent to BSD's immutable().

No, it wouldn't, because you've carefully listed the syscalls you're
blocking instead of understanding the _concept_ of what you need to
block.

> In linux cases, I think, eventually, mseal() will have a bigger scope than
> BSD's mimmutable().  VMA's metadata(vm_area_struct) contains a lot
> of control info, depending on application's needs, mseal() can be
> expanded to seal individual control info.
> 
> For example, in madvice(2) case:
> As Jann point out in [1] and I quote:
> "you'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, ..."
> 
> Another example: if an application wants to keep a memory always
> present in RAM, for whatever the reason, it can call seal the mlock().
> 
> To handle those two new cases. mseal() could add two more bits:
> MM_SEAL_MADVICE, MM_SEAL_MLOCK.

Yes, thank you for demonstrating that you have no idea what you need to
block.

> It is practical to keep syscall extentable, when the business logic is the same.

I concur with Theo & Linus.  You don't know what you're doing.  I think
the underlying idea of mimmutable() is good, but how you've split it up
and how you've implemented it is terrible.

Let's start with the purpose.  The point of mimmutable/mseal/whatever is
to fix the mapping of an address range to its underlying object, be it
a particular file mapping or anonymous memory.  After the call succeeds,
it must not be possible to make any address in that virtual range point
into any other object.

The secondary purpose is to lock down permissions on that range.
Possibly to fix them where they are, possibly to allow RW->RO transitions.

With those purposes in mind, you should be able to deduce for any syscall
or any madvise(), ... whether it should be allowed.

Look, I appreciate this is only your second set of patches to Linux and
you've taken on a big job.  But that's all the more reason you should
listen to people who are trying to help you.
Jeff Xu Oct. 18, 2023, 6:20 p.m. UTC | #20
On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> > >
> > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > could easily integrate with mmap() and as such allow for one-shot
> > > mmap() + mseal().
> > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > definitely sounds like a performance win as we halve the number of
> > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > like common patterns.
> > >
> > Yes. mmap() can support sealing as well, and memory is allocated as
> > immutable from begining.
> > This is orthogonal to mseal() though.
>
> I don't see how this can be orthogonal to mseal().
> In the case we opt for adding PROT_ bits, we should more or less only
> need to adapt calc_vm_prot_bits(), and the rest should work without
> issues.
> vma merging won't merge vmas with different prots. The current
> interfaces (mmap and mprotect) would work just fine.
> In this case, mseal() or mimmutable() would only be needed if you need
> to set immutability over a range of VMAs with different permissions.
>
Agreed. By orthogonal, I meant we can have two APIs:
mmap() and mseal()/mprotect()
i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
Sealing can be applied after initial memory creation.

> Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> The only annoying wrench in my plans here is that we have effectively
> run out of vm_flags bits in 32-bit architectures, so this approach as
> I described is not compatible with 32-bit.
>
> > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > changed to RO, for example, during symbol resolution.
> > The important point is that the application can decide what type of
> > sealing it wants, and when to apply it.  There needs to be an api(),
> > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > important to me.
> >
> > mprotect() in linux have the following signature:
> > int mprotect(void addr[.len], size_t len, int prot);
> > the prot bitmasks are all taken here.
> > I have not checked the prot field in mmap(), there might be bits left,
> > even not, we could have mmap2(), so that is not an issue.
>
> I don't see what you mean. We have plenty of prot bits left (32-bits,
> and we seem to have around 8 different bits used).
> And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
>
> The only issue seems to be that 32-bit ran out of vm_flags, but that
> can probably be worked around if need be.
>
Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
Apology that I was wrong previously and caused confusion.

There is a slight difference in the syntax of mprotect and mseal.
Each time when mprotect() is called, the kernel takes all of RWX bits
and updates vm_flags,
In other words, the application sets/unset each RWX, and kernel takes it.

In the mseal() case, the kernel will remember which seal types were
applied previously, and the application doesn’t need to repeat all
existing seal types in the next mseal().  Once a seal type is applied,
it can’t be unsealed.

So if we want to use mprotect() for sealing, developers need to think
of sealing bits differently than the rest of prot bits. It is a
different programming model, might or might not be an obvious concept
to developers.

There is a difference in input check and error handling as well.
for mseal(), if a given address range has a gap (unallocated memory),
or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
updated.
For mprotect(), some VMAs can be updated, till an error happens to a VMA.

IMO: I think it makes sense for mseal() and mprotect() to be different
in this. mseal() only checks vma struct so it is fast, and mprotect()
goes deeper into HW.

Because of those two differences, personally I feel a new syscall
might be worth the effort.

That said, mmap() + mprotect() is workable to me if that is what the
community wants.

Thanks
-Jeff

-Jeff


> --
> Pedro
Jeff Xu Oct. 18, 2023, 6:54 p.m. UTC | #21
On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Let's start with the purpose.  The point of mimmutable/mseal/whatever is
> to fix the mapping of an address range to its underlying object, be it
> a particular file mapping or anonymous memory.  After the call succeeds,
> it must not be possible to make any address in that virtual range point
> into any other object.
>
> The secondary purpose is to lock down permissions on that range.
> Possibly to fix them where they are, possibly to allow RW->RO transitions.
>
> With those purposes in mind, you should be able to deduce for any syscall
> or any madvise(), ... whether it should be allowed.
>
I got it.

IMO: The approaches mimmutable() and mseal() took are different, but
we all want to seal the memory from attackers and make the linux
application safer.

mimmutable() started  with "none of updates to the sealed address is
allowed once marked as immutable", this includes from within kernel space
including driver, or any new syscalls. It is reasonable to me.

mseal() starts with 4 syscalls from userspace, which is just a way (among many
other ways) to demo what memory operation can be sealed, which happens
to meet what Chome wants.  This is an RFC, I appreciate your input.

Best regards,
-Jeff
Theo de Raadt Oct. 18, 2023, 8:36 p.m. UTC | #22
Jeff Xu <jeffxu@google.com> wrote:

> On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Let's start with the purpose.  The point of mimmutable/mseal/whatever is
> > to fix the mapping of an address range to its underlying object, be it
> > a particular file mapping or anonymous memory.  After the call succeeds,
> > it must not be possible to make any address in that virtual range point
> > into any other object.
> >
> > The secondary purpose is to lock down permissions on that range.
> > Possibly to fix them where they are, possibly to allow RW->RO transitions.
> >
> > With those purposes in mind, you should be able to deduce for any syscall
> > or any madvise(), ... whether it should be allowed.
> >
> I got it.
> 
> IMO: The approaches mimmutable() and mseal() took are different, but
> we all want to seal the memory from attackers and make the linux
> application safer.

I think you are building mseal for chrome, and chrome alone.

I do not think this will work out for the rest of the application space
because

1) it is too complicated
2) experience with mimmutable() says that applications don't do any of it
   themselves, it is all in execve(), libc initialization, and ld.so.
   You don't strike me as an execve, libc, or ld.so developer.
Stephen Röttger Oct. 19, 2023, 8 a.m. UTC | #23
> I do like us starting with just "mimmutable()", since it already
> exists. Particularly if chrome already knows how to use it.
>
> Maybe add a flag field (require it to be zero initially) just to allow
> any future expansion. Maybe the chrome team has *wanted* to have some
> finer granularity thing and currently doesn't use mimmutable() in some
> case?

Yes, we do have a use case in Chrome to split the sealing into unmap and
mprotect which will allow us to seal additional pages that we can't seal with
pure mimmutable().
For example, we have pkey-tagged RWX memory that we want to seal. Since
the memory is already RWX and the pkey controls write access, we don't care
about permission changes but sometimes we do need to mprotect data only
pages.
But the munmap sealing will provide protection against implicit changes of the
pkey in this case which would happen if a page gets unmapped and another
mapped in its place.
Stephen Röttger Oct. 19, 2023, 8:28 a.m. UTC | #24
> > IMO: The approaches mimmutable() and mseal() took are different, but
> > we all want to seal the memory from attackers and make the linux
> > application safer.
>
> I think you are building mseal for chrome, and chrome alone.
>
> I do not think this will work out for the rest of the application space
> because
>
> 1) it is too complicated
> 2) experience with mimmutable() says that applications don't do any of it
>    themselves, it is all in execve(), libc initialization, and ld.so.
>    You don't strike me as an execve, libc, or ld.so developer.

We do want to build this in a way that it can be applied automatically by ld.so
and we appreciate all your feedback on this. The intention of
splitting the sealing
by syscall was to provide flexibility while still allowing ld.so to
seal all operations.
But it's clear from the feedback that both the fine grained split and
the per-syscall
approach are not the right way to go.
Does Linus' proposal to just split munmap / mprotect sealing address your
complexity concerns? ld.so would always use both flags which should then behave
similar to mimmutable().
Jeff Xu Oct. 19, 2023, 5:30 p.m. UTC | #25
Hi Pedro

Some followup on mmap() + mprotect():

On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > > >
> > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > could easily integrate with mmap() and as such allow for one-shot
> > > > mmap() + mseal().
> > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > definitely sounds like a performance win as we halve the number of
> > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > like common patterns.
> > > >
> > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > immutable from begining.
> > > This is orthogonal to mseal() though.
> >
> > I don't see how this can be orthogonal to mseal().
> > In the case we opt for adding PROT_ bits, we should more or less only
> > need to adapt calc_vm_prot_bits(), and the rest should work without
> > issues.
> > vma merging won't merge vmas with different prots. The current
> > interfaces (mmap and mprotect) would work just fine.
> > In this case, mseal() or mimmutable() would only be needed if you need
> > to set immutability over a range of VMAs with different permissions.
> >
> Agreed. By orthogonal, I meant we can have two APIs:
> mmap() and mseal()/mprotect()
> i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> Sealing can be applied after initial memory creation.
>
> > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > The only annoying wrench in my plans here is that we have effectively
> > run out of vm_flags bits in 32-bit architectures, so this approach as
> > I described is not compatible with 32-bit.
> >
> > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > changed to RO, for example, during symbol resolution.
> > > The important point is that the application can decide what type of
> > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > important to me.
> > >
> > > mprotect() in linux have the following signature:
> > > int mprotect(void addr[.len], size_t len, int prot);
> > > the prot bitmasks are all taken here.
> > > I have not checked the prot field in mmap(), there might be bits left,
> > > even not, we could have mmap2(), so that is not an issue.
> >
> > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > and we seem to have around 8 different bits used).
> > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> >
> > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > can probably be worked around if need be.
> >
> Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> Apology that I was wrong previously and caused confusion.
>
> There is a slight difference in the syntax of mprotect and mseal.
> Each time when mprotect() is called, the kernel takes all of RWX bits
> and updates vm_flags,
> In other words, the application sets/unset each RWX, and kernel takes it.
>
> In the mseal() case, the kernel will remember which seal types were
> applied previously, and the application doesn’t need to repeat all
> existing seal types in the next mseal().  Once a seal type is applied,
> it can’t be unsealed.
>
> So if we want to use mprotect() for sealing, developers need to think
> of sealing bits differently than the rest of prot bits. It is a
> different programming model, might or might not be an obvious concept
> to developers.
>
This probably doesn't matter much to developers.
We can enforce the sealing bit to be the same as the rest of PROT bits.
If mprotect() tries to unset sealing, it will fail.

> There is a difference in input check and error handling as well.
> for mseal(), if a given address range has a gap (unallocated memory),
> or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> updated.
> For mprotect(), some VMAs can be updated, till an error happens to a VMA.
>
This difference doesn't matter much.

For mprotect()/mmap(), is Linux implementation limited by POSIX ?
This can be made backward compatible.
If there is no objection to adding linux specific values in mmap() and
mprotect(),
This works for me.

Thanks for your input.
-Jeff
Pedro Falcato Oct. 19, 2023, 10:47 p.m. UTC | #26
On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote:
>
> Hi Pedro
>
> Some followup on mmap() + mprotect():
>
> On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
> >
> > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > > >
> > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > mmap() + mseal().
> > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > definitely sounds like a performance win as we halve the number of
> > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > like common patterns.
> > > > >
> > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > immutable from begining.
> > > > This is orthogonal to mseal() though.
> > >
> > > I don't see how this can be orthogonal to mseal().
> > > In the case we opt for adding PROT_ bits, we should more or less only
> > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > issues.
> > > vma merging won't merge vmas with different prots. The current
> > > interfaces (mmap and mprotect) would work just fine.
> > > In this case, mseal() or mimmutable() would only be needed if you need
> > > to set immutability over a range of VMAs with different permissions.
> > >
> > Agreed. By orthogonal, I meant we can have two APIs:
> > mmap() and mseal()/mprotect()
> > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> > Sealing can be applied after initial memory creation.
> >
> > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > The only annoying wrench in my plans here is that we have effectively
> > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > I described is not compatible with 32-bit.
> > >
> > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > changed to RO, for example, during symbol resolution.
> > > > The important point is that the application can decide what type of
> > > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > important to me.
> > > >
> > > > mprotect() in linux have the following signature:
> > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > the prot bitmasks are all taken here.
> > > > I have not checked the prot field in mmap(), there might be bits left,
> > > > even not, we could have mmap2(), so that is not an issue.
> > >
> > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > and we seem to have around 8 different bits used).
> > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > >
> > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > can probably be worked around if need be.
> > >
> > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > Apology that I was wrong previously and caused confusion.
> >
> > There is a slight difference in the syntax of mprotect and mseal.
> > Each time when mprotect() is called, the kernel takes all of RWX bits
> > and updates vm_flags,
> > In other words, the application sets/unset each RWX, and kernel takes it.
> >
> > In the mseal() case, the kernel will remember which seal types were
> > applied previously, and the application doesn’t need to repeat all
> > existing seal types in the next mseal().  Once a seal type is applied,
> > it can’t be unsealed.
> >
> > So if we want to use mprotect() for sealing, developers need to think
> > of sealing bits differently than the rest of prot bits. It is a
> > different programming model, might or might not be an obvious concept
> > to developers.
> >
> This probably doesn't matter much to developers.
> We can enforce the sealing bit to be the same as the rest of PROT bits.
> If mprotect() tries to unset sealing, it will fail.

Yep. Erroneous or malicious mprotects would all be caught. However, if
we add a PROT_DOWNGRADEABLE (that could let you, lets say, mprotect()
to less permissions or even downright munmap()) you'd want some care
to preserve that bit when setting permissions.

>
> > There is a difference in input check and error handling as well.
> > for mseal(), if a given address range has a gap (unallocated memory),
> > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> > updated.
> > For mprotect(), some VMAs can be updated, till an error happens to a VMA.
> >
> This difference doesn't matter much.
>
> For mprotect()/mmap(), is Linux implementation limited by POSIX ?

No. POSIX works merely as a baseline that UNIX systems aim towards.
You can (and very frequently do) extend POSIX interfaces (in fact,
it's how most of POSIX was written, through sheer
"design-by-committee" on a bunch of UNIX systems' extensions).

> This can be made backward compatible.
> If there is no objection to adding linux specific values in mmap() and
> mprotect(),
> This works for me.

Linux already has system-specific values for PROT_ (PROT_BTI,
PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc).
Whether this is the right interface is another question. I do like it
a lot, but there's of course value in being compatible with existing
solutions (like mimmutable()).
Linus Torvalds Oct. 19, 2023, 11:06 p.m. UTC | #27
On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
>
> No. POSIX works merely as a baseline that UNIX systems aim towards.
> You can (and very frequently do) extend POSIX interfaces (in fact,
> it's how most of POSIX was written, through sheer
> "design-by-committee" on a bunch of UNIX systems' extensions).

We can in extreme circumstances actually go further than that, and not
only extend on POSIX requirements, but actively even violate them.

It does need a very good reason, though, but it has happened when
POSIX requirements were simply actively wrong.

For example, at one point POSIX required

     int accept(int s, struct sockaddr *addr, size_t *addrlen);

which was simply completely wrong. It's utter shite, and didn't
actually match any reality.

The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make
it "size_t" was completely unacceptable.

So we ignored it, told POSIX people that they were full of sh*t, and
they eventually fixed it in the next version (by introducing a
"socklen_t" that had better be the same as "int").

So POSIX can simply be wrong.

Also, if it turns out that we had some ABI that wasn't
POSIX-compatible, the whole "don't break user space" will take
precedence over any POSIX concerns, and we will not "fix" our system
call if people already use our old semantics.

So in that case, we generally end up with a new system call (or new
flags) instead.

Or sometimes it just is such a small detail that nobody cares - POSIX
also has a notion of documenting areas of non-conformance, and people
who really care end up having notions like "conformance vs _strict_
conformance".

                 Linus
Theo de Raadt Oct. 20, 2023, 3:55 p.m. UTC | #28
Stephen Röttger <sroettger@google.com> wrote:

> > > IMO: The approaches mimmutable() and mseal() took are different, but
> > > we all want to seal the memory from attackers and make the linux
> > > application safer.
> >
> > I think you are building mseal for chrome, and chrome alone.
> >
> > I do not think this will work out for the rest of the application space
> > because
> >
> > 1) it is too complicated
> > 2) experience with mimmutable() says that applications don't do any of it
> >    themselves, it is all in execve(), libc initialization, and ld.so.
> >    You don't strike me as an execve, libc, or ld.so developer.
> 
> We do want to build this in a way that it can be applied automatically by ld.so
> and we appreciate all your feedback on this.

Hi Stephen,

I am pretty sure your mechanism will be useable by ld.so.

What bothers me is the complex many-bits approach may encourage people
to set only a subset of the bits, and then believe they have a security
primitive.

Partial sealing is not safe.  I define partial sealing as "blocking munmap,
but not mprotect".  Or "blocking mprotect, but not madvise or mmap".

In Message-id <ZS/3GCKvNn5qzhC4@casper.infradead.org> Matthew stated there
that there are two aspects being locked: which object is mapped, and the
permission of that mapping.  When additional system calls msync() and madvise()
are included in the picture, there are 3 actions being prevented:

 - Can someone replace the object
 - Can someone change the permission
 - Can someone throw away the cached pages, reverting to original
   content of the object (that is the madvise / msync)

In Message-id: <CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com>
Jan reminded us of this piece.  I'm taking this as a long-standing security
hole in some sub-operations of msync/madvise which can write to data regions
that aren't actually writeable.  Sub-operations with this problem are MADV_FREE,
MADV_DONTNEED, POSIX_MADV_DONTNEED, MS_INVALIDATE.. on Linux MADV_WIPEONFORK,
and probably a whole bunch of others.  I am testing OpenBSD changes which
demand PROT_WRITE permission for these sub-operations.  Perhaps some systems
are already careful.

If you leave any of these operators available, the object is not actually sealed
against abuse.  I believe an attacker will simply switch to a different operator
(mmap, munmap, mprotect, madvise, msync) to achieve a similar objective of
damaging the permission or contents.

Since mseal() is designed to create partial sealings, the name of the proposed
system call really smells.

> The intention of
> splitting the sealing
> by syscall was to provide flexibility while still allowing ld.so to
> seal all operations.

Yes, you will have ld.so set all the bits, and the same in C runtime
initialization.  If you convince glibc to stop make the stack executable
in dlopen(), the kernel could automatically do it.. With Linux backwards
compat management, getting there would be an extremely long long long
roadmap.  But anyways the idea would be "set all the bits".  Because otherwise
the object or data isn't safe.

> Does Linus' proposal to just split munmap / mprotect sealing address your
> complexity concerns? ld.so would always use both flags which should then behave
> similar to mimmutable().

No, I think it is weak, because it isn't sealed.

A seperate mail in the thread from you says this is about chrome wanting
to use PKU on RWX objects.  I think that's the reason for wanting to
seperate the sealing (I haven't heard of other applications wanting that).
How about we explore that in the other subthread..
Theo de Raadt Oct. 20, 2023, 4:27 p.m. UTC | #29
Stephen Röttger <sroettger@google.com> wrote:

> > I do like us starting with just "mimmutable()", since it already
> > exists. Particularly if chrome already knows how to use it.
> >
> > Maybe add a flag field (require it to be zero initially) just to allow
> > any future expansion. Maybe the chrome team has *wanted* to have some
> > finer granularity thing and currently doesn't use mimmutable() in some
> > case?
> 
> Yes, we do have a use case in Chrome to split the sealing into unmap and
> mprotect which will allow us to seal additional pages that we can't seal with
> pure mimmutable().

> For example, we have pkey-tagged RWX memory that we want to seal. Since
> the memory is already RWX and the pkey controls write access, we don't care
> about permission changes but sometimes we do need to mprotect data only
> pages.

Let me try to decompose this statement.

This is clearly for the JIT. You can pivot between the a JIT generated
code mapping being RW and RX (or X-only), the object will pivot between
W or X to satisfy W^X policy for safety.

I think you are talking about a RWX MAP_ANON object.

Then you use pkey_alloc() to get a PKEY.  pkey_mprotect() sets the PKEY on
the region.  I argue you can then make it entirely immutable / sealed.

Let's say it is fully immutable / sealed.

After which, you can change the in-processor PKU register (using pkey_set)
to toggle the Write-Inhibit and eXecute-Inhibit bits on that key.

The immutable object has a dangerous RWX permission.  But the specific
PKEY making it either RX (or X-only) or RW depending upon your context.
The mapping is never exposed as RWX. The PKU model reduces the
permission access of the object below the immutable permission level.

The security depends on the PKEY WI/XI bits being difficult to control.

SADLY on x86, this is managed with a PKRU userland register which is changeble
without any supervisor control -- yes, it seems quite dangerous.
Changing it requires a complicated MSR dance.  It is unfortunate that
the pkey_set() library function is easily reachedable in the PLT via ROP
methods.  On non-x86 cpus that have similar functionality, the register
is privileged, but operating supporting it generally change it and
return immediately.

The problem you seem to have with fully locked mseal() in chrome seems
to be here:

> about permission changes but sometimes we do need to mprotect data only
> pages.

Does that data have to be in the same region?  Can your allocator not
put the non-code pieces of the JIT elsewhere, with a different
permission, fully immutable / msealed -- and perhaps even managed with a
different PKEY if neccessary?

May that requires a huge rearchitecture.  But isn't the root problem here
that data and code are being handled in the same object with a shared
permission model?

> But the munmap sealing will provide protection against implicit changes of the
> pkey in this case which would happen if a page gets unmapped and another
> mapped in its place.

That primitive feels so weird, I have a difficult time believing it will
remain unattackable in the long term.


But what if you could replace mprotect() with pkey_mprotect() upon a
different key.. ?

--

A few more notes comparing what OpenBSD has done compared to Linux:


In OpenBSD, we do not have the pkey library.  We have stolen one of the
PKEY and use it for kernel support of xonly for kernel code and userland
code.  On x86 we recognize that userland can flip the permission by whacking
the RPKU register -- which would make xonly code readable.  (The chrome data
you are trying to guard faces the same problem).

To prevent that, a majority of traps in the kernel (page faults,
interrupts, etc) check if the PKRU register has been modified, and kill
the process.  It is statistically strong.

We are not making pkey available as a userland feature, but if we later
do so we would still have 15 pkeys to play with.  We would probably make
the pkey_set() operation a system call, so the trap handler can also
observe RPKU register modifications by the instruction.

Above, I mentioned pivoting between "RW or RX (or X-only)".  On OpenBSD, chrome would be
able to pivot between RW and X-only.

When it comes to Pkey utilization, we've ended up in a very different
place than Linux.
Jeff Xu Oct. 23, 2023, 5:42 p.m. UTC | #30
On Thu, Oct 19, 2023 at 3:47 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote:
> >
> > Hi Pedro
> >
> > Some followup on mmap() + mprotect():
> >
> > On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
> > >
> > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > > mmap() + mseal().
> > > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > > definitely sounds like a performance win as we halve the number of
> > > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > > like common patterns.
> > > > > >
> > > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > > immutable from begining.
> > > > > This is orthogonal to mseal() though.
> > > >
> > > > I don't see how this can be orthogonal to mseal().
> > > > In the case we opt for adding PROT_ bits, we should more or less only
> > > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > > issues.
> > > > vma merging won't merge vmas with different prots. The current
> > > > interfaces (mmap and mprotect) would work just fine.
> > > > In this case, mseal() or mimmutable() would only be needed if you need
> > > > to set immutability over a range of VMAs with different permissions.
> > > >
> > > Agreed. By orthogonal, I meant we can have two APIs:
> > > mmap() and mseal()/mprotect()
> > > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> > > Sealing can be applied after initial memory creation.
> > >
> > > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > > The only annoying wrench in my plans here is that we have effectively
> > > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > > I described is not compatible with 32-bit.
> > > >
> > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > > changed to RO, for example, during symbol resolution.
> > > > > The important point is that the application can decide what type of
> > > > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > > important to me.
> > > > >
> > > > > mprotect() in linux have the following signature:
> > > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > > the prot bitmasks are all taken here.
> > > > > I have not checked the prot field in mmap(), there might be bits left,
> > > > > even not, we could have mmap2(), so that is not an issue.
> > > >
> > > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > > and we seem to have around 8 different bits used).
> > > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > > >
> > > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > > can probably be worked around if need be.
> > > >
> > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > > Apology that I was wrong previously and caused confusion.
> > >
> > > There is a slight difference in the syntax of mprotect and mseal.
> > > Each time when mprotect() is called, the kernel takes all of RWX bits
> > > and updates vm_flags,
> > > In other words, the application sets/unset each RWX, and kernel takes it.
> > >
> > > In the mseal() case, the kernel will remember which seal types were
> > > applied previously, and the application doesn’t need to repeat all
> > > existing seal types in the next mseal().  Once a seal type is applied,
> > > it can’t be unsealed.
> > >
> > > So if we want to use mprotect() for sealing, developers need to think
> > > of sealing bits differently than the rest of prot bits. It is a
> > > different programming model, might or might not be an obvious concept
> > > to developers.
> > >
> > This probably doesn't matter much to developers.
> > We can enforce the sealing bit to be the same as the rest of PROT bits.
> > If mprotect() tries to unset sealing, it will fail.
>
> Yep. Erroneous or malicious mprotects would all be caught. However, if
> we add a PROT_DOWNGRADEABLE (that could let you, lets say, mprotect()
> to less permissions or even downright munmap()) you'd want some care
> to preserve that bit when setting permissions.
>
> >
> > > There is a difference in input check and error handling as well.
> > > for mseal(), if a given address range has a gap (unallocated memory),
> > > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> > > updated.
> > > For mprotect(), some VMAs can be updated, till an error happens to a VMA.
> > >
> > This difference doesn't matter much.
> >
> > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
>
> No. POSIX works merely as a baseline that UNIX systems aim towards.
> You can (and very frequently do) extend POSIX interfaces (in fact,
> it's how most of POSIX was written, through sheer
> "design-by-committee" on a bunch of UNIX systems' extensions).
>
> > This can be made backward compatible.
> > If there is no objection to adding linux specific values in mmap() and
> > mprotect(),
> > This works for me.
>
> Linux already has system-specific values for PROT_ (PROT_BTI,
> PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc).
> Whether this is the right interface is another question. I do like it
> a lot, but there's of course value in being compatible with existing
> solutions (like mimmutable()).
>

Thanks Pedro for providing examples on mm extension to POSIX. This
opens more design options on solving the sealing problem. I will take
a few days to research  design options.

-Jeff


> --
> Pedro
Jeff Xu Oct. 23, 2023, 5:44 p.m. UTC | #31
On Thu, Oct 19, 2023 at 4:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
> >
> > No. POSIX works merely as a baseline that UNIX systems aim towards.
> > You can (and very frequently do) extend POSIX interfaces (in fact,
> > it's how most of POSIX was written, through sheer
> > "design-by-committee" on a bunch of UNIX systems' extensions).
>
> We can in extreme circumstances actually go further than that, and not
> only extend on POSIX requirements, but actively even violate them.
>
> It does need a very good reason, though, but it has happened when
> POSIX requirements were simply actively wrong.
>
> For example, at one point POSIX required
>
>      int accept(int s, struct sockaddr *addr, size_t *addrlen);
>
> which was simply completely wrong. It's utter shite, and didn't
> actually match any reality.
>
> The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make
> it "size_t" was completely unacceptable.
>
> So we ignored it, told POSIX people that they were full of sh*t, and
> they eventually fixed it in the next version (by introducing a
> "socklen_t" that had better be the same as "int").
>
> So POSIX can simply be wrong.
>
> Also, if it turns out that we had some ABI that wasn't
> POSIX-compatible, the whole "don't break user space" will take
> precedence over any POSIX concerns, and we will not "fix" our system
> call if people already use our old semantics.
>
> So in that case, we generally end up with a new system call (or new
> flags) instead.
>
> Or sometimes it just is such a small detail that nobody cares - POSIX
> also has a notion of documenting areas of non-conformance, and people
> who really care end up having notions like "conformance vs _strict_
> conformance".
>
>                  Linus
>

Thanks Linus for clarifying the guidelines on POSIX in Linux.

-Jeff
Stephen Röttger Oct. 24, 2023, 10:42 a.m. UTC | #32
> The problem you seem to have with fully locked mseal() in chrome seems
> to be here:
>
> > about permission changes but sometimes we do need to mprotect data only
> > pages.
>
> Does that data have to be in the same region?  Can your allocator not
> put the non-code pieces of the JIT elsewhere, with a different
> permission, fully immutable / msealed -- and perhaps even managed with a
> different PKEY if neccessary?

No we can't. We investigated this extensively since this also poses some
difficulties on MacOS. We implemented different approaches but any such
change to the allocator introduces too much of a performance impact.