Message ID | 20231016143828.647848-1-jeffxu@chromium.org (mailing list archive) |
---|---|
Headers | show |
Series | Introduce mseal() syscall | expand |
On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote: > Modern CPUs support memory permissions such as RW and NX bits. Linux has > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. This seems like a confusing way to introduce the subject. Here, you're talking about page permissions, whereas (as far as I can tell), mseal() is about making _virtual_ addresses immutable, for some value of immutable. > Memory sealing additionally protects the mapping itself against > modifications. This is useful to mitigate memory corruption issues where > a corrupted pointer is passed to a memory management syscall. For example, > such an attacker primitive can break control-flow integrity guarantees > since read-only memory that is supposed to be trusted can become writable > or .text pages can get remapped. Memory sealing can automatically be > applied by the runtime loader to seal .text and .rodata pages and > applications can additionally seal security critical data at runtime. > A similar feature already exists in the XNU kernel with the > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > Also, Chrome wants to adopt this feature for their CFI work [2] and this > patchset has been designed to be compatible with the Chrome use case. This [2] seems very generic and wide-ranging, not helpful. [5] was more useful to understand what you're trying to do. > The new mseal() is an architecture independent syscall, and with > following signature: > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) > > addr/len: memory range. Must be continuous/allocated memory, or else > mseal() will fail and no VMA is updated. For details on acceptable > arguments, please refer to comments in mseal.c. Those are also fully > covered by the selftest. Mmm. So when you say "continuous/allocated" what you really mean is "Must have contiguous VMAs" rather than "All pages in this range must be populated", yes? > types: bit mask to specify which syscall to seal, currently they are: > MM_SEAL_MSEAL 0x1 > MM_SEAL_MPROTECT 0x2 > MM_SEAL_MUNMAP 0x4 > MM_SEAL_MMAP 0x8 > MM_SEAL_MREMAP 0x10 I don't understand why we want this level of granularity. The OpenBSD and XNU examples just say "This must be immutable*". For values of immutable that allow downgrading access (eg RW to RO or RX to RO), but not upgrading access (RW->RX, RO->*, RX->RW). > Each bit represents sealing for one specific syscall type, e.g. > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > is that the API is extendable, i.e. when needed, the sealing can be > extended to madvise, mlock, etc. Backward compatibility is also easy. Honestly, it feels too flexible. Why not just two flags to mprotect() -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- maybe for some things we want to be able to downgrade and for other things, we don't. I'd like to see some discussion of how this interacts with mprotect(). As far as I can tell, the intent is to lock the protections/existance of the mapping, and not to force memory to stay in core. So it's fine for the kernel to swap out the page and set up a PTE as a swap entry. It's also fine for the kernel to mark PTEs as RO to catch page faults; we're concerned with the LOGICAL permissions, and not the page tables.
On Mon, 16 Oct 2023 at 07:38, <jeffxu@chromium.org> wrote: > > This patchset proposes a new mseal() syscall for the Linux kernel. So I have no objections to adding some kind of "lock down memory mappings" model, but this isn't it. First off, the simple stuff: the commit messages are worthless. Having check seal for mmap(2) as the commit message is not even remotely acceptable, to pick one random example from the series (7/8). But that doesn't matter much, since I think the more fundamental problems are much worse: - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is just complete noise and totally illogical. The whole concept needs to be redone. Example of complete nonsense (again, picking 7/8, but that's again just a random example): > @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages, > flags |= MAP_LOCKED; > > file = get_file(vma->vm_file); > - ret = do_mmap(vma->vm_file, start, size, > - prot, flags, pgoff, &populate, NULL); > + ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff, > + &populate, NULL, ON_BEHALF_OF_KERNEL); > fput(file); > out: > mmap_write_unlock(mm); Christ. That's *literally* the remap_file_pages() system call definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense in this context. It's not the only situation. "mremap() as the same thing. vm_munmap() has the same thing. vm_brk_flags() has the same thing. None of them make any sense. But even if those obvious kinds of complete mis-designs were to be individually fixed, the whole naming and concept is bogus. The "ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check sealing". Naming should describe what the thing *means*, not some random policy thing that may or may not be at all relevant. - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away. Not only is it unacceptable to pointlessly create two different name spaces for no obvious reason, code like this (from 1/8) should not exist: > + if (types & MM_SEAL_MSEAL) > + newtypes |= VM_SEAL_MSEAL; > + > + if (types & MM_SEAL_MPROTECT) > + newtypes |= VM_SEAL_MPROTECT; > + > + if (types & MM_SEAL_MUNMAP) > + newtypes |= VM_SEAL_MUNMAP; > + > + if (types & MM_SEAL_MMAP) > + newtypes |= VM_SEAL_MMAP; > + > + if (types & MM_SEAL_MREMAP) > + newtypes |= VM_SEAL_MREMAP; Sure, we have that in some cases when there was a *reason* for namespacing imposed on us from an API standpoint (ie the "open()" flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags being turned into MNT_xyz flags), but those tend to be signs of problems in the system call API where some mixup made it impossible to use the flags directly (most commonly because there is some extended internal representation of said things). For example, the MS_xyz namespace is a combination of "these are flags for the new mount" (like MS_RDONLY) and "this is how you should mount it" (like MS_REMOUNT), and so the user interface has that odd mixing of different things, and the MNT_xyz namespace is a distillation of the former. But we certainly should not strive to introduce *new* interfaces that start out with this kind of mixup and pointless "translate from one bit to another" code. - And finally (for now), I hate that MM_ACTION_xyz thing too! Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8): > + switch (action) { > + case MM_ACTION_MPROTECT: > + if (vma->vm_seals & VM_SEAL_MPROTECT) > + return false; > + break; when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and sealing bitmask and just say if (vma->vm_seal & vm_action) return -EPERM; IOW, you have pointlessly introduced not *two* different namespaces, but *three*. All doing the exact same thing, and all just causing pointless and ugly code to "translate" the actions of one into the model of another. So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz" thing. Use *one* single "these are the sealing bits". And get rid of "enum caller_origin" entirely. I don't know what the right model for that thing is, but that isn't it. *Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space cannot set, but that the kernel can use internally - and if that is the right model, then dammit, the *uses* should be very very obvious why the override is fine, because that remap_file_pages() use sure as hell was *not* obvious. In fact, it's not at all obvious why anything should override the sealing bits - EVER. So I'm not convinced that the right model is "replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we *ever* want to override sealing? It sounds like complete garbage. Most of the users seem to be things like "execve()", which is nonsensical, since the VM wouldn't have been sealed at that point _anyway_, so having a "don't bother checking sealing bits" flag seems entirely useless. Anyway, this is all a resounding NAK on this series in this form. My complaints are not some kind of small "fix this up". These are fundamental issues. Linus
On Mon, Oct 16, 2023 at 4:38 PM <jeffxu@chromium.org> wrote: > > From: Jeff Xu <jeffxu@google.com> > > This patchset proposes a new mseal() syscall for the Linux kernel. > > Modern CPUs support memory permissions such as RW and NX bits. Linux has > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. > The memory permission feature improves security stance on memory > corruption bugs, i.e. the attacker can’t just write to arbitrary memory > and point the code to it, the memory has to be marked with X bit, or > else an exception will happen. > > Memory sealing additionally protects the mapping itself against > modifications. This is useful to mitigate memory corruption issues where > a corrupted pointer is passed to a memory management syscall. For example, > such an attacker primitive can break control-flow integrity guarantees > since read-only memory that is supposed to be trusted can become writable > or .text pages can get remapped. Memory sealing can automatically be > applied by the runtime loader to seal .text and .rodata pages and > applications can additionally seal security critical data at runtime. > A similar feature already exists in the XNU kernel with the > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > Also, Chrome wants to adopt this feature for their CFI work [2] and this > patchset has been designed to be compatible with the Chrome use case. > > The new mseal() is an architecture independent syscall, and with > following signature: > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) Is the plan that the VMAs you need to protect would be created and mseal()'ed while you expect that attacker code can not (yet) be running concurrently? Do you expect to be using sealed memory for shadow stacks (in x86 CET / arm64 GCS) to prevent an attacker from mixing those up by moving pages inside a shadow stack or between different shadow stacks or such? (If that's even possible, I think it is but I haven't tried.) > addr/len: memory range. Must be continuous/allocated memory, or else > mseal() will fail and no VMA is updated. For details on acceptable > arguments, please refer to comments in mseal.c. Those are also fully > covered by the selftest. > types: bit mask to specify which syscall to seal, currently they are: > MM_SEAL_MSEAL 0x1 > MM_SEAL_MPROTECT 0x2 > MM_SEAL_MUNMAP 0x4 > MM_SEAL_MMAP 0x8 > MM_SEAL_MREMAP 0x10 You'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED; probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd want to just block all madvise() for sealed VMAs? Or rename process_madvise_behavior_valid() to something like "madvise_is_nondestructive()" and use that. The following comments probably mostly don't matter in practice if this feature is used in a context that is heavily seccomp-sandboxed (like Desktop Linux Chrome), but should maybe be addressed to make this feature more usable for other users. (Including Android Chrome, which has a weaker sandbox...) FWIW, it is also possible to write to read-only memory through the /proc/self/mem interface (or through ptrace commands like PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel configuration, seccomp policy, and what the LSMs do. (I think Android Chrome would allow /proc/self/mem writes, but would block PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?) I had a related ancient patch series in 2016 with an attempt to allow SELinux to prevent bypassing W^X protections with this, but I never followed through with getting that landed... (https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh.net/) I guess the question there is what the right semantics for this kind of protected memory are when a debugger is active. The simple solution that might break some debugging would be "just deny all FOLL_FORCE write access for this memory" (which would prevent debuggers from being able to place breakpoints, which would maybe not be great). But maybe it makes more sense to consider this to be an independent concern and solve it with a new SELinux feature or something like that instead, and then document that mseal() requires some complement to prevent forced writes to read-only private memory? (For which the simplest solution would be "don't grant filesystem access or ptrace() access to the sandboxed code".) What is the intended interaction with userfaultfd, which I believe by design permits arbitrary data into unpopulated areas of anonymous VMAs? If the intention is that the process should otherwise be sandboxed to not have access to userfaultfd, that should maybe be documented. (Alternatively I guess you could theoretically remove the VM_MAYWRITE bit from marked VMAs, but that might be more strict than we want, since it'd also block all FOLL_FORCE writes.) There are also some interfaces like AIO or the X86 Shadow Stacks interface that indirectly unmap memory through the kernel and look like they could perhaps be tricked into racily unmapping a just-created sealed VMA. You'd probably have to make sure that they can't do that and essentially treat their unmap operations as if they came from userspace, I guess? What Linus just wrote. I think either way this feature needs some documentation on what kind of context it's supposed to run in. > Each bit represents sealing for one specific syscall type, e.g. > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > is that the API is extendable, i.e. when needed, the sealing can be > extended to madvise, mlock, etc. Backward compatibility is also easy. > > The kernel will remember which seal types are applied, and the application > doesn’t need to repeat all existing seal types in the next mseal(). Once > a seal type is applied, it can’t be unsealed. Call mseal() on an existing > seal type is a no-action, not a failure. > > MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type. > > Internally, vm_area_struct adds a new field vm_seals, to store the bit > masks. > > For the affected syscalls, such as mprotect, a check(can_modify_mm) for > sealing is added, this usually happens at the early point of the syscall, > before any update is made to VMAs. The effect of that is: if any of the > VMAs in the given address range fails the sealing check, none of the VMA > will be updated. It might be worth noting that this is different from the > rest of mprotect(), where some updates can happen even when mprotect > returns fail. Consider can_modify_mm only checks vm_seals in > vm_area_struct, and it is not going deeper in the page table or updating > any HW, success or none behavior might fit better here. I would like to > listen to the community's feedback on this. > > The idea that inspired this patch comes from Stephen Röttger’s work in > V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API. > > In addition, Stephen is working on glibc change to add sealing support > into the dynamic linker to seal all non-writable segments at startup. When > that work is completed, all applications can automatically benefit from > these new protections. > > [1] https://kernelnewbies.org/Linux_2_6_8 > [2] https://v8.dev/blog/control-flow-integrity > [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 > [4] https://man.openbsd.org/mimmutable.2 > [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc > > Jeff Xu (8): > Add mseal syscall > Wire up mseal syscall > mseal: add can_modify_mm and can_modify_vma > mseal: seal mprotect > mseal munmap > mseal mremap > mseal mmap > selftest mm/mseal mprotect/munmap/mremap/mmap > > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > arch/arm/tools/syscall.tbl | 1 + > arch/arm64/include/asm/unistd.h | 2 +- > arch/arm64/include/asm/unistd32.h | 2 + > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > arch/s390/kernel/syscalls/syscall.tbl | 1 + > arch/sh/kernel/syscalls/syscall.tbl | 1 + > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > fs/aio.c | 5 +- > include/linux/mm.h | 55 +- > include/linux/mm_types.h | 7 + > include/linux/syscalls.h | 2 + > include/uapi/asm-generic/unistd.h | 5 +- > include/uapi/linux/mman.h | 6 + > ipc/shm.c | 3 +- > kernel/sys_ni.c | 1 + > mm/Kconfig | 8 + > mm/Makefile | 1 + > mm/internal.h | 4 +- > mm/mmap.c | 49 +- > mm/mprotect.c | 6 + > mm/mremap.c | 19 +- > mm/mseal.c | 328 +++++ > mm/nommu.c | 6 +- > mm/util.c | 8 +- > tools/testing/selftests/mm/Makefile | 1 + > tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++ > 37 files changed, 1934 insertions(+), 28 deletions(-) > create mode 100644 mm/mseal.c > create mode 100644 tools/testing/selftests/mm/mseal_test.c > > -- > 2.42.0.609.gbb76f46606-goog >
Hi Matthew. Thanks for your comments and time to review the patchset. On Mon, Oct 16, 2023 at 8:18 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote: > > Modern CPUs support memory permissions such as RW and NX bits. Linux has > > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. > > This seems like a confusing way to introduce the subject. Here, you're > talking about page permissions, whereas (as far as I can tell), mseal() is > about making _virtual_ addresses immutable, for some value of immutable. > > > Memory sealing additionally protects the mapping itself against > > modifications. This is useful to mitigate memory corruption issues where > > a corrupted pointer is passed to a memory management syscall. For example, > > such an attacker primitive can break control-flow integrity guarantees > > since read-only memory that is supposed to be trusted can become writable > > or .text pages can get remapped. Memory sealing can automatically be > > applied by the runtime loader to seal .text and .rodata pages and > > applications can additionally seal security critical data at runtime. > > A similar feature already exists in the XNU kernel with the > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > > Also, Chrome wants to adopt this feature for their CFI work [2] and this > > patchset has been designed to be compatible with the Chrome use case. > > This [2] seems very generic and wide-ranging, not helpful. [5] was more > useful to understand what you're trying to do. > > > The new mseal() is an architecture independent syscall, and with > > following signature: > > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) > > > > addr/len: memory range. Must be continuous/allocated memory, or else > > mseal() will fail and no VMA is updated. For details on acceptable > > arguments, please refer to comments in mseal.c. Those are also fully > > covered by the selftest. > > Mmm. So when you say "continuous/allocated" what you really mean is > "Must have contiguous VMAs" rather than "All pages in this range must > be populated", yes? > There can't be a gap (unallocated memory) in the given range. Those are covered in selftest: test_seal_unmapped_start() test_seal_unmapped_middle() test_seal_unmapped_end() The comments in check_mm_seal() also mentioned that. > > types: bit mask to specify which syscall to seal, currently they are: > > MM_SEAL_MSEAL 0x1 > > MM_SEAL_MPROTECT 0x2 > > MM_SEAL_MUNMAP 0x4 > > MM_SEAL_MMAP 0x8 > > MM_SEAL_MREMAP 0x10 > > I don't understand why we want this level of granularity. The OpenBSD > and XNU examples just say "This must be immutable*". For values of > immutable that allow downgrading access (eg RW to RO or RX to RO), > but not upgrading access (RW->RX, RO->*, RX->RW). > > > Each bit represents sealing for one specific syscall type, e.g. > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > > is that the API is extendable, i.e. when needed, the sealing can be > > extended to madvise, mlock, etc. Backward compatibility is also easy. > > Honestly, it feels too flexible. Why not just two flags to mprotect() > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- > maybe for some things we want to be able to downgrade and for other > things, we don't. > Having a seal type per syscall type helps to add the feature incrementally. Applications also know exactly what is sealed. I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we can define what it seals precisely. As Jann pointed out, there have other scenarios that potentially affect IMMUTABLE. Implementing all thoses will take time. And if we missed a case, we could introduce backward compatibility issues to the application. Bitmask will solve this nicely, i.e. application will need to apply the newly added sealing type explicitly. > I'd like to see some discussion of how this interacts with mprotect(). > As far as I can tell, the intent is to lock the protections/existance > of the mapping, and not to force memory to stay in core. So it's fine > for the kernel to swap out the page and set up a PTE as a swap entry. > It's also fine for the kernel to mark PTEs as RO to catch page faults; > we're concerned with the LOGICAL permissions, and not the page tables. Yes. That is correct. -Jeff Xu
Hi Jann, Thank you for reviewing the patch and comments. On Mon, Oct 16, 2023 at 10:35 AM Jann Horn <jannh@google.com> wrote: > > On Mon, Oct 16, 2023 at 4:38 PM <jeffxu@chromium.org> wrote: > > > > From: Jeff Xu <jeffxu@google.com> > > > > This patchset proposes a new mseal() syscall for the Linux kernel. > > > > Modern CPUs support memory permissions such as RW and NX bits. Linux has > > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. > > The memory permission feature improves security stance on memory > > corruption bugs, i.e. the attacker can’t just write to arbitrary memory > > and point the code to it, the memory has to be marked with X bit, or > > else an exception will happen. > > > > Memory sealing additionally protects the mapping itself against > > modifications. This is useful to mitigate memory corruption issues where > > a corrupted pointer is passed to a memory management syscall. For example, > > such an attacker primitive can break control-flow integrity guarantees > > since read-only memory that is supposed to be trusted can become writable > > or .text pages can get remapped. Memory sealing can automatically be > > applied by the runtime loader to seal .text and .rodata pages and > > applications can additionally seal security critical data at runtime. > > A similar feature already exists in the XNU kernel with the > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > > Also, Chrome wants to adopt this feature for their CFI work [2] and this > > patchset has been designed to be compatible with the Chrome use case. > > > > The new mseal() is an architecture independent syscall, and with > > following signature: > > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) > > Is the plan that the VMAs you need to protect would be created and > mseal()'ed while you expect that attacker code can not (yet) be > running concurrently? > Yes. > Do you expect to be using sealed memory for shadow stacks (in x86 CET > / arm64 GCS) to prevent an attacker from mixing those up by moving > pages inside a shadow stack or between different shadow stacks or > such? (If that's even possible, I think it is but I haven't tried.) > I will defer the question to Stephen. (I also think that is possible, if the application know where the shadow stack is at runtime) > > addr/len: memory range. Must be continuous/allocated memory, or else > > mseal() will fail and no VMA is updated. For details on acceptable > > arguments, please refer to comments in mseal.c. Those are also fully > > covered by the selftest. > > types: bit mask to specify which syscall to seal, currently they are: > > MM_SEAL_MSEAL 0x1 > > MM_SEAL_MPROTECT 0x2 > > MM_SEAL_MUNMAP 0x4 > > MM_SEAL_MMAP 0x8 > > MM_SEAL_MREMAP 0x10 > > You'd probably also want to block destructive madvise() operations > that can effectively alter region contents by discarding pages and > such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED; > probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd > want to just block all madvise() for sealed VMAs? Or rename > process_madvise_behavior_valid() to something like > "madvise_is_nondestructive()" and use that. > Thanks for the suggestion. I can add madvise() later, maybe in another series. For now, the goal of this patchset covers 4 syscalls, as wished for by Chrome initially. > The following comments probably mostly don't matter in practice if > this feature is used in a context that is heavily seccomp-sandboxed > (like Desktop Linux Chrome), but should maybe be addressed to make > this feature more usable for other users. (Including Android Chrome, > which has a weaker sandbox...) > > FWIW, it is also possible to write to read-only memory through the > /proc/self/mem interface (or through ptrace commands like > PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel > configuration, seccomp policy, and what the LSMs do. (I think Android > Chrome would allow /proc/self/mem writes, but would block > PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?) > > I had a related ancient patch series in 2016 with an attempt to allow > SELinux to prevent bypassing W^X protections with this, but I never > followed through with getting that landed... > (https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh.net/) > > I guess the question there is what the right semantics for this kind > of protected memory are when a debugger is active. The simple solution > that might break some debugging would be "just deny all FOLL_FORCE > write access for this memory" (which would prevent debuggers from > being able to place breakpoints, which would maybe not be great). But > maybe it makes more sense to consider this to be an independent > concern and solve it with a new SELinux feature or something like that > instead, and then document that mseal() requires some complement to > prevent forced writes to read-only private memory? (For which the > simplest solution would be "don't grant filesystem access or ptrace() > access to the sandboxed code".) > > What is the intended interaction with userfaultfd, which I believe by > design permits arbitrary data into unpopulated areas of anonymous > VMAs? If the intention is that the process should otherwise be > sandboxed to not have access to userfaultfd, that should maybe be > documented. (Alternatively I guess you could theoretically remove the > VM_MAYWRITE bit from marked VMAs, but that might be more strict than > we want, since it'd also block all FOLL_FORCE writes.) > > > There are also some interfaces like AIO or the X86 Shadow Stacks > interface that indirectly unmap memory through the kernel and look > like they could perhaps be tricked into racily unmapping a > just-created sealed VMA. You'd probably have to make sure that they > can't do that and essentially treat their unmap operations as if they > came from userspace, I guess? What Linus just wrote. > > > I think either way this feature needs some documentation on what kind > of context it's supposed to run in. > Thanks for listing all the cases. As you pointed out, Chrome is heavily sandboxed, with syscalls and file access. I will work with Stephan to check if some of these applied to Chrome. It is probably worth to mention, with the current approach of mseal(), i.e. by bitmask, it is possible to implement those above incrementally. Thanks -Jeff -Jeff > > > Each bit represents sealing for one specific syscall type, e.g. > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > > is that the API is extendable, i.e. when needed, the sealing can be > > extended to madvise, mlock, etc. Backward compatibility is also easy. > > > > The kernel will remember which seal types are applied, and the application > > doesn’t need to repeat all existing seal types in the next mseal(). Once > > a seal type is applied, it can’t be unsealed. Call mseal() on an existing > > seal type is a no-action, not a failure. > > > > MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type. > > > > Internally, vm_area_struct adds a new field vm_seals, to store the bit > > masks. > > > > For the affected syscalls, such as mprotect, a check(can_modify_mm) for > > sealing is added, this usually happens at the early point of the syscall, > > before any update is made to VMAs. The effect of that is: if any of the > > VMAs in the given address range fails the sealing check, none of the VMA > > will be updated. It might be worth noting that this is different from the > > rest of mprotect(), where some updates can happen even when mprotect > > returns fail. Consider can_modify_mm only checks vm_seals in > > vm_area_struct, and it is not going deeper in the page table or updating > > any HW, success or none behavior might fit better here. I would like to > > listen to the community's feedback on this. > > > > The idea that inspired this patch comes from Stephen Röttger’s work in > > V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API. > > > > In addition, Stephen is working on glibc change to add sealing support > > into the dynamic linker to seal all non-writable segments at startup. When > > that work is completed, all applications can automatically benefit from > > these new protections. > > > > [1] https://kernelnewbies.org/Linux_2_6_8 > > [2] https://v8.dev/blog/control-flow-integrity > > [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 > > [4] https://man.openbsd.org/mimmutable.2 > > [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc > > > > Jeff Xu (8): > > Add mseal syscall > > Wire up mseal syscall > > mseal: add can_modify_mm and can_modify_vma > > mseal: seal mprotect > > mseal munmap > > mseal mremap > > mseal mmap > > selftest mm/mseal mprotect/munmap/mremap/mmap > > > > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > > arch/arm/tools/syscall.tbl | 1 + > > arch/arm64/include/asm/unistd.h | 2 +- > > arch/arm64/include/asm/unistd32.h | 2 + > > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + > > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + > > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + > > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > > arch/s390/kernel/syscalls/syscall.tbl | 1 + > > arch/sh/kernel/syscalls/syscall.tbl | 1 + > > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > > fs/aio.c | 5 +- > > include/linux/mm.h | 55 +- > > include/linux/mm_types.h | 7 + > > include/linux/syscalls.h | 2 + > > include/uapi/asm-generic/unistd.h | 5 +- > > include/uapi/linux/mman.h | 6 + > > ipc/shm.c | 3 +- > > kernel/sys_ni.c | 1 + > > mm/Kconfig | 8 + > > mm/Makefile | 1 + > > mm/internal.h | 4 +- > > mm/mmap.c | 49 +- > > mm/mprotect.c | 6 + > > mm/mremap.c | 19 +- > > mm/mseal.c | 328 +++++ > > mm/nommu.c | 6 +- > > mm/util.c | 8 +- > > tools/testing/selftests/mm/Makefile | 1 + > > tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++ > > 37 files changed, 1934 insertions(+), 28 deletions(-) > > create mode 100644 mm/mseal.c > > create mode 100644 tools/testing/selftests/mm/mseal_test.c > > > > -- > > 2.42.0.609.gbb76f46606-goog > >
Hello Linus, Thank you for the time reviewing this patch set. On Mon, Oct 16, 2023 at 10:23 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Mon, 16 Oct 2023 at 07:38, <jeffxu@chromium.org> wrote: > > > > This patchset proposes a new mseal() syscall for the Linux kernel. > > So I have no objections to adding some kind of "lock down memory > mappings" model, but this isn't it. > > First off, the simple stuff: the commit messages are worthless. Having > > check seal for mmap(2) > > as the commit message is not even remotely acceptable, to pick one > random example from the series (7/8). > > But that doesn't matter much, since I think the more fundamental > problems are much worse: > > - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is > just complete noise and totally illogical. The whole concept needs to > be redone. > > Example of complete nonsense (again, picking 7/8, but that's again > just a random example): > > > @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages, > > flags |= MAP_LOCKED; > > > > file = get_file(vma->vm_file); > > - ret = do_mmap(vma->vm_file, start, size, > > - prot, flags, pgoff, &populate, NULL); > > + ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff, > > + &populate, NULL, ON_BEHALF_OF_KERNEL); > > fput(file); > > out: > > mmap_write_unlock(mm); > > Christ. That's *literally* the remap_file_pages() system call > definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense > in this context. > > It's not the only situation. "mremap() as the same thing. vm_munmap() > has the same thing. vm_brk_flags() has the same thing. None of them > make any sense. > > But even if those obvious kinds of complete mis-designs were to be > individually fixed, the whole naming and concept is bogus. The > "ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check > sealing". Naming should describe what the thing *means*, not some > random policy thing that may or may not be at all relevant. > I apologize that I didn't think of a better name for ON_BEHALF_OF_XX and I should have written a more clear commit message. I prepared a V2 patchset with a more detailed commit message, hopefully that will help to explain the design. Indeed, the ON_BEHALF_OF_XX is confusing, especially with remap_file_pathes(). remap_file_pathes(2) is not supported in this patch set, this covers mprotect(2), mmap(2), munmap(2), mremap(2) as the first feature set. I could extend the sealing to more syscalls, if it is determined necessary from the outcome of this discussion. The initial set of 4 syscalls was chosen based on Chrome's initial wish list. Regarding the ON_BEHALF_OF flag, my intention is to have a flag, set at syscall entry points of mprotect(2), munmap(2), mremap(2), mmap(2), pass the flag along the call stack, till reaching can_modify_mm(), can_modify_mm() does the actual check for the sealing type. It is probably worth noting that I choose to check one and only one sealing type per syscall. i.e. munmap(2) checks MM_SEAL_MUNMAP only. With this approach, sealing can be implemented incrementally. For example, When implementing sealing for munmap(2), I don't need to care that mremap(2) can also call internal functions to unmap the VMAs. The mremap(2) will be sealed by MM_SEAL_MREMAP(), and be dealt with separately. This approach also allows dev to expand the sealing to madvice(), mlock(), or whatever syscalls or cases that modify VMA's meta data. As Yann points out, There is a list of cases that we might care about. Having all of those implemented will take time. Using bitmasks will help to add those incrementally. An application will be backward compatible when a new sealing type is added, i.e. It has to set the new sealing type explicitly. > - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away. > > Not only is it unacceptable to pointlessly create two different name > spaces for no obvious reason, code like this (from 1/8) should not > exist: > > > + if (types & MM_SEAL_MSEAL) > > + newtypes |= VM_SEAL_MSEAL; > > + > > + if (types & MM_SEAL_MPROTECT) > > + newtypes |= VM_SEAL_MPROTECT; > > + > > + if (types & MM_SEAL_MUNMAP) > > + newtypes |= VM_SEAL_MUNMAP; > > + > > + if (types & MM_SEAL_MMAP) > > + newtypes |= VM_SEAL_MMAP; > > + > > + if (types & MM_SEAL_MREMAP) > > + newtypes |= VM_SEAL_MREMAP; > > Sure, we have that in some cases when there was a *reason* for > namespacing imposed on us from an API standpoint (ie the "open()" > flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags > being turned into MNT_xyz flags), but those tend to be signs of > problems in the system call API where some mixup made it impossible to > use the flags directly (most commonly because there is some extended > internal representation of said things). > > For example, the MS_xyz namespace is a combination of "these are flags > for the new mount" (like MS_RDONLY) and "this is how you should mount > it" (like MS_REMOUNT), and so the user interface has that odd mixing > of different things, and the MNT_xyz namespace is a distillation of > the former. > > But we certainly should not strive to introduce *new* interfaces that > start out with this kind of mixup and pointless "translate from one > bit to another" code. > The two namespaces can go away, that means the bitmap will be stored as is by vm_seals in VMA struct. (1/8) Copied below. +++ b/include/linux/mm_types.h @@ -660,6 +660,13 @@ struct vm_area_struct { + unsigned long vm_seals; /* seal flags, see mm.h. */ My original considerations are: 1. vm_seals is a new field, and mseal() currently uses 5 bits. vm_seals can be repurposed to store other VMA flags in future, having two namespaces (API and internal one) can be useful. 2. vm_flags is full and it seems to me there is pending work on expanding vm_flags. [1] if that happens, we will need translation logic. [1] https://lore.kernel.org/linux-mm/4F6CA298.4000301@jp.fujitsu.com/ I might have over-engineered this, so I removed the VM_SEAL_XX in V2. > - And finally (for now), I hate that MM_ACTION_xyz thing too! > > Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8): > > > + switch (action) { > > + case MM_ACTION_MPROTECT: > > + if (vma->vm_seals & VM_SEAL_MPROTECT) > > + return false; > > + break; > > when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and > sealing bitmask and just say > > if (vma->vm_seal & vm_action) > return -EPERM; > Make sense. My original thought is that can_modify_vma() will check one and only one seal type, and having an enum type will enforce that. This restriction feels unnecessary. I removed the action type in V2. > IOW, you have pointlessly introduced not *two* different namespaces, > but *three*. All doing the exact same thing, and all just causing > pointless and ugly code to "translate" the actions of one into the > model of another. > > So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz" > thing. Use *one* single "these are the sealing bits". > > And get rid of "enum caller_origin" entirely. I don't know what the > right model for that thing is, but that isn't it. > > *Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space > cannot set, but that the kernel can use internally - and if that is > the right model, then dammit, the *uses* should be very very obvious > why the override is fine, because that remap_file_pages() use sure as > hell was *not* obvious. > > In fact, it's not at all obvious why anything should override the > sealing bits - EVER. So I'm not convinced that the right model is > "replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we > *ever* want to override sealing? It sounds like complete garbage. Most > of the users seem to be things like "execve()", which is nonsensical, > since the VM wouldn't have been sealed at that point _anyway_, so > having a "don't bother checking sealing bits" flag seems entirely > useless. > Would the new commit message and comments in V2 help to explain the design better ? (will send shortly) Another code change I can make to help the readability (not in v2), is to set and pass checkSeals flag from syscall entry point, all the way to can_modify_vma(). Currently, I didn't do that if I checked the internal function is only used by syscal entry point, e.g. in do_mprotect_pkey(), mremap_to(), ksys_mmap_pgoff() cases. Doing that does increase the size of the patch set though. Thanks. -Jeff -Jeff > Anyway, this is all a resounding NAK on this series in this form. My > complaints are not some kind of small "fix this up". These are > fundamental issues. > > Linus
On Tue, Oct 17, 2023 at 01:34:35AM -0700, Jeff Xu wrote: > > > types: bit mask to specify which syscall to seal, currently they are: > > > MM_SEAL_MSEAL 0x1 > > > MM_SEAL_MPROTECT 0x2 > > > MM_SEAL_MUNMAP 0x4 > > > MM_SEAL_MMAP 0x8 > > > MM_SEAL_MREMAP 0x10 > > > > I don't understand why we want this level of granularity. The OpenBSD > > and XNU examples just say "This must be immutable*". For values of > > immutable that allow downgrading access (eg RW to RO or RX to RO), > > but not upgrading access (RW->RX, RO->*, RX->RW). > > > > > Each bit represents sealing for one specific syscall type, e.g. > > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > > > is that the API is extendable, i.e. when needed, the sealing can be > > > extended to madvise, mlock, etc. Backward compatibility is also easy. > > > > Honestly, it feels too flexible. Why not just two flags to mprotect() > > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- > > maybe for some things we want to be able to downgrade and for other > > things, we don't. > > > Having a seal type per syscall type helps to add the feature incrementally. > Applications also know exactly what is sealed. You're trying to portray a disadvantage as an advantage, This is the seccomp disease, only worse because you're trying to deny individual syscalls instead of building up a list of permitted syscalls. If we introduce a new syscall tomorrow which can affect VMAs, we'll then make it the application's fault for not disabling that new syscall. That's terrible design! > I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we > can define what it seals precisely. As Jann pointed out, there have > other scenarios that potentially affect IMMUTABLE. Implementing all thoses > will take time. And if we missed a case, we could introduce backward > compatibility issues to the application. Bitmask will solve this nicely, i.e. > application will need to apply the newly added sealing type explicitly. It is your job to do this. You seem to have taken the attitude that you will give the Chrome team exactly what they asked for instead of trying to understand their requirements and give them what they need. And don't send a v2 before discussion of v1 has come to an end. It uselessly fragments the discussion.
On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote: > > Modern CPUs support memory permissions such as RW and NX bits. Linux has > > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. > > This seems like a confusing way to introduce the subject. Here, you're > talking about page permissions, whereas (as far as I can tell), mseal() is > about making _virtual_ addresses immutable, for some value of immutable. > > > Memory sealing additionally protects the mapping itself against > > modifications. This is useful to mitigate memory corruption issues where > > a corrupted pointer is passed to a memory management syscall. For example, > > such an attacker primitive can break control-flow integrity guarantees > > since read-only memory that is supposed to be trusted can become writable > > or .text pages can get remapped. Memory sealing can automatically be > > applied by the runtime loader to seal .text and .rodata pages and > > applications can additionally seal security critical data at runtime. > > A similar feature already exists in the XNU kernel with the > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > > Also, Chrome wants to adopt this feature for their CFI work [2] and this > > patchset has been designed to be compatible with the Chrome use case. > > This [2] seems very generic and wide-ranging, not helpful. [5] was more > useful to understand what you're trying to do. > > > The new mseal() is an architecture independent syscall, and with > > following signature: > > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) > > > > addr/len: memory range. Must be continuous/allocated memory, or else > > mseal() will fail and no VMA is updated. For details on acceptable > > arguments, please refer to comments in mseal.c. Those are also fully > > covered by the selftest. > > Mmm. So when you say "continuous/allocated" what you really mean is > "Must have contiguous VMAs" rather than "All pages in this range must > be populated", yes? > > > types: bit mask to specify which syscall to seal, currently they are: > > MM_SEAL_MSEAL 0x1 > > MM_SEAL_MPROTECT 0x2 > > MM_SEAL_MUNMAP 0x4 > > MM_SEAL_MMAP 0x8 > > MM_SEAL_MREMAP 0x10 > > I don't understand why we want this level of granularity. The OpenBSD > and XNU examples just say "This must be immutable*". For values of > immutable that allow downgrading access (eg RW to RO or RX to RO), > but not upgrading access (RW->RX, RO->*, RX->RW). > > > Each bit represents sealing for one specific syscall type, e.g. > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > > is that the API is extendable, i.e. when needed, the sealing can be > > extended to madvise, mlock, etc. Backward compatibility is also easy. > > Honestly, it feels too flexible. Why not just two flags to mprotect() > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- > maybe for some things we want to be able to downgrade and for other > things, we don't. I think it's worth pointing out that this suggestion (with PROT_*) could easily integrate with mmap() and as such allow for one-shot mmap() + mseal(). If we consider the common case as 'addr = mmap(...); mseal(addr);', it definitely sounds like a performance win as we halve the number of syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem like common patterns.
On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote: > > It is probably worth noting that I choose to check one and only > one sealing type per syscall. i.e. munmap(2) checks > MM_SEAL_MUNMAP only. Yeah, this is wrong. It's wrong exactly because other system calls will unmap things too. Using mmap() to over-map something will unmap the old one. Same goes for mremap() to move over an existing mapping. So the whole "do things by the name of the system call" is not workable. All that matters is what the system calls *do*, not what their name is. And mmap() will fundamentally munmap() as part of the action. This is why I absolutely hated the old "ON_BEHALF_OF_xyz" flag, and why I still absolutely hate the "randomly pass different sealing flags fto do_munmap()". You should *not* be passing any flags at all to do_munmap(). Because *regardless* of who calls it, and regardless of which system call started the action, do_munmap() unmaps a virtual memory area. See what I'm saying? Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote: > > > > It is probably worth noting that I choose to check one and only > > one sealing type per syscall. i.e. munmap(2) checks > > MM_SEAL_MUNMAP only. > > Yeah, this is wrong. > > It's wrong exactly because other system calls will unmap things too. > > Using mmap() to over-map something will unmap the old one. > > Same goes for mremap() to move over an existing mapping. > > So the whole "do things by the name of the system call" is not workable. > > All that matters is what the system calls *do*, not what their name is. I agree completely... mseal() is a clone of mimmutable(2), but with an extremely over-complicated API based upon dubious arguments. I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all the components working correctly. There were many intermediate API during development, but in the end the API is simply: int mimmutable(void *addr, size_t len); The kernel code for mimmutable() traverses the specified VA range. In that range, it will find unmapped sub-regions (which are are ignored) and mapped sub-regions. For these mapped regions, it does not care what the permissions are, it just marks each sub-region as immutable. Later on, when any VM operation request upon a VA range attempts to (1) change the permissions (2) to re-map on top (3) or dispose of the mapping, that operation is refused with errno EPERM. We don't care where the request comes from (ie. what system call). It is a behaviour of the VM system, when asked to act upon a VA sub-range mapping. Very simple semantics. The only case where the immutable marker is ignored is during address space teardown as a result of process termination. In his submission of this API, Jeff Xu makes three claims I find dubious; > Also, Chrome wants to adopt this feature for their CFI work [2] and this > patchset has been designed to be compatible with the Chrome use case. I specifically designed mimmutable(2) with chrome in mind, and the chrome binary running on OpenBSD is full of immutable mappings. All the library regions automatically become immutable because ld.so can infer it and do the mimmutable calls for the right subregions. So this chrome work has already been done by OpenBSD, and it is dead simple. During early development I thought mimmutable(2) would be called by applications or libraries, but I was dead wrong: 99.9% of calls are from ld.so, and no applications need to call it, these are the two exceptions: In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data structures at initialization time, so they canoot be attacked to create an invariant for use in ROP return-to-libc style methods. In Chrome, there is a v8_flags variable rounded out to a full page, and placed in .data. Chrome initialized this variable, and wants to mprotect PROT_READ, but .data has been made immutable by ld.so. So we force this page into a new ELF section called "openbsd.mutable" which also behaves RW like .data. Where chrome does the mprotect PROT_READ, it now also performs mimmutable() on that page. > Having a seal type per syscall type helps to add the feature incrementally. Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our entire base operating system and 10,000+ applications automatically receive the benefits. In one year's effort. The only application which cared about it was chrome, described in the previous paragraph. I think Jeff's idea here is super dangerous. What will actually happen is people will add a few mseal() sub-operations and think the job is done. It isn't done. They need all the mseal() requests, or the mapping are not safe. It is very counterproductive to provide developers a complex API that has insecure suboperations. > Applications also know exactly what is sealed. Actually applicatins won't know because there is no tooling to inspect this -- but I will argue further that applications don't need to know. Immutable marking is a system decision, not a program decision. I'll close by asking for a new look at the mimmutable(2) API we settled on for OpenBSD. I think there is nothing wrong with it. I'm willing to help guide glibc / ld.so / musl teams through the problems they may find along the way, I know where the skeletons are buried. Two in particular: -znow RELRO already today, and xonly-text in the future. [1] https://man.openbsd.org/mimmutable.2
On Tue, 17 Oct 2023 at 11:20, Theo de Raadt <deraadt@openbsd.org> wrote: > > The only case where the immutable marker is ignored is during address space > teardown as a result of process termination. .. and presumably also execve()? I do like us starting with just "mimmutable()", since it already exists. Particularly if chrome already knows how to use it. Maybe add a flag field (require it to be zero initially) just to allow any future expansion. Maybe the chrome team has *wanted* to have some finer granularity thing and currently doesn't use mimmutable() in some case? Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 17 Oct 2023 at 11:20, Theo de Raadt <deraadt@openbsd.org> wrote: > > > > The only case where the immutable marker is ignored is during address space > > teardown as a result of process termination. > > .. and presumably also execve()? Ah yes of course that also. > I do like us starting with just "mimmutable()", since it already > exists. Particularly if chrome already knows how to use it. Well, our chrome fork knows how to use it. Robert Nagy in our group maintains 1280 patches to make chrome work on OpenBSD. Google ignores them and will not upstream them. Some of these changes are security related, and they ignore them. Other changes are to cope with security work we've done on our own, for example: JIT changes from Stephen@google for mandatory IBT which google hasn't upstreamed yet, impacts due to PROT_EXEC-only mappings, etc. But the only chrome diff required for mimmutable is for that v8_flags thing I described. And the same issue would need handling for mseal(). Introducing the new "mutable" ELF section is probably going to be a bigger fuss than the system call after mprotect(PROT_READ).... > Maybe add a flag field (require it to be zero initially) just to allow > any future expansion. Maybe the chrome team has *wanted* to have some > finer granularity thing and currently doesn't use mimmutable() in some > case? There's only one feature I can think of, and we already do it right now, documented in our manual page: CAVEATS At present, mprotect(2) may reduce permissions on immutable pages marked PROT_READ | PROT_WRITE to the less permissive PROT_READ. This one-way operation is permitted for an introductory period to observe how software uses this mechanism. It may change to require explicit mutable region annotation with __attribute__((section(".openbsd.mutable"))) and explicit calls to mimmutable(). We had something which needed this behaviour during the development transition. It is exlusively mprotect RW -> R, no other transitions allowed. But once the transition was done, we don't need it anymore. I want to delete it, because it is a bit of a trap. It still fails closed from an attack perspective, but... What worries me is a piece of code reached by mistake can do a mprotect(lowering), not receive -1 EPERM, and carry on running.. I'd prefer the first time you touch a mapping in the wrong way, you receive indication of error. This only applies applies to mprotect() acting up on a region so the argument is a bit weak, due to mprotect() return value checking being about as rare as unicorns. Also, it would be a pain for OpenBSD to transition to adding a 0 flag. I would need to see real cause not theory.
On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote: > > > Modern CPUs support memory permissions such as RW and NX bits. Linux has > > > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. > > > > This seems like a confusing way to introduce the subject. Here, you're > > talking about page permissions, whereas (as far as I can tell), mseal() is > > about making _virtual_ addresses immutable, for some value of immutable. > > > > > Memory sealing additionally protects the mapping itself against > > > modifications. This is useful to mitigate memory corruption issues where > > > a corrupted pointer is passed to a memory management syscall. For example, > > > such an attacker primitive can break control-flow integrity guarantees > > > since read-only memory that is supposed to be trusted can become writable > > > or .text pages can get remapped. Memory sealing can automatically be > > > applied by the runtime loader to seal .text and .rodata pages and > > > applications can additionally seal security critical data at runtime. > > > A similar feature already exists in the XNU kernel with the > > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > > > Also, Chrome wants to adopt this feature for their CFI work [2] and this > > > patchset has been designed to be compatible with the Chrome use case. > > > > This [2] seems very generic and wide-ranging, not helpful. [5] was more > > useful to understand what you're trying to do. > > > > > The new mseal() is an architecture independent syscall, and with > > > following signature: > > > > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) > > > > > > addr/len: memory range. Must be continuous/allocated memory, or else > > > mseal() will fail and no VMA is updated. For details on acceptable > > > arguments, please refer to comments in mseal.c. Those are also fully > > > covered by the selftest. > > > > Mmm. So when you say "continuous/allocated" what you really mean is > > "Must have contiguous VMAs" rather than "All pages in this range must > > be populated", yes? > > > > > types: bit mask to specify which syscall to seal, currently they are: > > > MM_SEAL_MSEAL 0x1 > > > MM_SEAL_MPROTECT 0x2 > > > MM_SEAL_MUNMAP 0x4 > > > MM_SEAL_MMAP 0x8 > > > MM_SEAL_MREMAP 0x10 > > > > I don't understand why we want this level of granularity. The OpenBSD > > and XNU examples just say "This must be immutable*". For values of > > immutable that allow downgrading access (eg RW to RO or RX to RO), > > but not upgrading access (RW->RX, RO->*, RX->RW). > > > > > Each bit represents sealing for one specific syscall type, e.g. > > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > > > is that the API is extendable, i.e. when needed, the sealing can be > > > extended to madvise, mlock, etc. Backward compatibility is also easy. > > > > Honestly, it feels too flexible. Why not just two flags to mprotect() > > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- > > maybe for some things we want to be able to downgrade and for other > > things, we don't. > > I think it's worth pointing out that this suggestion (with PROT_*) > could easily integrate with mmap() and as such allow for one-shot > mmap() + mseal(). > If we consider the common case as 'addr = mmap(...); mseal(addr);', it > definitely sounds like a performance win as we halve the number of > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem > like common patterns. > Yes. mmap() can support sealing as well, and memory is allocated as immutable from begining. This is orthogonal to mseal() though. In case of ld.so, iiuc, memory can be first allocated as W, then later changed to RO, for example, during symbol resolution. The important point is that the application can decide what type of sealing it wants, and when to apply it. There needs to be an api(), that can be mseal() or mprotect2() or mimmutable(), the naming is not important to me. mprotect() in linux have the following signature: int mprotect(void addr[.len], size_t len, int prot); the prot bitmasks are all taken here. I have not checked the prot field in mmap(), there might be bits left, even not, we could have mmap2(), so that is not an issue. Thanks -Jeff > -- > Pedro
On Tue, Oct 17, 2023 at 10:34 PM Jeff Xu <jeffxu@google.com> wrote: > > On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote: > > > > Modern CPUs support memory permissions such as RW and NX bits. Linux has > > > > supported NX since the release of kernel version 2.6.8 in August 2004 [1]. > > > > > > This seems like a confusing way to introduce the subject. Here, you're > > > talking about page permissions, whereas (as far as I can tell), mseal() is > > > about making _virtual_ addresses immutable, for some value of immutable. > > > > > > > Memory sealing additionally protects the mapping itself against > > > > modifications. This is useful to mitigate memory corruption issues where > > > > a corrupted pointer is passed to a memory management syscall. For example, > > > > such an attacker primitive can break control-flow integrity guarantees > > > > since read-only memory that is supposed to be trusted can become writable > > > > or .text pages can get remapped. Memory sealing can automatically be > > > > applied by the runtime loader to seal .text and .rodata pages and > > > > applications can additionally seal security critical data at runtime. > > > > A similar feature already exists in the XNU kernel with the > > > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. > > > > Also, Chrome wants to adopt this feature for their CFI work [2] and this > > > > patchset has been designed to be compatible with the Chrome use case. > > > > > > This [2] seems very generic and wide-ranging, not helpful. [5] was more > > > useful to understand what you're trying to do. > > > > > > > The new mseal() is an architecture independent syscall, and with > > > > following signature: > > > > > > > > mseal(void addr, size_t len, unsigned int types, unsigned int flags) > > > > > > > > addr/len: memory range. Must be continuous/allocated memory, or else > > > > mseal() will fail and no VMA is updated. For details on acceptable > > > > arguments, please refer to comments in mseal.c. Those are also fully > > > > covered by the selftest. > > > > > > Mmm. So when you say "continuous/allocated" what you really mean is > > > "Must have contiguous VMAs" rather than "All pages in this range must > > > be populated", yes? > > > > > > > types: bit mask to specify which syscall to seal, currently they are: > > > > MM_SEAL_MSEAL 0x1 > > > > MM_SEAL_MPROTECT 0x2 > > > > MM_SEAL_MUNMAP 0x4 > > > > MM_SEAL_MMAP 0x8 > > > > MM_SEAL_MREMAP 0x10 > > > > > > I don't understand why we want this level of granularity. The OpenBSD > > > and XNU examples just say "This must be immutable*". For values of > > > immutable that allow downgrading access (eg RW to RO or RX to RO), > > > but not upgrading access (RW->RX, RO->*, RX->RW). > > > > > > > Each bit represents sealing for one specific syscall type, e.g. > > > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask > > > > is that the API is extendable, i.e. when needed, the sealing can be > > > > extended to madvise, mlock, etc. Backward compatibility is also easy. > > > > > > Honestly, it feels too flexible. Why not just two flags to mprotect() > > > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE. I can see a use for that -- > > > maybe for some things we want to be able to downgrade and for other > > > things, we don't. > > > > I think it's worth pointing out that this suggestion (with PROT_*) > > could easily integrate with mmap() and as such allow for one-shot > > mmap() + mseal(). > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it > > definitely sounds like a performance win as we halve the number of > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem > > like common patterns. > > > Yes. mmap() can support sealing as well, and memory is allocated as > immutable from begining. > This is orthogonal to mseal() though. I don't see how this can be orthogonal to mseal(). In the case we opt for adding PROT_ bits, we should more or less only need to adapt calc_vm_prot_bits(), and the rest should work without issues. vma merging won't merge vmas with different prots. The current interfaces (mmap and mprotect) would work just fine. In this case, mseal() or mimmutable() would only be needed if you need to set immutability over a range of VMAs with different permissions. Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe The only annoying wrench in my plans here is that we have effectively run out of vm_flags bits in 32-bit architectures, so this approach as I described is not compatible with 32-bit. > In case of ld.so, iiuc, memory can be first allocated as W, then later > changed to RO, for example, during symbol resolution. > The important point is that the application can decide what type of > sealing it wants, and when to apply it. There needs to be an api(), > that can be mseal() or mprotect2() or mimmutable(), the naming is not > important to me. > > mprotect() in linux have the following signature: > int mprotect(void addr[.len], size_t len, int prot); > the prot bitmasks are all taken here. > I have not checked the prot field in mmap(), there might be bits left, > even not, we could have mmap2(), so that is not an issue. I don't see what you mean. We have plenty of prot bits left (32-bits, and we seem to have around 8 different bits used). And even if we didn't, prot is the same in mprotect and mmap and mmap2 :) The only issue seems to be that 32-bit ran out of vm_flags, but that can probably be worked around if need be.
On Tue, Oct 17, 2023 at 11:20 AM Theo de Raadt <deraadt@openbsd.org> wrote: > > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote: > > > > > > It is probably worth noting that I choose to check one and only > > > one sealing type per syscall. i.e. munmap(2) checks > > > MM_SEAL_MUNMAP only. > > > > Yeah, this is wrong. > > > > It's wrong exactly because other system calls will unmap things too. > > > > Using mmap() to over-map something will unmap the old one. > > > > Same goes for mremap() to move over an existing mapping. > > > > So the whole "do things by the name of the system call" is not workable. > > > > All that matters is what the system calls *do*, not what their name is. > > I agree completely... > > mseal() is a clone of mimmutable(2), but with an extremely > over-complicated API based upon dubious arguments. > > I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all > the components working correctly. There were many intermediate API > during development, but in the end the API is simply: > > int mimmutable(void *addr, size_t len); > > The kernel code for mimmutable() traverses the specified VA range. In > that range, it will find unmapped sub-regions (which are are ignored) > and mapped sub-regions. For these mapped regions, it does not care what > the permissions are, it just marks each sub-region as immutable. > > Later on, when any VM operation request upon a VA range attempts to > (1) change the permissions > (2) to re-map on top > (3) or dispose of the mapping, > that operation is refused with errno EPERM. We don't care where the > request comes from (ie. what system call). It is a behaviour of the > VM system, when asked to act upon a VA sub-range mapping. > > Very simple semantics. > > The only case where the immutable marker is ignored is during address space > teardown as a result of process termination. > May I ask, for BSD's implementation of immutable(), do you cover things such as mlock(), madvice() ? or just the protection bit (WRX) + remap() + unmap(). In other words: Is BSD's definition of immutable equivalent to MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ? I hesitate to introduce the concept of immutable into linux because I don't know all the scenarios present in linux where VMAs's metadata can be modified. As Jann's email pointed out, There could be quite a few things we still need to deal with, to completely block the possibility, e.g. malicious code attempting to write to a RO memory or change RW memory to RWX. If, as part of immutable, I also block madvice(), mlock(), which also updates VMA's metadata, so by definition, I could. What if the user wants the features in madvice() and at the same time, also wants their .text protected ? Also, if linux introduces a new syscall that depends on a new metadata of VMA, say msecret(), (for discussion purpose), should immutable automatically support that ? Without those questions answered, I couldn't choose the route of immutable() yet. -Jeff > > In his submission of this API, Jeff Xu makes three claims I find dubious; > > > Also, Chrome wants to adopt this feature for their CFI work [2] and this > > patchset has been designed to be compatible with the Chrome use case. > > I specifically designed mimmutable(2) with chrome in mind, and the > chrome binary running on OpenBSD is full of immutable mappings. All the > library regions automatically become immutable because ld.so can infer > it and do the mimmutable calls for the right subregions. > > So this chrome work has already been done by OpenBSD, and it is dead > simple. During early development I thought mimmutable(2) would be > called by applications or libraries, but I was dead wrong: 99.9% of > calls are from ld.so, and no applications need to call it, these are the > two exceptions: > > In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data > structures at initialization time, so they canoot be attacked to create > an invariant for use in ROP return-to-libc style methods. > > In Chrome, there is a v8_flags variable rounded out to a full page, and > placed in .data. Chrome initialized this variable, and wants to mprotect > PROT_READ, but .data has been made immutable by ld.so. So we force this > page into a new ELF section called "openbsd.mutable" which also behaves RW > like .data. Where chrome does the mprotect PROT_READ, it now also performs > mimmutable() on that page. > > > Having a seal type per syscall type helps to add the feature incrementally. > > Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our > entire base operating system and 10,000+ applications automatically receive > the benefits. In one year's effort. The only application which cared about > it was chrome, described in the previous paragraph. > > I think Jeff's idea here is super dangerous. What will actually happen > is people will add a few mseal() sub-operations and think the job is done. > It isn't done. They need all the mseal() requests, or the mapping are > not safe. > > It is very counterproductive to provide developers a complex API that has > insecure suboperations. > > > Applications also know exactly what is sealed. > > Actually applicatins won't know because there is no tooling to inspect this -- > but I will argue further that applications don't need to know. Immutable > marking is a system decision, not a program decision. > > > I'll close by asking for a new look at the mimmutable(2) API we settled > on for OpenBSD. I think there is nothing wrong with it. I'm willing to > help guide glibc / ld.so / musl teams through the problems they may find > along the way, I know where the skeletons are buried. Two in > particular: -znow RELRO already today, and xonly-text in the future. > > > [1] https://man.openbsd.org/mimmutable.2 >
Jeff Xu <jeffxu@google.com> wrote: > May I ask, for BSD's implementation of immutable(), do you cover > things such as mlock(), > madvice() ? or just the protection bit (WRX) + remap() + unmap(). It only prevents removal of the mapping, placement of a replacement mapping, or changing the existing permissions. If one page in the existing sub-region is marked immutable, the whole operation fails with EPERM. Those are the only user-visible aspects that an attacker cares about to utilize in this area. mlock() and madvise() deal with the physical memory handling underneath the VA. They have nothing to do with how attack code might manipulate the VA address space inside a program to convert a series of dead-end approaches into a succesfull escalation strategy. [It would be very long conversation to explain where and how this has been utilized to make an attack succesfull] > In other words: > Is BSD's definition of immutable equivalent to > MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ? I can't compare it to your subsystem, because I completely fail to understand the cause or benefit of all the complexity. I think I've explained what mimmutable() is in extremely simple terms. And I don't understand else you are trying to do anything beyond what mimmutable() offers. It seems like this is inventing additional solutions without proof that any of them are necessary to solve the specific problem that is known. > I hesitate to introduce the concept of immutable into linux because I don't know > all the scenarios present in linux where VMAs's metadata can be > modified. Good grief. It seems obvious if you want to lock the change-behaviour of an object (the object in this case being a VA sub-region, there is a datastructure for that, in OpenBSD it is called an "entry"), then you put a flag in that object's data-structure and you simply check the flag everytime a change-operation is attempted. It is a flag which gets set, and checked. Nothing ever clears it (except address space teardown). This flag must be put on the data structure that manages VA sub-ranges. In our case when a prot/mapping operation reaches low-level code that will want to change an "entry", we notice it is not allowed and simply percolate EPERM up through the layers. > There could be quite a few things we still need to deal with, to > completely block the possibility, > e.g. malicious code attempting to write to a RO memory What?! writes to RO memory are blocked by the permission bits. > or change RW memory to RWX. In our case that is blocked by W^X policy. But if the region is marked mimmutable, then that's another reason you cannot change RW to RWX. It seems so off-topic, to talk about writes to RO memory. I get a feeling you are a bit lost. mimmutable() is not about permissions, but about locking permissions. - You can't change the permissions of the address space region. - You cannot map a replacement object at the location instead (especially with different permission). - You cannot unmap at that location (which you would do if you wanted to map a new object, with a different permission). All 3 of these scenarios are identical. No regular code performs these 3 operations on regions of the address space which we mark immutable. There is nothing more to mimmutable in the VM layer. The hard work is writing code in execve() and ld.so which will decide which objects can be marked immutable automatically, so that programs don't do this to themselves. I'm aware of where this simple piece fits in. It does not solve all problems, it is a very narrow change to impact a problem which only high-value targets will ever face (like chrome). But I think you don't understand the purpose of this mechanism. > If, as part of immutable, I also block madvice(), mlock(), which also updates > VMA's metadata, so by definition, I could. What if the user wants the > features in > madvice() and at the same time, also wants their .text protected ? I have no idea what you are talking about. None of those things relate to the access permission of the memory the user sees, and therefore none of them are in the attack surface profile which is being prevented. Meaning, we allow madvise() and mlock() and mphysicalquantummemory() because those relate to the physical storage and not the VA permission model. > Also, if linux introduces a new syscall that depends on a new metadata of VMA, > say msecret(), (for discussion purpose), should immutable > automatically support that ? How about the future makingexcuses() system call? I don't think you understand the problem space well enough to come up with your own solution for it. I spent a year on this, and ship a complete system using it. You are asking such simplistic questions above it shocks me. Maybe read the LWN article; https://lwn.net/Articles/915640/ > Without those questions answered, I couldn't choose the route of > immutable() yet. "... so I can clearly not choose the wine in front of you." If you don't understand what this thing is for, and cannot minimize the complexity of this thing, then Linux doesn't need it at all. I should warn everyone the hard work is not in the VM layer, but in ld.so -- deciding which parts of the image to make immutable, and when. It is also possible to make some segments immutable directly in execve() -- but in both cases you better have a really good grasp on RELRO executable layout or will make too many pieces immutable... I am pretty sure Linux will never get as far as we got. Even our main stacks are marked immutable, but in Linux that would conflict with glibc ld.so mprotecting RWX the stack if you dlopen() a shared library with GNUSTACK, a very bad idea which needs a different fight...
On Tue, Oct 17, 2023 at 4:57 PM Theo de Raadt <deraadt@openbsd.org> wrote: > > Jeff Xu <jeffxu@google.com> wrote: > > > May I ask, for BSD's implementation of immutable(), do you cover > > things such as mlock(), > > madvice() ? or just the protection bit (WRX) + remap() + unmap(). > > It only prevents removal of the mapping, placement of a replacement > mapping, or changing the existing permissions. If one page in the > existing sub-region is marked immutable, the whole operation fails with > EPERM. > > Those are the only user-visible aspects that an attacker cares about to > utilize in this area. > > mlock() and madvise() deal with the physical memory handling underneath > the VA. They have nothing to do with how attack code might manipulate > the VA address space inside a program to convert a series of dead-end > approaches into a succesfull escalation strategy. > > [It would be very long conversation to explain where and how this has > been utilized to make an attack succesfull] > > > In other words: > > Is BSD's definition of immutable equivalent to > > MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ? > > I can't compare it to your subsystem, because I completely fail to > understand the cause or benefit of all the complexity. > > I think I've explained what mimmutable() is in extremely simple terms. > Thanks for the explanation, based on those, this is exactly what the current set of patch does. In practice: libc could do below: #define MM_IMMUTABLE (MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP) mseal(add,len, MM_IMMUTABLE) it will be equivalent to BSD's immutable(). > And I don't understand else you are trying to do anything beyond what > mimmutable() offers. It seems like this is inventing additional > solutions without proof that any of them are necessary to solve the > specific problem that is known. > > > I hesitate to introduce the concept of immutable into linux because I don't know > > all the scenarios present in linux where VMAs's metadata can be > > modified. > > Good grief. It seems obvious if you want to lock the change-behaviour > of an object (the object in this case being a VA sub-region, there is a > datastructure for that, in OpenBSD it is called an "entry"), then you > put a flag in that object's data-structure and you simply check the flag > everytime a change-operation is attempted. It is a flag which gets set, > and checked. Nothing ever clears it (except address space teardown). > > This flag must be put on the data structure that manages VA sub-ranges. > > In our case when a prot/mapping operation reaches low-level code that > will want to change an "entry", we notice it is not allowed and simply > percolate EPERM up through the layers. > > > There could be quite a few things we still need to deal with, to > > completely block the possibility, > > e.g. malicious code attempting to write to a RO memory > > What?! writes to RO memory are blocked by the permission bits. > > > or change RW memory to RWX. > > In our case that is blocked by W^X policy. > > But if the region is marked mimmutable, then that's another reason you cannot > change RW to RWX. It seems so off-topic, to talk about writes to RO memory. > I get a feeling you are a bit lost. > > immutable() is not about permissions, but about locking permissions. > - You can't change the permissions of the address space region. > - You cannot map a replacement object at the location instead (especially > with different permission). > - You cannot unmap at that location (which you would do if you wanted to > map a new object, with a different permission). > > All 3 of these scenarios are identical. No regular code performs these 3 > operations on regions of the address space which we mark immutable. > > There is nothing more to mimmutable in the VM layer. The hard work is > writing code in execve() and ld.so which will decide which objects can > be marked immutable automatically, so that programs don't do this to > themselves. > > I'm aware of where this simple piece fits in. It does not solve all > problems, it is a very narrow change to impact a problem which only > high-value targets will ever face (like chrome). > > But I think you don't understand the purpose of this mechanism. > In linux cases, I think, eventually, mseal() will have a bigger scope than BSD's mimmutable(). VMA's metadata(vm_area_struct) contains a lot of control info, depending on application's needs, mseal() can be expanded to seal individual control info. For example, in madvice(2) case: As Jann point out in [1] and I quote: "you'd probably also want to block destructive madvise() operations that can effectively alter region contents by discarding pages and such, ..." Another example: if an application wants to keep a memory always present in RAM, for whatever the reason, it can call seal the mlock(). To handle those two new cases. mseal() could add two more bits: MM_SEAL_MADVICE, MM_SEAL_MLOCK. It is practical to keep syscall extentable, when the business logic is the same. I think I explained the logic of using bitmasks in the mseal() interface clearly with the example of madvice() and mlock(). -Jeff [1] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/ > > If, as part of immutable, I also block madvice(), mlock(), which also updates > > VMA's metadata, so by definition, I could. What if the user wants the > > features in > > madvice() and at the same time, also wants their .text protected ? > > I have no idea what you are talking about. None of those things relate > to the access permission of the memory the user sees, and therefore none > of them are in the attack surface profile which is being prevented. > > Meaning, we allow madvise() and mlock() and mphysicalquantummemory() because > those relate to the physical storage and not the VA permission model. > > > Also, if linux introduces a new syscall that depends on a new metadata of VMA, > > say msecret(), (for discussion purpose), should immutable > > automatically support that ? > > How about the future makingexcuses() system call? > > I don't think you understand the problem space well enough to come up with > your own solution for it. I spent a year on this, and ship a complete system > using it. You are asking such simplistic questions above it shocks me. > > Maybe read the LWN article; > > https://lwn.net/Articles/915640/ > > > Without those questions answered, I couldn't choose the route of > > immutable() yet. > > "... so I can clearly not choose the wine in front of you." > > If you don't understand what this thing is for, and cannot minimize the > complexity of this thing, then Linux doesn't need it at all. > > I should warn everyone the hard work is not in the VM layer, but in > ld.so -- deciding which parts of the image to make immutable, and when. > It is also possible to make some segments immutable directly in execve() > -- but in both cases you better have a really good grasp on RELRO > executable layout or will make too many pieces immutable... > > I am pretty sure Linux will never get as far as we got. Even our main > stacks are marked immutable, but in Linux that would conflict with glibc > ld.so mprotecting RWX the stack if you dlopen() a shared library with > GNUSTACK, a very bad idea which needs a different fight...
Jeff Xu <jeffxu@google.com> wrote: > In linux cases, I think, eventually, mseal() will have a bigger scope than > BSD's mimmutable(). I don't believe that, considering noone needed this behaviour from the VM system in the last 4 decades. > VMA's metadata(vm_area_struct) contains a lot > of control info, depending on application's needs, mseal() can be > expanded to seal individual control info. > For example, in madvice(2) case: > As Jann point out in [1] and I quote: > "you'd probably also want to block destructive madvise() operations > that can effectively alter region contents by discarding pages and > such, ..." Then prohibit madvise(MADV_FREE) on all non-writeable mappings that are immutable. Just include this in the set of behaviours. Or make it the default. Don't make it an option that a program needs to set on pages! Noone is going to call it. Most programs don't know the addresses of the *REGIONS* they would want to do this for. Does your program know where libc's text segment starts and ends? No your program does not know these addresses, so the parts of the 'system' which do know this needs to do it (which would be ld.so or the libc init constructors). If madvise(MADV_FREE) is so dangerous.. say you have a program that would call through abort(), but you know a zero there can make the abort not abort but return, then is it bad to let the attacker do: madvise(&abort, pagesize, MADV_FREE) If that is bad, then block it in a smart way! Don't make a programmer of an application figure out how to do this. That results in a defense methodology where a handful of programs self-protect, but everything else is unprotected or unprotectable. That is shortsighted. > Another example: if an application wants to keep a memory always > present in RAM, for whatever the reason, it can call seal the mlock(). Please explain the attack surface. > I think I explained the logic of using bitmasks in the mseal() > interface clearly with the example of madvice() and mlock(). It is clear as mud.
On Tue, Oct 17, 2023 at 08:18:47PM -0700, Jeff Xu wrote: > In practice: libc could do below: > #define MM_IMMUTABLE > (MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP) > mseal(add,len, MM_IMMUTABLE) > it will be equivalent to BSD's immutable(). No, it wouldn't, because you've carefully listed the syscalls you're blocking instead of understanding the _concept_ of what you need to block. > In linux cases, I think, eventually, mseal() will have a bigger scope than > BSD's mimmutable(). VMA's metadata(vm_area_struct) contains a lot > of control info, depending on application's needs, mseal() can be > expanded to seal individual control info. > > For example, in madvice(2) case: > As Jann point out in [1] and I quote: > "you'd probably also want to block destructive madvise() operations > that can effectively alter region contents by discarding pages and > such, ..." > > Another example: if an application wants to keep a memory always > present in RAM, for whatever the reason, it can call seal the mlock(). > > To handle those two new cases. mseal() could add two more bits: > MM_SEAL_MADVICE, MM_SEAL_MLOCK. Yes, thank you for demonstrating that you have no idea what you need to block. > It is practical to keep syscall extentable, when the business logic is the same. I concur with Theo & Linus. You don't know what you're doing. I think the underlying idea of mimmutable() is good, but how you've split it up and how you've implemented it is terrible. Let's start with the purpose. The point of mimmutable/mseal/whatever is to fix the mapping of an address range to its underlying object, be it a particular file mapping or anonymous memory. After the call succeeds, it must not be possible to make any address in that virtual range point into any other object. The secondary purpose is to lock down permissions on that range. Possibly to fix them where they are, possibly to allow RW->RO transitions. With those purposes in mind, you should be able to deduce for any syscall or any madvise(), ... whether it should be allowed. Look, I appreciate this is only your second set of patches to Linux and you've taken on a big job. But that's all the more reason you should listen to people who are trying to help you.
On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > > > > I think it's worth pointing out that this suggestion (with PROT_*) > > > could easily integrate with mmap() and as such allow for one-shot > > > mmap() + mseal(). > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it > > > definitely sounds like a performance win as we halve the number of > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem > > > like common patterns. > > > > > Yes. mmap() can support sealing as well, and memory is allocated as > > immutable from begining. > > This is orthogonal to mseal() though. > > I don't see how this can be orthogonal to mseal(). > In the case we opt for adding PROT_ bits, we should more or less only > need to adapt calc_vm_prot_bits(), and the rest should work without > issues. > vma merging won't merge vmas with different prots. The current > interfaces (mmap and mprotect) would work just fine. > In this case, mseal() or mimmutable() would only be needed if you need > to set immutability over a range of VMAs with different permissions. > Agreed. By orthogonal, I meant we can have two APIs: mmap() and mseal()/mprotect() i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable(). Sealing can be applied after initial memory creation. > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe > The only annoying wrench in my plans here is that we have effectively > run out of vm_flags bits in 32-bit architectures, so this approach as > I described is not compatible with 32-bit. > > > In case of ld.so, iiuc, memory can be first allocated as W, then later > > changed to RO, for example, during symbol resolution. > > The important point is that the application can decide what type of > > sealing it wants, and when to apply it. There needs to be an api(), > > that can be mseal() or mprotect2() or mimmutable(), the naming is not > > important to me. > > > > mprotect() in linux have the following signature: > > int mprotect(void addr[.len], size_t len, int prot); > > the prot bitmasks are all taken here. > > I have not checked the prot field in mmap(), there might be bits left, > > even not, we could have mmap2(), so that is not an issue. > > I don't see what you mean. We have plenty of prot bits left (32-bits, > and we seem to have around 8 different bits used). > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :) > > The only issue seems to be that 32-bit ran out of vm_flags, but that > can probably be worked around if need be. > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not. Apology that I was wrong previously and caused confusion. There is a slight difference in the syntax of mprotect and mseal. Each time when mprotect() is called, the kernel takes all of RWX bits and updates vm_flags, In other words, the application sets/unset each RWX, and kernel takes it. In the mseal() case, the kernel will remember which seal types were applied previously, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed. So if we want to use mprotect() for sealing, developers need to think of sealing bits differently than the rest of prot bits. It is a different programming model, might or might not be an obvious concept to developers. There is a difference in input check and error handling as well. for mseal(), if a given address range has a gap (unallocated memory), or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is updated. For mprotect(), some VMAs can be updated, till an error happens to a VMA. IMO: I think it makes sense for mseal() and mprotect() to be different in this. mseal() only checks vma struct so it is fast, and mprotect() goes deeper into HW. Because of those two differences, personally I feel a new syscall might be worth the effort. That said, mmap() + mprotect() is workable to me if that is what the community wants. Thanks -Jeff -Jeff > -- > Pedro
On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote: > > Let's start with the purpose. The point of mimmutable/mseal/whatever is > to fix the mapping of an address range to its underlying object, be it > a particular file mapping or anonymous memory. After the call succeeds, > it must not be possible to make any address in that virtual range point > into any other object. > > The secondary purpose is to lock down permissions on that range. > Possibly to fix them where they are, possibly to allow RW->RO transitions. > > With those purposes in mind, you should be able to deduce for any syscall > or any madvise(), ... whether it should be allowed. > I got it. IMO: The approaches mimmutable() and mseal() took are different, but we all want to seal the memory from attackers and make the linux application safer. mimmutable() started with "none of updates to the sealed address is allowed once marked as immutable", this includes from within kernel space including driver, or any new syscalls. It is reasonable to me. mseal() starts with 4 syscalls from userspace, which is just a way (among many other ways) to demo what memory operation can be sealed, which happens to meet what Chome wants. This is an RFC, I appreciate your input. Best regards, -Jeff
Jeff Xu <jeffxu@google.com> wrote: > On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > Let's start with the purpose. The point of mimmutable/mseal/whatever is > > to fix the mapping of an address range to its underlying object, be it > > a particular file mapping or anonymous memory. After the call succeeds, > > it must not be possible to make any address in that virtual range point > > into any other object. > > > > The secondary purpose is to lock down permissions on that range. > > Possibly to fix them where they are, possibly to allow RW->RO transitions. > > > > With those purposes in mind, you should be able to deduce for any syscall > > or any madvise(), ... whether it should be allowed. > > > I got it. > > IMO: The approaches mimmutable() and mseal() took are different, but > we all want to seal the memory from attackers and make the linux > application safer. I think you are building mseal for chrome, and chrome alone. I do not think this will work out for the rest of the application space because 1) it is too complicated 2) experience with mimmutable() says that applications don't do any of it themselves, it is all in execve(), libc initialization, and ld.so. You don't strike me as an execve, libc, or ld.so developer.
> I do like us starting with just "mimmutable()", since it already > exists. Particularly if chrome already knows how to use it. > > Maybe add a flag field (require it to be zero initially) just to allow > any future expansion. Maybe the chrome team has *wanted* to have some > finer granularity thing and currently doesn't use mimmutable() in some > case? Yes, we do have a use case in Chrome to split the sealing into unmap and mprotect which will allow us to seal additional pages that we can't seal with pure mimmutable(). For example, we have pkey-tagged RWX memory that we want to seal. Since the memory is already RWX and the pkey controls write access, we don't care about permission changes but sometimes we do need to mprotect data only pages. But the munmap sealing will provide protection against implicit changes of the pkey in this case which would happen if a page gets unmapped and another mapped in its place.
> > IMO: The approaches mimmutable() and mseal() took are different, but > > we all want to seal the memory from attackers and make the linux > > application safer. > > I think you are building mseal for chrome, and chrome alone. > > I do not think this will work out for the rest of the application space > because > > 1) it is too complicated > 2) experience with mimmutable() says that applications don't do any of it > themselves, it is all in execve(), libc initialization, and ld.so. > You don't strike me as an execve, libc, or ld.so developer. We do want to build this in a way that it can be applied automatically by ld.so and we appreciate all your feedback on this. The intention of splitting the sealing by syscall was to provide flexibility while still allowing ld.so to seal all operations. But it's clear from the feedback that both the fine grained split and the per-syscall approach are not the right way to go. Does Linus' proposal to just split munmap / mprotect sealing address your complexity concerns? ld.so would always use both flags which should then behave similar to mimmutable().
Hi Pedro Some followup on mmap() + mprotect(): On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote: > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > > > > > > > I think it's worth pointing out that this suggestion (with PROT_*) > > > > could easily integrate with mmap() and as such allow for one-shot > > > > mmap() + mseal(). > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it > > > > definitely sounds like a performance win as we halve the number of > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem > > > > like common patterns. > > > > > > > Yes. mmap() can support sealing as well, and memory is allocated as > > > immutable from begining. > > > This is orthogonal to mseal() though. > > > > I don't see how this can be orthogonal to mseal(). > > In the case we opt for adding PROT_ bits, we should more or less only > > need to adapt calc_vm_prot_bits(), and the rest should work without > > issues. > > vma merging won't merge vmas with different prots. The current > > interfaces (mmap and mprotect) would work just fine. > > In this case, mseal() or mimmutable() would only be needed if you need > > to set immutability over a range of VMAs with different permissions. > > > Agreed. By orthogonal, I meant we can have two APIs: > mmap() and mseal()/mprotect() > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable(). > Sealing can be applied after initial memory creation. > > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe > > The only annoying wrench in my plans here is that we have effectively > > run out of vm_flags bits in 32-bit architectures, so this approach as > > I described is not compatible with 32-bit. > > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later > > > changed to RO, for example, during symbol resolution. > > > The important point is that the application can decide what type of > > > sealing it wants, and when to apply it. There needs to be an api(), > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not > > > important to me. > > > > > > mprotect() in linux have the following signature: > > > int mprotect(void addr[.len], size_t len, int prot); > > > the prot bitmasks are all taken here. > > > I have not checked the prot field in mmap(), there might be bits left, > > > even not, we could have mmap2(), so that is not an issue. > > > > I don't see what you mean. We have plenty of prot bits left (32-bits, > > and we seem to have around 8 different bits used). > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :) > > > > The only issue seems to be that 32-bit ran out of vm_flags, but that > > can probably be worked around if need be. > > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not. > Apology that I was wrong previously and caused confusion. > > There is a slight difference in the syntax of mprotect and mseal. > Each time when mprotect() is called, the kernel takes all of RWX bits > and updates vm_flags, > In other words, the application sets/unset each RWX, and kernel takes it. > > In the mseal() case, the kernel will remember which seal types were > applied previously, and the application doesn’t need to repeat all > existing seal types in the next mseal(). Once a seal type is applied, > it can’t be unsealed. > > So if we want to use mprotect() for sealing, developers need to think > of sealing bits differently than the rest of prot bits. It is a > different programming model, might or might not be an obvious concept > to developers. > This probably doesn't matter much to developers. We can enforce the sealing bit to be the same as the rest of PROT bits. If mprotect() tries to unset sealing, it will fail. > There is a difference in input check and error handling as well. > for mseal(), if a given address range has a gap (unallocated memory), > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is > updated. > For mprotect(), some VMAs can be updated, till an error happens to a VMA. > This difference doesn't matter much. For mprotect()/mmap(), is Linux implementation limited by POSIX ? This can be made backward compatible. If there is no objection to adding linux specific values in mmap() and mprotect(), This works for me. Thanks for your input. -Jeff
On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote: > > Hi Pedro > > Some followup on mmap() + mprotect(): > > On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote: > > > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > > > > > > > > > > I think it's worth pointing out that this suggestion (with PROT_*) > > > > > could easily integrate with mmap() and as such allow for one-shot > > > > > mmap() + mseal(). > > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it > > > > > definitely sounds like a performance win as we halve the number of > > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD > > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem > > > > > like common patterns. > > > > > > > > > Yes. mmap() can support sealing as well, and memory is allocated as > > > > immutable from begining. > > > > This is orthogonal to mseal() though. > > > > > > I don't see how this can be orthogonal to mseal(). > > > In the case we opt for adding PROT_ bits, we should more or less only > > > need to adapt calc_vm_prot_bits(), and the rest should work without > > > issues. > > > vma merging won't merge vmas with different prots. The current > > > interfaces (mmap and mprotect) would work just fine. > > > In this case, mseal() or mimmutable() would only be needed if you need > > > to set immutability over a range of VMAs with different permissions. > > > > > Agreed. By orthogonal, I meant we can have two APIs: > > mmap() and mseal()/mprotect() > > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable(). > > Sealing can be applied after initial memory creation. > > > > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe > > > The only annoying wrench in my plans here is that we have effectively > > > run out of vm_flags bits in 32-bit architectures, so this approach as > > > I described is not compatible with 32-bit. > > > > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later > > > > changed to RO, for example, during symbol resolution. > > > > The important point is that the application can decide what type of > > > > sealing it wants, and when to apply it. There needs to be an api(), > > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not > > > > important to me. > > > > > > > > mprotect() in linux have the following signature: > > > > int mprotect(void addr[.len], size_t len, int prot); > > > > the prot bitmasks are all taken here. > > > > I have not checked the prot field in mmap(), there might be bits left, > > > > even not, we could have mmap2(), so that is not an issue. > > > > > > I don't see what you mean. We have plenty of prot bits left (32-bits, > > > and we seem to have around 8 different bits used). > > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :) > > > > > > The only issue seems to be that 32-bit ran out of vm_flags, but that > > > can probably be worked around if need be. > > > > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not. > > Apology that I was wrong previously and caused confusion. > > > > There is a slight difference in the syntax of mprotect and mseal. > > Each time when mprotect() is called, the kernel takes all of RWX bits > > and updates vm_flags, > > In other words, the application sets/unset each RWX, and kernel takes it. > > > > In the mseal() case, the kernel will remember which seal types were > > applied previously, and the application doesn’t need to repeat all > > existing seal types in the next mseal(). Once a seal type is applied, > > it can’t be unsealed. > > > > So if we want to use mprotect() for sealing, developers need to think > > of sealing bits differently than the rest of prot bits. It is a > > different programming model, might or might not be an obvious concept > > to developers. > > > This probably doesn't matter much to developers. > We can enforce the sealing bit to be the same as the rest of PROT bits. > If mprotect() tries to unset sealing, it will fail. Yep. Erroneous or malicious mprotects would all be caught. However, if we add a PROT_DOWNGRADEABLE (that could let you, lets say, mprotect() to less permissions or even downright munmap()) you'd want some care to preserve that bit when setting permissions. > > > There is a difference in input check and error handling as well. > > for mseal(), if a given address range has a gap (unallocated memory), > > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is > > updated. > > For mprotect(), some VMAs can be updated, till an error happens to a VMA. > > > This difference doesn't matter much. > > For mprotect()/mmap(), is Linux implementation limited by POSIX ? No. POSIX works merely as a baseline that UNIX systems aim towards. You can (and very frequently do) extend POSIX interfaces (in fact, it's how most of POSIX was written, through sheer "design-by-committee" on a bunch of UNIX systems' extensions). > This can be made backward compatible. > If there is no objection to adding linux specific values in mmap() and > mprotect(), > This works for me. Linux already has system-specific values for PROT_ (PROT_BTI, PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc). Whether this is the right interface is another question. I do like it a lot, but there's of course value in being compatible with existing solutions (like mimmutable()).
On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > For mprotect()/mmap(), is Linux implementation limited by POSIX ? > > No. POSIX works merely as a baseline that UNIX systems aim towards. > You can (and very frequently do) extend POSIX interfaces (in fact, > it's how most of POSIX was written, through sheer > "design-by-committee" on a bunch of UNIX systems' extensions). We can in extreme circumstances actually go further than that, and not only extend on POSIX requirements, but actively even violate them. It does need a very good reason, though, but it has happened when POSIX requirements were simply actively wrong. For example, at one point POSIX required int accept(int s, struct sockaddr *addr, size_t *addrlen); which was simply completely wrong. It's utter shite, and didn't actually match any reality. The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make it "size_t" was completely unacceptable. So we ignored it, told POSIX people that they were full of sh*t, and they eventually fixed it in the next version (by introducing a "socklen_t" that had better be the same as "int"). So POSIX can simply be wrong. Also, if it turns out that we had some ABI that wasn't POSIX-compatible, the whole "don't break user space" will take precedence over any POSIX concerns, and we will not "fix" our system call if people already use our old semantics. So in that case, we generally end up with a new system call (or new flags) instead. Or sometimes it just is such a small detail that nobody cares - POSIX also has a notion of documenting areas of non-conformance, and people who really care end up having notions like "conformance vs _strict_ conformance". Linus
Stephen Röttger <sroettger@google.com> wrote: > > > IMO: The approaches mimmutable() and mseal() took are different, but > > > we all want to seal the memory from attackers and make the linux > > > application safer. > > > > I think you are building mseal for chrome, and chrome alone. > > > > I do not think this will work out for the rest of the application space > > because > > > > 1) it is too complicated > > 2) experience with mimmutable() says that applications don't do any of it > > themselves, it is all in execve(), libc initialization, and ld.so. > > You don't strike me as an execve, libc, or ld.so developer. > > We do want to build this in a way that it can be applied automatically by ld.so > and we appreciate all your feedback on this. Hi Stephen, I am pretty sure your mechanism will be useable by ld.so. What bothers me is the complex many-bits approach may encourage people to set only a subset of the bits, and then believe they have a security primitive. Partial sealing is not safe. I define partial sealing as "blocking munmap, but not mprotect". Or "blocking mprotect, but not madvise or mmap". In Message-id <ZS/3GCKvNn5qzhC4@casper.infradead.org> Matthew stated there that there are two aspects being locked: which object is mapped, and the permission of that mapping. When additional system calls msync() and madvise() are included in the picture, there are 3 actions being prevented: - Can someone replace the object - Can someone change the permission - Can someone throw away the cached pages, reverting to original content of the object (that is the madvise / msync) In Message-id: <CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com> Jan reminded us of this piece. I'm taking this as a long-standing security hole in some sub-operations of msync/madvise which can write to data regions that aren't actually writeable. Sub-operations with this problem are MADV_FREE, MADV_DONTNEED, POSIX_MADV_DONTNEED, MS_INVALIDATE.. on Linux MADV_WIPEONFORK, and probably a whole bunch of others. I am testing OpenBSD changes which demand PROT_WRITE permission for these sub-operations. Perhaps some systems are already careful. If you leave any of these operators available, the object is not actually sealed against abuse. I believe an attacker will simply switch to a different operator (mmap, munmap, mprotect, madvise, msync) to achieve a similar objective of damaging the permission or contents. Since mseal() is designed to create partial sealings, the name of the proposed system call really smells. > The intention of > splitting the sealing > by syscall was to provide flexibility while still allowing ld.so to > seal all operations. Yes, you will have ld.so set all the bits, and the same in C runtime initialization. If you convince glibc to stop make the stack executable in dlopen(), the kernel could automatically do it.. With Linux backwards compat management, getting there would be an extremely long long long roadmap. But anyways the idea would be "set all the bits". Because otherwise the object or data isn't safe. > Does Linus' proposal to just split munmap / mprotect sealing address your > complexity concerns? ld.so would always use both flags which should then behave > similar to mimmutable(). No, I think it is weak, because it isn't sealed. A seperate mail in the thread from you says this is about chrome wanting to use PKU on RWX objects. I think that's the reason for wanting to seperate the sealing (I haven't heard of other applications wanting that). How about we explore that in the other subthread..
Stephen Röttger <sroettger@google.com> wrote: > > I do like us starting with just "mimmutable()", since it already > > exists. Particularly if chrome already knows how to use it. > > > > Maybe add a flag field (require it to be zero initially) just to allow > > any future expansion. Maybe the chrome team has *wanted* to have some > > finer granularity thing and currently doesn't use mimmutable() in some > > case? > > Yes, we do have a use case in Chrome to split the sealing into unmap and > mprotect which will allow us to seal additional pages that we can't seal with > pure mimmutable(). > For example, we have pkey-tagged RWX memory that we want to seal. Since > the memory is already RWX and the pkey controls write access, we don't care > about permission changes but sometimes we do need to mprotect data only > pages. Let me try to decompose this statement. This is clearly for the JIT. You can pivot between the a JIT generated code mapping being RW and RX (or X-only), the object will pivot between W or X to satisfy W^X policy for safety. I think you are talking about a RWX MAP_ANON object. Then you use pkey_alloc() to get a PKEY. pkey_mprotect() sets the PKEY on the region. I argue you can then make it entirely immutable / sealed. Let's say it is fully immutable / sealed. After which, you can change the in-processor PKU register (using pkey_set) to toggle the Write-Inhibit and eXecute-Inhibit bits on that key. The immutable object has a dangerous RWX permission. But the specific PKEY making it either RX (or X-only) or RW depending upon your context. The mapping is never exposed as RWX. The PKU model reduces the permission access of the object below the immutable permission level. The security depends on the PKEY WI/XI bits being difficult to control. SADLY on x86, this is managed with a PKRU userland register which is changeble without any supervisor control -- yes, it seems quite dangerous. Changing it requires a complicated MSR dance. It is unfortunate that the pkey_set() library function is easily reachedable in the PLT via ROP methods. On non-x86 cpus that have similar functionality, the register is privileged, but operating supporting it generally change it and return immediately. The problem you seem to have with fully locked mseal() in chrome seems to be here: > about permission changes but sometimes we do need to mprotect data only > pages. Does that data have to be in the same region? Can your allocator not put the non-code pieces of the JIT elsewhere, with a different permission, fully immutable / msealed -- and perhaps even managed with a different PKEY if neccessary? May that requires a huge rearchitecture. But isn't the root problem here that data and code are being handled in the same object with a shared permission model? > But the munmap sealing will provide protection against implicit changes of the > pkey in this case which would happen if a page gets unmapped and another > mapped in its place. That primitive feels so weird, I have a difficult time believing it will remain unattackable in the long term. But what if you could replace mprotect() with pkey_mprotect() upon a different key.. ? -- A few more notes comparing what OpenBSD has done compared to Linux: In OpenBSD, we do not have the pkey library. We have stolen one of the PKEY and use it for kernel support of xonly for kernel code and userland code. On x86 we recognize that userland can flip the permission by whacking the RPKU register -- which would make xonly code readable. (The chrome data you are trying to guard faces the same problem). To prevent that, a majority of traps in the kernel (page faults, interrupts, etc) check if the PKRU register has been modified, and kill the process. It is statistically strong. We are not making pkey available as a userland feature, but if we later do so we would still have 15 pkeys to play with. We would probably make the pkey_set() operation a system call, so the trap handler can also observe RPKU register modifications by the instruction. Above, I mentioned pivoting between "RW or RX (or X-only)". On OpenBSD, chrome would be able to pivot between RW and X-only. When it comes to Pkey utilization, we've ended up in a very different place than Linux.
On Thu, Oct 19, 2023 at 3:47 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote: > > > > Hi Pedro > > > > Some followup on mmap() + mprotect(): > > > > On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote: > > > > > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > > > > > > > > > > > > > I think it's worth pointing out that this suggestion (with PROT_*) > > > > > > could easily integrate with mmap() and as such allow for one-shot > > > > > > mmap() + mseal(). > > > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it > > > > > > definitely sounds like a performance win as we halve the number of > > > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD > > > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem > > > > > > like common patterns. > > > > > > > > > > > Yes. mmap() can support sealing as well, and memory is allocated as > > > > > immutable from begining. > > > > > This is orthogonal to mseal() though. > > > > > > > > I don't see how this can be orthogonal to mseal(). > > > > In the case we opt for adding PROT_ bits, we should more or less only > > > > need to adapt calc_vm_prot_bits(), and the rest should work without > > > > issues. > > > > vma merging won't merge vmas with different prots. The current > > > > interfaces (mmap and mprotect) would work just fine. > > > > In this case, mseal() or mimmutable() would only be needed if you need > > > > to set immutability over a range of VMAs with different permissions. > > > > > > > Agreed. By orthogonal, I meant we can have two APIs: > > > mmap() and mseal()/mprotect() > > > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable(). > > > Sealing can be applied after initial memory creation. > > > > > > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe > > > > The only annoying wrench in my plans here is that we have effectively > > > > run out of vm_flags bits in 32-bit architectures, so this approach as > > > > I described is not compatible with 32-bit. > > > > > > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later > > > > > changed to RO, for example, during symbol resolution. > > > > > The important point is that the application can decide what type of > > > > > sealing it wants, and when to apply it. There needs to be an api(), > > > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not > > > > > important to me. > > > > > > > > > > mprotect() in linux have the following signature: > > > > > int mprotect(void addr[.len], size_t len, int prot); > > > > > the prot bitmasks are all taken here. > > > > > I have not checked the prot field in mmap(), there might be bits left, > > > > > even not, we could have mmap2(), so that is not an issue. > > > > > > > > I don't see what you mean. We have plenty of prot bits left (32-bits, > > > > and we seem to have around 8 different bits used). > > > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :) > > > > > > > > The only issue seems to be that 32-bit ran out of vm_flags, but that > > > > can probably be worked around if need be. > > > > > > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not. > > > Apology that I was wrong previously and caused confusion. > > > > > > There is a slight difference in the syntax of mprotect and mseal. > > > Each time when mprotect() is called, the kernel takes all of RWX bits > > > and updates vm_flags, > > > In other words, the application sets/unset each RWX, and kernel takes it. > > > > > > In the mseal() case, the kernel will remember which seal types were > > > applied previously, and the application doesn’t need to repeat all > > > existing seal types in the next mseal(). Once a seal type is applied, > > > it can’t be unsealed. > > > > > > So if we want to use mprotect() for sealing, developers need to think > > > of sealing bits differently than the rest of prot bits. It is a > > > different programming model, might or might not be an obvious concept > > > to developers. > > > > > This probably doesn't matter much to developers. > > We can enforce the sealing bit to be the same as the rest of PROT bits. > > If mprotect() tries to unset sealing, it will fail. > > Yep. Erroneous or malicious mprotects would all be caught. However, if > we add a PROT_DOWNGRADEABLE (that could let you, lets say, mprotect() > to less permissions or even downright munmap()) you'd want some care > to preserve that bit when setting permissions. > > > > > > There is a difference in input check and error handling as well. > > > for mseal(), if a given address range has a gap (unallocated memory), > > > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is > > > updated. > > > For mprotect(), some VMAs can be updated, till an error happens to a VMA. > > > > > This difference doesn't matter much. > > > > For mprotect()/mmap(), is Linux implementation limited by POSIX ? > > No. POSIX works merely as a baseline that UNIX systems aim towards. > You can (and very frequently do) extend POSIX interfaces (in fact, > it's how most of POSIX was written, through sheer > "design-by-committee" on a bunch of UNIX systems' extensions). > > > This can be made backward compatible. > > If there is no objection to adding linux specific values in mmap() and > > mprotect(), > > This works for me. > > Linux already has system-specific values for PROT_ (PROT_BTI, > PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc). > Whether this is the right interface is another question. I do like it > a lot, but there's of course value in being compatible with existing > solutions (like mimmutable()). > Thanks Pedro for providing examples on mm extension to POSIX. This opens more design options on solving the sealing problem. I will take a few days to research design options. -Jeff > -- > Pedro
On Thu, Oct 19, 2023 at 4:06 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote: > > > > > > For mprotect()/mmap(), is Linux implementation limited by POSIX ? > > > > No. POSIX works merely as a baseline that UNIX systems aim towards. > > You can (and very frequently do) extend POSIX interfaces (in fact, > > it's how most of POSIX was written, through sheer > > "design-by-committee" on a bunch of UNIX systems' extensions). > > We can in extreme circumstances actually go further than that, and not > only extend on POSIX requirements, but actively even violate them. > > It does need a very good reason, though, but it has happened when > POSIX requirements were simply actively wrong. > > For example, at one point POSIX required > > int accept(int s, struct sockaddr *addr, size_t *addrlen); > > which was simply completely wrong. It's utter shite, and didn't > actually match any reality. > > The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make > it "size_t" was completely unacceptable. > > So we ignored it, told POSIX people that they were full of sh*t, and > they eventually fixed it in the next version (by introducing a > "socklen_t" that had better be the same as "int"). > > So POSIX can simply be wrong. > > Also, if it turns out that we had some ABI that wasn't > POSIX-compatible, the whole "don't break user space" will take > precedence over any POSIX concerns, and we will not "fix" our system > call if people already use our old semantics. > > So in that case, we generally end up with a new system call (or new > flags) instead. > > Or sometimes it just is such a small detail that nobody cares - POSIX > also has a notion of documenting areas of non-conformance, and people > who really care end up having notions like "conformance vs _strict_ > conformance". > > Linus > Thanks Linus for clarifying the guidelines on POSIX in Linux. -Jeff
> The problem you seem to have with fully locked mseal() in chrome seems > to be here: > > > about permission changes but sometimes we do need to mprotect data only > > pages. > > Does that data have to be in the same region? Can your allocator not > put the non-code pieces of the JIT elsewhere, with a different > permission, fully immutable / msealed -- and perhaps even managed with a > different PKEY if neccessary? No we can't. We investigated this extensively since this also poses some difficulties on MacOS. We implemented different approaches but any such change to the allocator introduces too much of a performance impact.
From: Jeff Xu <jeffxu@google.com> This patchset proposes a new mseal() syscall for the Linux kernel. Modern CPUs support memory permissions such as RW and NX bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves security stance on memory corruption bugs, i.e. the attacker can’t just write to arbitrary memory and point the code to it, the memory has to be marked with X bit, or else an exception will happen. Memory sealing additionally protects the mapping itself against modifications. This is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management syscall. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. The new mseal() is an architecture independent syscall, and with following signature: mseal(void addr, size_t len, unsigned int types, unsigned int flags) addr/len: memory range. Must be continuous/allocated memory, or else mseal() will fail and no VMA is updated. For details on acceptable arguments, please refer to comments in mseal.c. Those are also fully covered by the selftest. types: bit mask to specify which syscall to seal, currently they are: MM_SEAL_MSEAL 0x1 MM_SEAL_MPROTECT 0x2 MM_SEAL_MUNMAP 0x4 MM_SEAL_MMAP 0x8 MM_SEAL_MREMAP 0x10 Each bit represents sealing for one specific syscall type, e.g. MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask is that the API is extendable, i.e. when needed, the sealing can be extended to madvise, mlock, etc. Backward compatibility is also easy. The kernel will remember which seal types are applied, and the application doesn’t need to repeat all existing seal types in the next mseal(). Once a seal type is applied, it can’t be unsealed. Call mseal() on an existing seal type is a no-action, not a failure. MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type. Internally, vm_area_struct adds a new field vm_seals, to store the bit masks. For the affected syscalls, such as mprotect, a check(can_modify_mm) for sealing is added, this usually happens at the early point of the syscall, before any update is made to VMAs. The effect of that is: if any of the VMAs in the given address range fails the sealing check, none of the VMA will be updated. It might be worth noting that this is different from the rest of mprotect(), where some updates can happen even when mprotect returns fail. Consider can_modify_mm only checks vm_seals in vm_area_struct, and it is not going deeper in the page table or updating any HW, success or none behavior might fit better here. I would like to listen to the community's feedback on this. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5], Chrome browser in ChromeOS will be the first user of this API. In addition, Stephen is working on glibc change to add sealing support into the dynamic linker to seal all non-writable segments at startup. When that work is completed, all applications can automatically benefit from these new protections. [1] https://kernelnewbies.org/Linux_2_6_8 [2] https://v8.dev/blog/control-flow-integrity [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 [4] https://man.openbsd.org/mimmutable.2 [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc Jeff Xu (8): Add mseal syscall Wire up mseal syscall mseal: add can_modify_mm and can_modify_vma mseal: seal mprotect mseal munmap mseal mremap mseal mmap selftest mm/mseal mprotect/munmap/mremap/mmap arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + fs/aio.c | 5 +- include/linux/mm.h | 55 +- include/linux/mm_types.h | 7 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 5 +- include/uapi/linux/mman.h | 6 + ipc/shm.c | 3 +- kernel/sys_ni.c | 1 + mm/Kconfig | 8 + mm/Makefile | 1 + mm/internal.h | 4 +- mm/mmap.c | 49 +- mm/mprotect.c | 6 + mm/mremap.c | 19 +- mm/mseal.c | 328 +++++ mm/nommu.c | 6 +- mm/util.c | 8 +- tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/mseal_test.c | 1428 +++++++++++++++++++ 37 files changed, 1934 insertions(+), 28 deletions(-) create mode 100644 mm/mseal.c create mode 100644 tools/testing/selftests/mm/mseal_test.c