mbox series

[v8,0/4] Introduce mseal

Message ID 20240131175027.3287009-1-jeffxu@chromium.org (mailing list archive)
Headers show
Series Introduce mseal | expand

Message

Jeff Xu Jan. 31, 2024, 5:50 p.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.

Two system calls are involved in sealing the map:  mmap() and mseal().

The new mseal() is an syscall on 64 bit CPU, and with
following signature:

int mseal(void addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size,
   via munmap() and mremap(), can leave an empty space, therefore can
   be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
   via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
   risks to sealed VMAs. It is included anyway because the use case is
   unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous
   memory, when users don't have write permission to the memory. Those
   behaviors can alter region contents by discarding pages, effectively a
   memset(0) for anonymous memory.

In addition: mmap() has two related changes.

The PROT_SEAL bit in prot field of mmap(). When present, it marks
the map sealed since creation.

The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks
the map as sealable. A map created without MAP_SEALABLE will not support
sealing, i.e. mseal() will fail.

Applications that don't care about sealing will expect their behavior
unchanged. For those that need sealing support, opt-in by adding
MAP_SEALABLE in mmap().

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable, the lifetime of those mappings
are tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings are not tied to the lifetime of the process, therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory. For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.

Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
  destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insight gained from
  implementing mimmutable() in OpenBSD.

Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address) 
- Update mseal.rst to add note for MAP_SEALABLE.

V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.org/T/

V6:
- Drop RFC from subject, Given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024) 
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/

V5:
- fix build issue in mseal-Wire-up-mseal-syscall
  (Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r

V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/

V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
  MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
  both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
  destructive operations of madvise. (Suggested by Jann Horn and
  Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.org/

v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/

v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/

----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/

Jeff Xu (4):
  mseal: Wire up mseal syscall
  mseal: add mseal syscall
  selftest mm/mseal memory sealing
  mseal:add documentation

 Documentation/userspace-api/index.rst       |    1 +
 Documentation/userspace-api/mseal.rst       |  215 ++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/mman-common.h      |    8 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 kernel/sys_ni.c                             |    1 +
 mm/Makefile                                 |    4 +
 mm/internal.h                               |   48 +
 mm/madvise.c                                |   12 +
 mm/mmap.c                                   |   35 +-
 mm/mprotect.c                               |   10 +
 mm/mremap.c                                 |   31 +
 mm/mseal.c                                  |  343 ++++
 tools/testing/selftests/mm/.gitignore       |    1 +
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 2024 +++++++++++++++++++
 33 files changed, 2756 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/userspace-api/mseal.rst
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

Comments

Liam R. Howlett Jan. 31, 2024, 7:34 p.m. UTC | #1
Please add me to the Cc list of these patches.

* jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> This patchset proposes a new mseal() syscall for the Linux kernel.
> 
> In a nutshell, mseal() protects the VMAs of a given virtual memory
> range against modifications, such as changes to their permission bits.
> 
> Modern CPUs support memory permissions, such as the read/write (RW)
> and no-execute (NX) bits. Linux has supported NX since the release of
> kernel version 2.6.8 in August 2004 [1]. The memory permission feature
> improves the security stance on memory corruption bugs, as an attacker
> cannot simply write to arbitrary memory and point the code to it. The
> memory must be marked with the X bit, or else an exception will occur.
> Internally, the kernel maintains the memory permissions in a data
> structure called VMA (vm_area_struct). mseal() additionally protects
> the VMA itself against modifications of the selected seal type.

... The v8 cut Jonathan's email discussion [1] off and instead of
replying there, I'm going to add my question here.

The best plan to ensure it is a general safety measure for all of linux
is to work with the community before it lands upstream.  It's much
harder to change functionality provided to users after it is upstream.
I'm happy to hear google is super excited about sharing this, but so
far, the community isn't as excited.

It seems Theo has a lot of experience trying to add a feature very close
to what you are doing and has real data on how this went [2].  Can we
see if there is a solution that is, at least, different enough from what
he tried to do for a shot of success?  Do we have anyone in the
toolchain groups that sees this working well?  If this means Stephen
needs to do something, can we get that to happen please?

I mean, you specifically state that this is a 'very specific
requirement' in your cover letter.  Does this mean even other browsers
have no use for it?

I am very concerned this feature will land and have to be maintained by
the core mm people for the one user it was specifically targeting.

Can we also get some benchmarking on the impact of this feature?  I
believe my answer in v7 removed the worst offender, but since there is
no benchmarking we really are guessing (educated or not, hard data would
help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
(and mreamp syscall?).

You also did not clean up the loop you copied from mlock, which I
pointed out [3].  Stating that your copy/paste is easier to review is
not sufficient to keep unneeded assignments around.

[1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/
[2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/
[3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/
Jeff Xu Feb. 1, 2024, 1:27 a.m. UTC | #2
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:
>
> Please add me to the Cc list of these patches.
Ok.
>
> * jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
> > This patchset proposes a new mseal() syscall for the Linux kernel.
> >
> > In a nutshell, mseal() protects the VMAs of a given virtual memory
> > range against modifications, such as changes to their permission bits.
> >
> > Modern CPUs support memory permissions, such as the read/write (RW)
> > and no-execute (NX) bits. Linux has supported NX since the release of
> > kernel version 2.6.8 in August 2004 [1]. The memory permission feature
> > improves the security stance on memory corruption bugs, as an attacker
> > cannot simply write to arbitrary memory and point the code to it. The
> > memory must be marked with the X bit, or else an exception will occur.
> > Internally, the kernel maintains the memory permissions in a data
> > structure called VMA (vm_area_struct). mseal() additionally protects
> > the VMA itself is against modifications of the selected seal type.
>
> ... The v8 cut Jonathan's email discussion [1] off and
> instead of
> replying there, I'm going to add my question here.
>
> The best plan to ensure it is a general safety measure for all of linux
> is to work with the community before it lands upstream.  It's much
> harder to change functionality provided to users after it is upstream.
> I'm happy to hear google is super excited about sharing this, but so
> far, the community isn't as excited.
>
> It seems Theo has a lot of experience trying to add a feature very close
> to what you are doing and has real data on how this went [2].  Can we
> see if there is a solution that is, at least, different enough from what
> he tried to do for a shot of success?  Do we have anyone in the
> toolchain groups that sees this working well?  If this means Stephen
> needs to do something, can we get that to happen please?
>
For Theo's input from OpenBSD's perspective;
IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same
scope on what operations to seal, e.g. considering the progress made
on both sides since the beginning of the RFC:
- mseal(Linux): dropped "multiple-bit" approach.
- mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED).

The difference is in mmap(), i.e.
- mseal(Linux): support of PROT_SEAL in mmap().
- mseal(Linux): use of MAP_SEALABLE in mmap().

I considered Theo's inputs from OpenBSD's perspective regarding the
difference, and I wasn't convinced that Linux should remove these. In
my view, those are two different kernels code, and the difference in
Linux is not added without reasons (for MAP_SEALABLE, there is a note
in the documentation section with details).

I would love to hear more from Linux developers on this.

> I mean, you specifically state that this is a 'very specific
> requirement' in your cover letter.  Does this mean even other browsers
> have no use for it?
>
No, I don’t mean “other browsers have no use for it”.

About specific requirements from Chrome, that refers to "The lifetime
of those mappings are not tied to the lifetime of the process, which
is not the case of libc" as in the cover letter. This addition to the
cover letter was made in V3, thus, it might be beneficial to provide
additional context to help answer the question.

This patch series begins with multiple-bit approaches (v1,v2,v3), the
rationale for this is that I am uncertain if Chrome's specific needs
are common enough for other use cases.  Consequently, I am unable to
make this decision myself without input from the community. To
accommodate this, multiple bits are selected initially due to their
adaptability.

Since V1, after hearing from the community, Chrome has changed its
design (no longer relying on separating out mprotect), and Linus
acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
today mseal() has a simple design that:
 - meet Chrome's specific needs.
 - meet Libc's needs.
 - Chrome's specific need doesn't interfere with Libc's.

[1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

> I am very concerned this feature will land and have to be maintained by
> the core mm people for the one user it was specifically targeting.
>
See above. This feature is not specifically targeting Chrome.

> Can we also get some benchmarking on the impact of this feature?  I
> believe my answer in v7 removed the worst offender, but since there is
> no benchmarking we really are guessing (educated or not, hard data would
> help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> (and mreamp syscall?).
>
Yes. There is an extra loop in mmap(FIXED), munmap(),
madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
address range. I suspect the impact would be low, but having some hard
data would be good. I will see what I can find to assist the perf
testing. If you have a specific test suite in mind, I can also try it.

> You also did not clean up the loop you copied from mlock, which I
> pointed out [3].  Stating that your copy/paste is easier to review is
> not sufficient to keep unneeded assignments around.
>
OK.

> [1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/
> [2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/
> [3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/
Theo de Raadt Feb. 1, 2024, 1:46 a.m. UTC | #3
Jeff Xu <jeffxu@chromium.org> wrote:

> I considered Theo's inputs from OpenBSD's perspective regarding the
> difference, and I wasn't convinced that Linux should remove these. In
> my view, those are two different kernels code, and the difference in
> Linux is not added without reasons (for MAP_SEALABLE, there is a note
> in the documentation section with details).

That note is describing a fiction.

> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.

But I'll repeat for others to stop this train wreck:


1. When execve() maps a programs's .data section, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

   Does the kernel seal the .data section?  It cannot, because of RELRO
   and IFUNCS.  Do you know what those are?  (like in OpenBSD) the kernel
   cannot and will *not* seal the .data section, it lets later code do that.

2. When execve() maps a programs's .bss section, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

   Does the kernel seal the .bss section?  It cannot, because of RELRO
   and IFUNCS.  Do you know what those are?  (like in OpenBSD) the kernel
   cannot and will *not* seal the .bss section, it lets later code do that.

In the proposed diff, the kernel does not set MAP_SEALABLE on those
regions.

How does a userland program seal the .data and .bss regions?

It cannot.  It is too late to set the MAP_SEALABLE, because the kernel
already decided not do to it.

So those regions cannot be sealed.

3. When execve() maps a programs's stack, does the kernel set
   MAP_SEALABLE on that region?  Or does it not set MAP_SEALABLE?

In the proposed diff, the kernel does not set MAP_SEALABLE.

You think you can seal the stack in the kernel??  Sorry to be the bearer
of bad news, but glibc has code which on occasion will mprotects the
stack executable.

But if userland decides that mprotect case won't occur -- how does a
userland program seal its stack?  It is now too late to set MAP_SEALABLE.

So the stack must remain unsealed.

4. What about the text segment?

5. Do you know what a text-relocation is?  They are now rare, but there
   are still compile/linker stages which will produce them, and there is
   software which requires that to work.  It means userland fixes it's
   own .text, then calls mprotect.  The kernel does not know if this will
   happen.

6. When execve() maps the .text segment, will it set MAP_SEALABLE?

If it doesn't set it, userland cannot seal it's text after it makes the
decision to do.


You can continue to extrapolate those same points for all other segments
of a static binary, all segments of a dynamic binary, all segments of the
shared library linker.

And then you can go further, and recognize the logic that will be needed
in the shared library linker to *make the same decisions*.

In each case, the *decision* to make a mapping happens in one piece of
code, and the decision to use and NOW SEAL THAT MAPPING, happens in a
different piece of code.


The only answer to these problems will be to always set MAP_SEALABLE.
To go through the entire Linux ecosystem, and change every call to mmap()
to use this new MAP_SEALABLE flag, and it will look something like this:

+#ifndef MAP_SEALABLE
+#define MAP_SEALABLE 0
+#endif
-	ptr = mmap(...., MAP...
-	ptr = mmap(...., MAP_SEALABLE | MAP...

Every single one of them, and you'll need to do it in the kernel.




If you had spent a second trying to make this work in a second piece of
software, you would have realized that the ONLY way this could work
is by adding a flag with the opposite meaning:

   MAP_NOTSEALABLE

But nothing will use that.  I promise you


> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.
Theo de Raadt Feb. 1, 2024, 1:55 a.m. UTC | #4
I'd like to propose a new flag to the Linux open() system call.

It is

   O_DUPABLE

You mix it with other O_* flags to the open call, everyone is familiar
with this, it is very easy to use.

If the O_DUPABLE flag is set, the file descriptor may be cloned with
dup(), dup2() or similar call.  If not set, those calls will return with
-1 EPERM.

I know it goes strongly against the grain of ancient assumptions that
file descriptors (just like memory) are fully mutable, and therefore
managed with care.  But in these trying times, we need protection against
file descriptor desecration.

It protects programmers from accidentally making clones of file
descriptors and leaking them out of programs, like I dunno, runc.
OK, besides this one very specific place that could (maybe) use
it today, there is other code which can use this but the margin is too narrow to contain.

The documentation can describe the behaviour as similar to MAP_SEALABLE,
so that noone is shocked.

/sarc
Bird, Tim Feb. 1, 2024, 4:56 p.m. UTC | #5
> -----Original Message-----
> From: Theo de Raadt <deraadt@openbsd.org>
> > I would love to hear more from Linux developers on this.
> 
> I'm not sure you are capable of listening.
> 

Theo,

It is possible to make your technical points, and even to express frustration that it has
been difficult to get them across, without resorting to personal attacks.

 -- Tim
Liam R. Howlett Feb. 1, 2024, 8:45 p.m. UTC | #6
* Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> >
> > Please add me to the Cc list of these patches.
> Ok.
> >
> > * jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]:
> > > From: Jeff Xu <jeffxu@chromium.org>
> > >
> > > This patchset proposes a new mseal() syscall for the Linux kernel.
> > >
> > > In a nutshell, mseal() protects the VMAs of a given virtual memory
> > > range against modifications, such as changes to their permission bits.
> > >
> > > Modern CPUs support memory permissions, such as the read/write (RW)
> > > and no-execute (NX) bits. Linux has supported NX since the release of
> > > kernel version 2.6.8 in August 2004 [1]. The memory permission feature
> > > improves the security stance on memory corruption bugs, as an attacker
> > > cannot simply write to arbitrary memory and point the code to it. The
> > > memory must be marked with the X bit, or else an exception will occur.
> > > Internally, the kernel maintains the memory permissions in a data
> > > structure called VMA (vm_area_struct). mseal() additionally protects
> > > the VMA itself is against modifications of the selected seal type.
> >
> > ... The v8 cut Jonathan's email discussion [1] off and
> > instead of
> > replying there, I'm going to add my question here.
> >
> > The best plan to ensure it is a general safety measure for all of linux
> > is to work with the community before it lands upstream.  It's much
> > harder to change functionality provided to users after it is upstream.
> > I'm happy to hear google is super excited about sharing this, but so
> > far, the community isn't as excited.
> >
> > It seems Theo has a lot of experience trying to add a feature very close
> > to what you are doing and has real data on how this went [2].  Can we
> > see if there is a solution that is, at least, different enough from what
> > he tried to do for a shot of success?  Do we have anyone in the
> > toolchain groups that sees this working well?  If this means Stephen
> > needs to do something, can we get that to happen please?
> >
> For Theo's input from OpenBSD's perspective;
> IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same
> scope on what operations to seal, e.g. considering the progress made
> on both sides since the beginning of the RFC:
> - mseal(Linux): dropped "multiple-bit" approach.
> - mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED).
> 
> The difference is in mmap(), i.e.
> - mseal(Linux): support of PROT_SEAL in mmap().
> - mseal(Linux): use of MAP_SEALABLE in mmap().
> 
> I considered Theo's inputs from OpenBSD's perspective regarding the
> difference, and I wasn't convinced that Linux should remove these. In
> my view, those are two different kernels code, and the difference in
> Linux is not added without reasons (for MAP_SEALABLE, there is a note
> in the documentation section with details).
> 
> I would love to hear more from Linux developers on this.

Linus said it was really important to get the semantics correct, but you
took his (unfinished) list and kept going.  I think there are some
unanswered questions and that's frustrating some people as you may not
be valuing the experience they have in this area.

You dropped the RFC from the topic and incremented the version numbering
on the patch set. I thought it was customary to restart counting after
the RFC was complete?  Maybe I'm wrong, but it seemed a bit odd to see
that happen.  The documentation also implies there are still questions
to be answered, so it seems this is still an RFC in some ways?


I'd like to talk about the design some more.

Having to opt-in to allowing mseal will probably not work well.

Initial library mappings happen in one huge chunk then it's cut up into
smaller VMAs, at least that's what I see with my maple tree tracing.  If
you opt-in, then the entire library will have to opt-in and so the
'discourage inadvertent sealing' argument is not very strong.

It also makes a somewhat messy tracking of inheritance of the attribute
across splitting, MAP_FIXED replacement, vma_move, vma_copy.  I think
most of this is forced on the user?

It makes your call less flexible, it means you have to hope that the VMA
origin was blessed before you decide you want to mseal it.

What if you want to ensure the library mapped by a parent or on launch
is mseal'ed?

What about the initial relocated VMA (expand/shrink of VMA)?

Creating something as "non-sealable" is pointless.  If you don't want it
sealed, then don't mseal() that region.

If your use case doesn't need it, then can we please drop the opt-in
behaviour and just have all VMAs treated the same?

If it does need it, can you explain why?

The glibc relocation/fixup will then work.  glibc could mseal once it is
complete - or an application could bypass glibc support and use the
feature itself.

If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the
heap/stack concerns.  We can either let people shoot their own feet off
or try to protect them.

Right now, you seem to be trying to protect them.  Keeping with that, I
guess we could either get the kernel to mark those VMAs or tell some
other way?  I'd suggest a range, but people do very strange things with
these special VMAs [1].  I don't think you can predict enough crazy
actions to make a difference in trying to protect people.

There are far fewer VMAs that should not be allowed to be mseal'ed than
should be, and the kernel creates those so it seems logical to only let
the kernel opt-out on those ones.

I'd rather just let people shoot themselves and return an error.

I also hope it reduces the complexity of this code while increasing the
flexibility of the feature.  As stated before, we remove the dependency
of needing support from the initial loader.

Merging VMAs
I can see this going Very Bad with brk + mseal.  But, again, if someone
decides to mseal these VMAs then they should expect Bad Things to
happen (or maybe they know what they are doing even in some complex
situation?)

vma_merge() can also expand a VMA.  I think this is okay as it checks
for the same flags, so you will allow VMA expansion of two (or three)
vma areas to become one.  Is this okay in your model?

> 
> > I mean, you specifically state that this is a 'very specific
> > requirement' in your cover letter.  Does this mean even other browsers
> > have no use for it?
> >
> No, I don’t mean “other browsers have no use for it”.
> 
> About specific requirements from Chrome, that refers to "The lifetime
> of those mappings are not tied to the lifetime of the process, which
> is not the case of libc" as in the cover letter. This addition to the
> cover letter was made in V3, thus, it might be beneficial to provide
> additional context to help answer the question.
> 
> This patch series begins with multiple-bit approaches (v1,v2,v3), the
> rationale for this is that I am uncertain if Chrome's specific needs
> are common enough for other use cases.  Consequently, I am unable to
> make this decision myself without input from the community. To
> accommodate this, multiple bits are selected initially due to their
> adaptability.
> 
> Since V1, after hearing from the community, Chrome has changed its
> design (no longer relying on separating out mprotect), and Linus
> acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
> today mseal() has a simple design that:
>  - meet Chrome's specific needs.

How many VMAs will chrome have that are mseal'ed?  Is this a common
operation?

PROT_SEAL seems like an extra flag we could drop.  I don't expect we'll
be sealing enough VMAs that a hand full of extra syscalls would make a
difference?

>  - meet Libc's needs.

What needs of libc are you referring to?  I'm looking through the
version changelog and I guess you mean return EPERM?

>  - Chrome's specific need doesn't interfere with Libc's.
> 
> [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

Linus said he'd be happier if we made the change in general.

> 
> > I am very concerned this feature will land and have to be maintained by
> > the core mm people for the one user it was specifically targeting.
> >
> See above. This feature is not specifically targeting Chrome.
> 
> > Can we also get some benchmarking on the impact of this feature?  I
> > believe my answer in v7 removed the worst offender, but since there is
> > no benchmarking we really are guessing (educated or not, hard data would
> > help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> > (and mreamp syscall?).
> >
> Yes. There is an extra loop in mmap(FIXED), munmap(),
> madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
> address range. I suspect the impact would be low, but having some hard
> data would be good. I will see what I can find to assist the perf
> testing. If you have a specific test suite in mind, I can also try it.

You should look at mmtests [2]. But since you are adding loops across
VMA ranges, you need to test loops across several ranges of VMAs.  That
is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or
some subset of small and large numbers to get an idea of complexity we
are adding.  My hope is that the looping will be cache-hot in the maple
tree and have minimum effect.

In my personal testing, I've seen munmap often do a single VMA, or 3, or
more rare 7 on x86_64.  There should be some good starting points in
mmtests for the common operations.

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c
[2] https://github.com/gormanm/mmtests

Thanks,
Liam
Theo de Raadt Feb. 1, 2024, 10:24 p.m. UTC | #7
There is another problem with adding PROT_SEAL to the mprotect()
call.

What are the precise semantics?

If one reviews how mprotect() behaves, it is quickly clear that
it is very sloppy specification.  We spent quite a bit of effort
making our manual page as clear as possible to the most it gaurantees,
in the standard, and in all the various Unix:

     Not all implementations will guarantee protection on a page basis; the
     granularity of protection changes may be as large as an entire region.
     Nor will all implementations guarantee to give exactly the requested
     permissions; more permissions may be granted than requested by prot.
     However, if PROT_WRITE was not specified then the page will not be
     writable.

Anything else is different.

That is the specification in case of PROT_READ, PROT_WRITE, and PROT_EXEC.

What happens if you add additional PROT_* flags?

Does mprotect still behave just as sloppy (as specified)?

Or it now return an error partway through an operation?

When it returns the error, does it skip doing the work on the remaining
region?

Or does it skip doing any protection operation at all? (That means the code
has to do two passes over the region; first one checks if it may proceed,
second pass performs the change.  I think I've reat PROT_SEAL was supposed
to try to do things as one pass; is that actually possible without requiring
a second pass in the kernel?

To wit, do these two sequences have _exactly_ the same behaviour in
all cases that we can think of
    - unmapped sub-regions
    - sealed sub-regions
    - and who knows what else mprotect() may encounter

a)

    mprotect(addr, len, PROT_READ);
    mseal(addr, len, 0);

b)

    mprotect(addr, len, PROT_READ | PROT_SEAL);

Are they the same, or are they different?

Here's what I think: mprotect() behaves quite differently if you add
the PROT_SEAL flag, but I can't quite tell precisely what happens because
I don't understand the linux vm system enough.


(As an outsider, I have glanced at the new PROT_MTE flag changes; that
one seem to just "set a flag where possible", rather than performing
an action which could result in an error, and seems to not have this
problem).


As an outsider, Linux development is really strange:

Two sub-features are being pushed very hard, and the primary developer
doesn't have code which uses either of them.  And once it goes in, it
cannot be changed.

It's very different from my world, where the absolutely minimal
interface was written to apply to a whole operating system plus 10,000+
applications, and then took months of testing before it was approved for
inclusion.  And if it was subtly wrong, we would be able to change it.
Jeff Xu Feb. 1, 2024, 10:37 p.m. UTC | #8
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > I would love to hear more from Linux developers on this.
>
> Linus said it was really important to get the semantics correct, but you
> took his (unfinished) list and kept going.  I think there are some
> unanswered questions and that's frustrating some people as you may not
> be valuing the experience they have in this area.
>
Perhaps you didn't follow the discussions closely during the RFCs, so
I like to clarify the timeline:

- Dec.12:
RFC V3 was  out for comments: [1]
This version added MAP_SEALABLE and sealing type in mmap()
The sealing type in mmap() was suggested by  Pedro Falcato during V1. [2]
And MAP_SEALABLE is new to V3 and I added an open discussion in the
cover letter.

- Dec.14
Linus made a set of recommendations based on V3 [3], this is where
Linus mentioned the semantics.

Quoted below:
"Particularly for new system calls with fairly specialized use, I think
it's very important that the semantics are sensible on a conceptual
level, and that we do not add system calls that are based on "random
implementation issue of the day".

- Jan.4:
I sent out V4 of that patch for comments [5]
This version implements all of Linus's recommendations made in V3.

In V3, I didn't receive comments about MAP_SEALABLE, so I kept that as
an open discussion item in V4 and specifically mentioned it in the
first sentence of the V4 cover letter.

"This is V4 of the patch, the patch has improved significantly since V1,
thanks to diverse inputs, a few discussions remain, please read those
in the open discussion section of v4 of change history."

- Jan.4:
Linus  gave a comment on V4: [6]

Quoted below:
"Other than that, this seems all reasonable to me now."

To me, this means Linus is OK with the general signatures of the APIs.

-Jan.9
During comments for V5.
[7]  Kees suggested dropping RFC from subsequent versions, given
Linus's general approval
on the v4.

[1] https://lore.kernel.org/all/80897.1705769947@cvs.openbsd.org/T/#mbf4749d465b80a575e1eda3c6f0c66d995abfc39

[2]
https://lore.kernel.org/lkml/CAKbZUD2A+=bp_sd+Q0Yif7NJqMu8p__eb4yguq0agEcmLH8SDQ@mail.gmail.com/

[3]
https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

[4]
https://lore.kernel.org/all/CABi2SkUTdF6PHrudHTZZ0oWK-oU+T-5+7Eqnei4yCj2fsW2jHg@mail.gmail.com/#t

[5]
https://lore.kernel.org/lkml/796b6877-0548-4d2a-a484-ba4156104a20@infradead.org/T/#mb5c8bfe234759589cadf0bcee10eaa7e07b2301a

[6]
https://lore.kernel.org/lkml/CAHk-=wiy0nHG9+3rXzQa=W8gM8F6-MhsHrs_ZqWaHtjmPK4=FA@mail.gmail.com/

[7]
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/T/#m657fffd96ffff91902da53dc9dbc1bb093fe367c

> You dropped the RFC from the topic and increment the version numbering
> on the patch set. I thought it was customary to restart counting after
> the RFC was complete?  Maybe I'm wrong, but it seemed a bit odd to see
> that happen.  The documentation also implies there are still questions
> to be answered, so it seems this is still an RFC in some ways?
>
The RFC has been dropped since V6.
That said, I'm open to feedback from Linux developers.
I will respond to the rest of your email in seperate emails.

Best Regards.
-Jeff
Theo de Raadt Feb. 1, 2024, 10:54 p.m. UTC | #9
> To me, this means Linus is OK with the general signatures of the APIs.


Linus, you are in for a shock when the proposal doesn't work for glibc
and all the applications!
Linus Torvalds Feb. 1, 2024, 11:15 p.m. UTC | #10
On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus, you are in for a shock when the proposal doesn't work for glibc
> and all the applications!

Heh. I've enjoyed seeing your argumentative style that made you so
famous back in the days. Maybe it's always been there, but I haven't
seen the BSD people in so long that I'd forgotten all about it.

That said, famously argumentative or not, I think Theo is right, and I
do think the MAP_SEALABLE bit is nonsensical.

If somebody wants to mseal() a memory region, why would they need to
express that ahead of time?

So the part I think is sane is the mseal() system call itself, in that
it allows *potential* future expansion of the semantics.

But hopefully said future expansion isn't even needed, and all users
want the base experience, which is why I think PROT_SEAL (both to mmap
and to mprotect) makes sense as an alternative form.

So yes, to my mind

    mprotect(addr, len, PROT_READ);
    mseal(addr, len, 0);

should basically give identical results to

    mprotect(addr, len, PROT_READ | PROT_SEAL);

and using PROT_SEAL at mmap() time is similarly the same obvious
notion of "map this, and then seal that mapping".

The reason for having "mseal()" as a separate call at all from the
PROT_SEAL bit is that it does allow possible future expansion (while
PROT_SEAL is just a single bit, and it won't change semantics) but
also so that you can do whatever prep-work in stages if you want to,
and then just go "now we seal it all".

          Linus
Theo de Raadt Feb. 1, 2024, 11:43 p.m. UTC | #11
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So yes, to my mind
> 
>     mprotect(addr, len, PROT_READ);
>     mseal(addr, len, 0);
> 
> should basically give identical results to
> 
>     mprotect(addr, len, PROT_READ | PROT_SEAL);
> 
> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".

I think that isn't easy to do.  Let's expand it to show error checking.

    if (mprotect(addr, len, PROT_READ) == -1)
       react to the errno value
    if (mseal(addr, len, 0) == -1)
       react to the errno value

and

    if (mprotect(addr, len, PROT_READ | PROT_SEAL) == -1)
       react to the errno value

For current mprotect(), the errno values are mostly related to range
issues with the parameters.

After sealing a region, mprotect() also has the new errno EPERM.

But what is the return value supposed to be from "PROT_READ | PROT_SEAL"
over various sub-region types?

Say I have a region 3 pages long.  One page is unmapped, one page is
regular, and one page is sealed.  Re-arrange those 3 pages in all 6
permutations.  Try them all.

Does the returned errno change, based upon the order?
Does it do part of the operation, or all of the operation?

If the sealed page is first, the regular page is second, and the unmapped
page is 3rd, does it return an error or return 0?  Does it change the
permission on the 3rd page?  If it returns an error, has it changed any
permissions?

I don't think the diff follows the principle of

if an error is returned --> we know nothing was changed.
if success is returned --> we know all the requests were satisfied

> The reason for having "mseal()" as a separate call at all from the
> PROT_SEAL bit is that it does allow possible future expansion (while
> PROT_SEAL is just a single bit, and it won't change semantics) but
> also so that you can do whatever prep-work in stages if you want to,
> and then just go "now we seal it all".




How about you add basic mseal() that is maximum compatible with mimmutable(),
and then we can all talk about whether PROT_SEAL makes sense once there
are applications that demand it, and can prove they need it?
Theo de Raadt Feb. 2, 2024, 12:26 a.m. UTC | #12
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".

The usual way is:

    ptr = mmap(NULL, len PROT_READ|PROT_WRITE, ...)

    initialize region between ptr, ptr+len

    mprotect(ptr, len, PROT_READ)
    mseal(ptr, len, 0);


Our source tree contains one place where a locking happens very close
to a mmap().

It is the shared-library-linker 'hints file', this is a file that gets
mapped PROT_READ and then we lock it.

It feels like that could be one operation?  It can't be.

        addr = (void *)mmap(0, hsize, PROT_READ, MAP_PRIVATE, hfd, 0);
        if (_dl_mmap_error(addr))
                goto bad_hints;

        hheader = (struct hints_header *)addr;
        if (HH_BADMAG(*hheader) || hheader->hh_ehints > hsize)
                goto bad_hints;

	/* couple more error checks */

	mimmutable(addr, hsize);
	close(hfd);
	return (0);
bad_hints:
	munmap(addr, hsize);
	...

See the problem?  It unmaps it if the contents are broken.  So even that
case cannot use something like "PROT_SEAL".

These are not hypotheticals.  I'm grepping an entire Unix kernel and
userland source tree, and I know what 100,000+ applications do.  I found
piece of code that could almost use it, but upon inspection it can't,
and it is obvious why: it is best idiom to allow a programmer to insert
an inspection operation between two disctinct operations, and especially
critical if the 2nd operation cannot be reversed.

Noone needs PROT_SEAL as a shortcut operation in mmap() or mprotect().

Throwing around ideas without proving their use in practice is very
unscientific.
Greg Kroah-Hartman Feb. 2, 2024, 1:06 a.m. UTC | #13
On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
> As an outsider, Linux development is really strange:
> 
> Two sub-features are being pushed very hard, and the primary developer
> doesn't have code which uses either of them.  And once it goes in, it
> cannot be changed.
> 
> It's very different from my world, where the absolutely minimal
> interface was written to apply to a whole operating system plus 10,000+
> applications, and then took months of testing before it was approved for
> inclusion.  And if it was subtly wrong, we would be able to change it.

No, it's this "feature" submission that is strange to think that we
don't need that.  We do need, and will require, an actual working
userspace something to use it, otherwise as you say, there's no way to
actually know if it works properly or not and we can't change it once we
accept it.

So along those lines, Jeff, do you have a pointer to the Chrome patches,
or glibc patches, that use this new interface that proves that it
actually works?  Those would be great to see to at least verify it's
been tested in a real-world situation and actually works for your use
case.

thanks,

greg k-h
Jeff Xu Feb. 2, 2024, 3:14 a.m. UTC | #14
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > >
>
> Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.

> Initial library mappings happen in one huge chunk then it's cut up into
> smaller VMAs, at least that's what I see with my maple tree tracing.  If
> you opt-in, then the entire library will have to opt-in and so the
> 'discourage inadvertent sealing' argument is not very strong.
>
Regarding "The initial library mappings happen in one huge chunk then
it is cut up into smaller VMAS", this is not a problem.

As example of elf loading (fs/binfmt_elf.c), there is just a few
places to pass in what type of memory to be allocated, e.g.
MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can  add MAP_SEALABLE at those
places.
If glic does additional splitting on the memory range, by using
mprotect(), then the MAP_SEALABLE is automatically applied after
splitting.
If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).

> It also makes a somewhat messy tracking of inheritance of the attribute
> across splitting, MAP_FIXED replacement, vma_move, vma_copy.  I think
> most of this is forced on the user?
>
The inheritance is the same as other VMA flags.

> It makes your call less flexible, it means you have to hope that the VMA
> origin was blessed before you decide you want to mseal it.
>
> What if you want to ensure the library mapped by a parent or on launch
> is mseal'ed?
>
> What about the initial relocated VMA (expand/shrink of VMA)?
>
> Creating something as "non-sealable" is pointless.  If you don't want it
> sealed, then don't mseal() that region.
>
> If your use case doesn't need it, then can we please drop the opt-in
> behaviour and just have all VMAs treated the same?
>
> If it does need it, can you explain why?
>
> The glibc relocation/fixup will then work.  glibc could mseal once it is
> complete - or an application could bypass glibc support and use the
> feature itself.

Yes. That is the idea.

>
> If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the
> heap/stack concerns.  We can either let people shoot their own feet off
> or try to protect them.
>
> Right now, you seem to be trying to protect them.  Keeping with that, I
> guess we could either get the kernel to mark those VMAs or tell some
> other way?  I'd suggest a range, but people do very strange things with
> these special VMAs [1].  I don't think you can predict enough crazy
> actions to make a difference in trying to protect people.
>
> There are far fewer VMAs that should not be allowed to be mseal'ed than
> should be, and the kernel creates those so it seems logical to only let
> the kernel opt-out on those ones.
>
> I'd rather just let people shoot themselves and return an error.
>
> I also hope it reduces the complexity of this code while increasing the
> flexibility of the feature.  As stated before, we remove the dependency
> of needing support from the initial loader.
>
> Merging VMAs
> I can see this going Very Bad with brk + mseal.  But, again, if someone
> decides to mseal these VMAs then they should expect Bad Things to
> happen (or maybe they know what they are doing even in some complex
> situation?)
>
> vma_merge() can also expand a VMA.  I think this is okay as it checks
> for the same flags, so you will allow VMA expansion of two (or three)
> vma areas to become one.  Is this okay in your model?
>
> >
> > > I mean, you specifically state that this is a 'very specific
> > > requirement' in your cover letter.  Does this mean even other browsers
> > > have no use for it?
> > >
> > No, I don’t mean “other browsers have no use for it”.
> >
> > About specific requirements from Chrome, that refers to "The lifetime
> > of those mappings are not tied to the lifetime of the process, which
> > is not the case of libc" as in the cover letter. This addition to the
> > cover letter was made in V3, thus, it might be beneficial to provide
> > additional context to help answer the question.
> >
> > This patch series begins with multiple-bit approaches (v1,v2,v3), the
> > rationale for this is that I am uncertain if Chrome's specific needs
> > are common enough for other use cases.  Consequently, I am unable to
> > make this decision myself without input from the community. To
> > accommodate this, multiple bits are selected initially due to their
> > adaptability.
> >
> > Since V1, after hearing from the community, Chrome has changed its
> > design (no longer relying on separating out mprotect), and Linus
> > acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
> > today mseal() has a simple design that:
> >  - meet Chrome's specific needs.
>
> How many VMAs will chrome have that are mseal'ed?  Is this a common
> operation?
>
> PROT_SEAL seems like an extra flag we could drop.  I don't expect we'll
> be sealing enough VMAs that a hand full of extra syscalls would make a
> difference?
>
> >  - meet Libc's needs.
>
> What needs of libc are you referring to?  I'm looking through the
> version changelog and I guess you mean return EPERM?
>
I meant libc's sealing RO part of the elf binary, those memory's
lifetime are associated with the lifetime of the process.

> >  - Chrome's specific need doesn't interfere with Libc's.
> >
> > [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
>
> Linus said he'd be happier if we made the change in general.
>
> >
> > > I am very concerned this feature will land and have to be maintained by
> > > the core mm people for the one user it was specifically targeting.
> > >
> > See above. This feature is not specifically targeting Chrome.
> >
> > > Can we also get some benchmarking on the impact of this feature?  I
> > > believe my answer in v7 removed the worst offender, but since there is
> > > no benchmarking we really are guessing (educated or not, hard data would
> > > help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> > > (and mreamp syscall?).
> > >
> > Yes. There is an extra loop in mmap(FIXED), munmap(),
> > madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
> > address range. I suspect the impact would be low, but having some hard
> > data would be good. I will see what I can find to assist the perf
> > testing. If you have a specific test suite in mind, I can also try it.
>
> You should look at mmtests [2]. But since you are adding loops across
> VMA ranges, you need to test loops across several ranges of VMAs.  That
> is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or
> some subset of small and large numbers to get an idea of complexity we
> are adding.  My hope is that the looping will be cache-hot in the maple
> tree and have minimum effect.
>
> In my personal testing, I've seen munmap often do a single VMA, or 3, or
> more rare 7 on x86_64.  There should be some good starting points in
> mmtests for the common operations.
>
Thanks. Will do.


> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c
> [2] https://github.com/gormanm/mmtests
>
> Thanks,
> Liam
Jeff Xu Feb. 2, 2024, 3:20 a.m. UTC | #15
On Thu, Feb 1, 2024 at 3:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Linus, you are in for a shock when the proposal doesn't work for glibc
> > and all the applications!
>
> Heh. I've enjoyed seeing your argumentative style that made you so
> famous back in the days. Maybe it's always been there, but I haven't
> seen the BSD people in so long that I'd forgotten all about it.
>
> That said, famously argumentative or not, I think Theo is right, and I
> do think the MAP_SEALABLE bit is nonsensical.
>
> If somebody wants to mseal() a memory region, why would they need to
> express that ahead of time?
>
I like to look at things from the point of view of average Linux
userspace developers,  they might not have the same level of expertise
as the other folks on this email list or they might not have time and
mileage for those details.

To me, the most important thing is to deliver a feature that's easy to
use and works well. I don't want users to mess things up, so if I'm
the one giving them the tools, I'm going to make sure they have all
the information they need and that there are safeguards in place.

e.g. considering the following user case:
1> a security sensitive data is allocated from heap, using malloc,
from the software component A, and filled with information.
2> software component B then uses mprotect to change it to RO, and
seal it using mseal().

Yes. we could choose to allow it. But there are complications:

1> Is this the right pattern ? why don't component A already seal it
if they think it is important ?
2> Why  heap, why not mmap() a new memory mapping for that security data ?
3>  free() will not respect the situation of whether the memory is
sealed or not. How would a new developer know they probably shall
never free the sealed memory ?
4>  brk-shrink will never be able to pass the VMA that gets splited
out by mseal(), there are memory footprint implications to the
process.
5>  what if the security sensitive data happens to be the first VMA or
last VMA of the heap, will sealing the first VMA/last VMA cause any
issue there ? since they might carry important VMA flags ? ( I don't
know enough about brk.)
6> If we ever support sealing the heap for its entirety (make it not
executable), and still want to support other brk behaviors, such as
shrink/grow, would that conflict with current mseal(), if we allow it
on heap from beginning ?

Questions like that, without clear answers, to me it is premature to
already let developers start using  mseal() for heap.

And even if we have all the answers for heap, how about stack, or
other types of virtual memory ?

Again, I don't have enough knowledge to get a complete list that
shouldn't be sealed,  the input from Theo is none should I worry
about.  However it  is clearly not none to me, besides heap mentioned,
there is also aio/shm.

So MAP_SEALABLE is a conservative approach to limit the scope to ***
two known use cases *** that  I want to work on (libc and chrome) and
give  time needed to answer those questions. It is like a claim: only
those marked by MAP_SEALABLE support the sealing at this point of
time.

And MAP_SEALABLE is reversible, e.g. a sysctl could be added to make
all memory sealable in the future, or we could obsoleted it entirely
when time comes, an application that already passes MAP_SEALABLE can
be treated as noop. However, if all memory were allowed to be sealable
from the beginning, reversing that decision would be hard.

After those considerations, if MAP_SEALABLE is still not preferred by
you. Then I have the following options for you to choose:

1. MAP_NOT_SEALABLE in the mmap(). And I will use them for the
heap/aio/shm case.
This basically says Linux does not officially support sealing on
those,  until we support them, we discourage the sealing on those
mappings.

2. make MAP_NOT_SEALABLE only a kernel visible flag. So application
space won't be able to use it.

3. open for all, and list as much as details in the documentation.
 If we choose this route, I would like to have more discussion on the
heap/stack, at least the Linux developers will learn from those
discussions.

> So the part I think is sane is the mseal() system call itself, in that
> it allows *potential* future expansion of the semantics.
>
> But hopefully said future expansion isn't even needed, and all users
> want the base experience, which is why I think PROT_SEAL (both to mmap
> and to mprotect) makes sense as an alternative form.
>
> So yes, to my mind
>
>     mprotect(addr, len, PROT_READ);
>     mseal(addr, len, 0);
>
> should basically give identical results to
>
>     mprotect(addr, len, PROT_READ | PROT_SEAL);
>
> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".
>
> The reason for having "mseal()" as a separate call at all from the
> PROT_SEAL bit is that it does allow possible future expansion (while
> PROT_SEAL is just a single bit, and it won't change semantics) but
> also so that you can do whatever prep-work in stages if you want to,
> and then just go "now we seal it all".
>

To clarify: do you mean to have the following ?

mmap(PROT_READ|PROT_SEAL)
mseal(addr,len,0)
mprotect(addr,len,PROT_READ|PROT_SEAL) ?

I have to think about the mprotect() case.

For mmap(PROT_READ|PROT_SEAL),  I might  have a use case already:

fs/binfmt_elf.c
if (current->personality & MMAP_PAGE_ZERO) {
                /* Why this, you ask???  Well SVr4 maps page 0 as read-only,
                   and some applications "depend" upon this behavior.
                   Since we do not have the power to recompile these, we
                   emulate the SVr4 behavior. Sigh. */

                error = vm_mmap(NULL, 0, PAGE_SIZE,
                                PROT_READ | PROT_EXEC,   <-- add PROT_SEAL
                                MAP_FIXED | MAP_PRIVATE, 0);
        }

I don't see the benefit of RWX page 0, which might make a null
pointers error to become executable for some code.

Best Regards,
-Jeff

>           Linus
Jeff Xu Feb. 2, 2024, 3:24 a.m. UTC | #16
On Thu, Feb 1, 2024 at 5:06 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
> > As an outsider, Linux development is really strange:
> >
> > Two sub-features are being pushed very hard, and the primary developer
> > doesn't have code which uses either of them.  And once it goes in, it
> > cannot be changed.
> >
> > It's very different from my world, where the absolutely minimal
> > interface was written to apply to a whole operating system plus 10,000+
> > applications, and then took months of testing before it was approved for
> > inclusion.  And if it was subtly wrong, we would be able to change it.
>
> No, it's this "feature" submission that is strange to think that we
> don't need that.  We do need, and will require, an actual working
> userspace something to use it, otherwise as you say, there's no way to
> actually know if it works properly or not and we can't change it once we
> accept it.
>
> So along those lines, Jeff, do you have a pointer to the Chrome patches,
> or glibc patches, that use this new interface that proves that it
> actually works?  Those would be great to see to at least verify it's
> been tested in a real-world situation and actually works for your use
> case.
>
The MAP_SEALABLE is raised because of other concerns not related to libc.

The patch Stephan developed was based on V1 of the patch, IIRC, which
is really ancient, and it is not based on MAP_SEALABLE, which is a
more recent development entirely from me.

I don't see unresolvable problems  with glibc though,  E.g. For the
elf case (binfmt_elf.c), there are two places I need to add
MAP_SEALABLE, then the memory  to user space is marked with sealable.
There might be cases where glibc needs to add MAP_SEALABLE it uses
mmap(FIXED) to split the memory.

If the decision of MAP_SELABLE depends on the glibc case being able to
use it, we can develop such a patch, but it will take a while, say a
few weeks to months, due to vacation, work load, etc.

Best Regards,
-Jeff

> thanks,
>
> greg k-h
Linus Torvalds Feb. 2, 2024, 3:29 a.m. UTC | #17
On Thu, 1 Feb 2024 at 19:24, Jeff Xu <jeffxu@chromium.org> wrote:
>
> The patch Stephan developed was based on V1 of the patch, IIRC, which
> is really ancient, and it is not based on MAP_SEALABLE, which is a
> more recent development entirely from me.

So the problem with this whole patch series from the very beginning
was that it was very specialized, and COMPLETELY OVER-ENGINEERED.

It got simpler at one point. And then you started adding these
features that have absolutely no reason for them. Again.

It's frustrating. And it's not making it more likely to be ever merged.

               Linus
Jeff Xu Feb. 2, 2024, 3:46 a.m. UTC | #18
On Thu, Feb 1, 2024 at 7:29 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 1 Feb 2024 at 19:24, Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > The patch Stephan developed was based on V1 of the patch, IIRC, which
> > is really ancient, and it is not based on MAP_SEALABLE, which is a
> > more recent development entirely from me.
>
> So the problem with this whole patch series from the very beginning
> was that it was very specialized, and COMPLETELY OVER-ENGINEERED.
>
> It got simpler at one point. And then you started adding these
> features that have absolutely no reason for them. Again.
>
> It's frustrating. And it's not making it more likely to be ever merged.
>
I'm sorry for over-thinking.
Remove the MAP_SEALABLE it is then.

Keep with mseal(addr,len,0) only ?

-Jeff
>
Theo de Raadt Feb. 2, 2024, 4:05 a.m. UTC | #19
Jeff Xu <jeffxu@google.com> wrote:

> To me, the most important thing is to deliver a feature that's easy to
> use and works well. I don't want users to mess things up, so if I'm
> the one giving them the tools, I'm going to make sure they have all
> the information they need and that there are safeguards in place.
> 
> e.g. considering the following user case:
> 1> a security sensitive data is allocated from heap, using malloc,
> from the software component A, and filled with information.
> 2> software component B then uses mprotect to change it to RO, and
> seal it using mseal().

  p = malloc(80);
  mprotect(p & ~4095, 4096, PROT_NONE);
  free(p);

Will you save such a developer also?  No.

Since the same problem you describe already exists with mprotect() what
does mseal() even have to do with your proposal?

What about this?

  p = malloc(80);
  munmap(p & ~4095, 4096);
  free(p);

And since it is not sealed, how about madvise operations on a proper
non-malloc memory allocation?  Well, the process smashes it's own
memory.  And why is it not sealed?  You make it harder to seal memory!

How about this?

  p = malloc(80);
  bzero(p, 100000;

Yes it is a buffer overflow.  But this is all the same class of software
problem:

Memory belongs to processes, which belongs to the program, which is coded
by the programmer, who has to learn to be careful and handle the memory correctly.

mseal() / mimmutable() add *no new expectation* to a careful programmer,
because they expected to only use it on memory that they *promise will never
be de-allocated or re-permissioned*.

What you are proposing is not a "mitigation", it entirely cripples the
proposed subsystem because you are afraid of it; because you have cloned a
memory subsystem primitive you don't fully understand; and this is because
you've not seen a complete operating system using it.

When was the last time you developed outside of Chrome?

This is systems programming.  The kernel supports all the programs, not
just the one holy program from god.
Jeff Xu Feb. 2, 2024, 4:54 a.m. UTC | #20
On Thu, Feb 1, 2024 at 8:05 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@google.com> wrote:
>
> > To me, the most important thing is to deliver a feature that's easy to
> > use and works well. I don't want users to mess things up, so if I'm
> > the one giving them the tools, I'm going to make sure they have all
> > the information they need and that there are safeguards in place.
> >
> > e.g. considering the following user case:
> > 1> a security sensitive data is allocated from heap, using malloc,
> > from the software component A, and filled with information.
> > 2> software component B then uses mprotect to change it to RO, and
> > seal it using mseal().
>
>   p = malloc(80);
>   mprotect(p & ~4095, 4096, PROT_NONE);
>   free(p);
>
> Will you save such a developer also?  No.
>
> Since the same problem you describe already exists with mprotect() what
> does mseal() even have to do with your proposal?
>
> What about this?
>
>   p = malloc(80);
>   munmap(p & ~4095, 4096);
>   free(p);
>
> And since it is not sealed, how about madvise operations on a proper
> non-malloc memory allocation?  Well, the process smashes it's own
> memory.  And why is it not sealed?  You make it harder to seal memory!
>
> How about this?
>
>   p = malloc(80);
>   bzero(p, 100000;
>
> Yes it is a buffer overflow.  But this is all the same class of software
> problem:
>
> Memory belongs to processes, which belongs to the program, which is coded
> by the programmer, who has to learn to be careful and handle the memory correctly.
>
> mseal() / mimmutable() add *no new expectation* to a careful programmer,
> because they expected to only use it on memory that they *promise will never
> be de-allocated or re-permissioned*.
>
> What you are proposing is not a "mitigation", it entirely cripples the
> proposed subsystem because you are afraid of it; because you have cloned a
> memory subsystem primitive you don't fully understand; and this is because
> you've not seen a complete operating system using it.
>
> When was the last time you developed outside of Chrome?
>
> This is systems programming.  The kernel supports all the programs, not
> just the one holy program from god.
>
Even without free.
I personally do not like the heap getting sealed like that.

Component A.
p=malloc(4096);
writing something to p.

Component B:
mprotect(p,4096, RO)
mseal(p,4096)

This will split the heap VMA, and prevent the heap from shrinking, if
this is in a frequent code path, then it might hurt the process's
memory usage.

The existing code is more likely to use malloc than mmap(), so it is
easier for dev to seal a piece of data belonging to another component.
I hope this pattern is not wide-spreading.

The ideal way will be just changing the library A to use mmap.
Theo de Raadt Feb. 2, 2024, 5 a.m. UTC | #21
Jeff Xu <jeffxu@chromium.org> wrote:

> Even without free.
> I personally do not like the heap getting sealed like that.
> 
> Component A.
> p=malloc(4096);
> writing something to p.
> 
> Component B:
> mprotect(p,4096, RO)
> mseal(p,4096)
> 
> This will split the heap VMA, and prevent the heap from shrinking, if
> this is in a frequent code path, then it might hurt the process's
> memory usage.
> 
> The existing code is more likely to use malloc than mmap(), so it is
> easier for dev to seal a piece of data belonging to another component.
> I hope this pattern is not wide-spreading.
> 
> The ideal way will be just changing the library A to use mmap.

I think you are lacking some test programs to see how it actually
behaves; the effect is worse than you think, and the impact is immediately
visible to the programmer, and the lesson is clear:

	you can only seal objects which you gaurantee never get recycled.

	Pushing a sealed object back into reuse is a disasterous bug.

	Noone should call this interface, unless they understand that.

I'll say again, you don't have a test program for various allocators to
understand how it behaves.  The failure modes described in your docuemnts
are not correct.
Liam R. Howlett Feb. 2, 2024, 3:13 p.m. UTC | #22
* Jeff Xu <jeffxu@google.com> [240201 22:15]:
> On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > > <Liam.Howlett@oracle.com> wrote:
> > > >
> >
> > Having to opt-in to allowing mseal will probably not work well.
> I'm leaving the opt-in discussion in Linus's thread.
> 
> > Initial library mappings happen in one huge chunk then it's cut up into
> > smaller VMAs, at least that's what I see with my maple tree tracing.  If
> > you opt-in, then the entire library will have to opt-in and so the
> > 'discourage inadvertent sealing' argument is not very strong.
> >
> Regarding "The initial library mappings happen in one huge chunk then
> it is cut up into smaller VMAS", this is not a problem.
> 
> As example of elf loading (fs/binfmt_elf.c), there is just a few
> places to pass in what type of memory to be allocated, e.g.
> MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can  add MAP_SEALABLE at those
> places.
> If glic does additional splitting on the memory range, by using
> mprotect(), then the MAP_SEALABLE is automatically applied after
> splitting.
> If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).

You are adding a flag that requires a new glibc.  When I try to point
out how this is unnecessary and excessive, you tell me it's fine and
probably not a whole lot of work.

This isn't working with developers, you are dismissing the developers
who are trying to help you.

Can you please:

Provide code that uses this feature.

Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
32 VMAs.

Provide code that tests and checks the failure paths.  Failures at the
start, middle, and end of the modifications.

Document what happens in those failure paths.

And, most importantly: keep an open mind and allow your opinion to
change when presented with new information.

All of these things are to help you.  We need to know what needs fixing
so you can be successful.


Thanks,
Liam
Greg Kroah-Hartman Feb. 2, 2024, 3:18 p.m. UTC | #23
On Thu, Feb 01, 2024 at 07:24:02PM -0800, Jeff Xu wrote:
> On Thu, Feb 1, 2024 at 5:06 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >
> > On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote:
> > > As an outsider, Linux development is really strange:
> > >
> > > Two sub-features are being pushed very hard, and the primary developer
> > > doesn't have code which uses either of them.  And once it goes in, it
> > > cannot be changed.
> > >
> > > It's very different from my world, where the absolutely minimal
> > > interface was written to apply to a whole operating system plus 10,000+
> > > applications, and then took months of testing before it was approved for
> > > inclusion.  And if it was subtly wrong, we would be able to change it.
> >
> > No, it's this "feature" submission that is strange to think that we
> > don't need that.  We do need, and will require, an actual working
> > userspace something to use it, otherwise as you say, there's no way to
> > actually know if it works properly or not and we can't change it once we
> > accept it.
> >
> > So along those lines, Jeff, do you have a pointer to the Chrome patches,
> > or glibc patches, that use this new interface that proves that it
> > actually works?  Those would be great to see to at least verify it's
> > been tested in a real-world situation and actually works for your use
> > case.
> >
> The MAP_SEALABLE is raised because of other concerns not related to libc.
> 
> The patch Stephan developed was based on V1 of the patch, IIRC, which
> is really ancient, and it is not based on MAP_SEALABLE, which is a
> more recent development entirely from me.
> 
> I don't see unresolvable problems  with glibc though,  E.g. For the
> elf case (binfmt_elf.c), there are two places I need to add
> MAP_SEALABLE, then the memory  to user space is marked with sealable.
> There might be cases where glibc needs to add MAP_SEALABLE it uses
> mmap(FIXED) to split the memory.
> 
> If the decision of MAP_SELABLE depends on the glibc case being able to
> use it, we can develop such a patch, but it will take a while, say a
> few weeks to months, due to vacation, work load, etc.

There's no rush here, and no deadlines in kernel development.  If you
don't have a working userspace user for your new feature(s), there is no
way we can accept the changes to the kernel (and hint, you don't want us
to either...)

good luck!

greg k-h
Theo de Raadt Feb. 2, 2024, 5:05 p.m. UTC | #24
Another interaction to consider is sigaltstack().

In OpenBSD, sigaltstack() forces MAP_STACK onto the specified
(pre-allocated) region, because on kernel-entry we require the "sp"
register to point to a MAP_STACK region (this severely damages ROP pivot
methods).  Linux does not have MAP_STACK enforcement (yet), but one day
someone may try to do that work.

This interacted poorly with mimmutable() because some applications
allocate the memory being provided poorly.  I won't get into the details
unless pushed, because what we found makes me upset.  Over the years,
we've upstreamed diffs to applications to resolve all the nasty
allocation patterns.  I think the software ecosystem is now mostly
clean.

I suggest someone in Linux look into whether sigaltstack() is a mseal()
bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the
correct strategy.

This is our documented strategy:

     On OpenBSD some additional restrictions prevent dangerous address space
     modifications.  The proposed space at ss_sp is verified to be
     contiguously mapped for read-write permissions (no execute) and incapable
     of syscall entry (see msyscall(2)).  If those conditions are met, a page-
     aligned inner region will be freshly mapped (all zero) with MAP_STACK
     (see mmap(2)), destroying the pre-existing data in the region.  Once the
     sigaltstack is disabled, the MAP_STACK attribute remains on the memory,
     so it is best to deallocate the memory via a method that results in
     munmap(2).

OK, I better provide the details of what people were doing.
sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the
stack, we even found one creating a sigaltstack inside a buffer on a
pthread stack.  We told everyone to use mmap() and munmap(), with MAP_STACK
if #ifdef MAP_STACK finds a definition.
Jeff Xu Feb. 2, 2024, 5:24 p.m. UTC | #25
On Fri, Feb 2, 2024 at 7:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@google.com> [240201 22:15]:
> > On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > > > <Liam.Howlett@oracle.com> wrote:
> > > > >
> > >
> > > Having to opt-in to allowing mseal will probably not work well.
> > I'm leaving the opt-in discussion in Linus's thread.
> >
> > > Initial library mappings happen in one huge chunk then it's cut up into
> > > smaller VMAs, at least that's what I see with my maple tree tracing.  If
> > > you opt-in, then the entire library will have to opt-in and so the
> > > 'discourage inadvertent sealing' argument is not very strong.
> > >
> > Regarding "The initial library mappings happen in one huge chunk then
> > it is cut up into smaller VMAS", this is not a problem.
> >
> > As example of elf loading (fs/binfmt_elf.c), there is just a few
> > places to pass in what type of memory to be allocated, e.g.
> > MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can  add MAP_SEALABLE at those
> > places.
> > If glic does additional splitting on the memory range, by using
> > mprotect(), then the MAP_SEALABLE is automatically applied after
> > splitting.
> > If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).
>
> You are adding a flag that requires a new glibc.  When I try to point
> out how this is unnecessary and excessive, you tell me it's fine and
> probably not a whole lot of work.
>
> This isn't working with developers, you are dismissing the developers
> who are trying to help you.
>
> Can you please:
>
> Provide code that uses this feature.
>
> Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
> 32 VMAs.
>
I will prepare for the benchmark tests.

> Provide code that tests and checks the failure paths.  Failures at the
> start, middle, and end of the modifications.
>
Regarding, "Failures at the start, middle, and end of the modifications."

With the current implementation, e.g. it checks if the sealing is
applied before actual modification of VMAs, so partial modifications
are avoided in mprotect, mremap, munmap.

There are test cases in the selftests to cover the failure path,
including the beginning, middle and end of VMAs.
test_seal_unmapped_start
test_seal_unmapped_middle
test_seal_unmapped_end
test_seal_invalid_input
test_seal_start_mprotect
test_seal_end_mprotect
etc.

Are those what you are looking for ?

> Document what happens in those failure paths.
>
> And, most importantly: keep an open mind and allow your opinion to
> change when presented with new information.
>
> All of these things are to help you.  We need to know what needs fixing
> so you can be successful.
>
Thanks for those feedbacks.

I sincerely wish for more of those help so this syscall can be useful.

Thanks.
Best Regards,
-Jeff

>
> Thanks,
> Liam
Jeff Xu Feb. 2, 2024, 5:58 p.m. UTC | #26
On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Jeff Xu <jeffxu@chromium.org> wrote:
>
> > Even without free.
> > I personally do not like the heap getting sealed like that.
> >
> > Component A.
> > p=malloc(4096);
> > writing something to p.
> >
> > Component B:
> > mprotect(p,4096, RO)
> > mseal(p,4096)
> >
> > This will split the heap VMA, and prevent the heap from shrinking, if
> > this is in a frequent code path, then it might hurt the process's
> > memory usage.
> >
> > The existing code is more likely to use malloc than mmap(), so it is
> > easier for dev to seal a piece of data belonging to another component.
> > I hope this pattern is not wide-spreading.
> >
> > The ideal way will be just changing the library A to use mmap.
>
> I think you are lacking some test programs to see how it actually
> behaves; the effect is worse than you think, and the impact is immediately
> visible to the programmer, and the lesson is clear:
>
>         you can only seal objects which you gaurantee never get recycled.
>
>         Pushing a sealed object back into reuse is a disasterous bug.
>
>         Noone should call this interface, unless they understand that.
>
> I'll say again, you don't have a test program for various allocators to
> understand how it behaves.  The failure modes described in your docuemnts
> are not correct.
>
I understand what you mean: I will add that part to the document:
Try to recycle a sealed memory is disastrous, e.g.
p=malloc(4096);
mprotect(p,4096,RO)
mseal(p,4096)
free(p);

My point is:
I think sealing an object from the heap is a bad pattern in general,
even dev doesn't free it. That was one of the reasons for the sealable
flag, I hope saying this doesn't be perceived as looking for excuses.

>
Pedro Falcato Feb. 2, 2024, 6:51 p.m. UTC | #27
On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > > Even without free.
> > > I personally do not like the heap getting sealed like that.
> > >
> > > Component A.
> > > p=malloc(4096);
> > > writing something to p.
> > >
> > > Compohave nent B:
> > > mprotect(p,4096, RO)
> > > mseal(p,4096)
> > >
> > > This will split the heap VMA, and prevent the heap from shrinking, if
> > > this is in a frequent code path, then it might hurt the process's
> > > memory usage.
> > >
> > > The existing code is more likely to use malloc than mmap(), so it is
> > > easier for dev to seal a piece of data belonging to another component.
> > > I hope this pattern is not wide-spreading.
> > >
> > > The ideal way will be just changing the library A to use mmap.
> >
> > I think you are lacking some test programs to see how it actually
> > behaves; the effect is worse than you think, and the impact is immediately
> > visible to the programmer, and the lesson is clear:
> >
> >         you can only seal objects which you gaurantee never get recycled.
> >
> >         Pushing a sealed object back into reuse is a disasterous bug.
> >
> >         Noone should call this interface, unless they understand that.
> >
> > I'll say again, you don't have a test program for various allocators to
> > understand how it behaves.  The failure modes described in your docuemnts
> > are not correct.
> >
> I understand what you mean: I will add that part to the document:
> Try to recycle a sealed memory is disastrous, e.g.
> p=malloc(4096);
> mprotect(p,4096,RO)
> mseal(p,4096)
> free(p);
>
> My point is:
> I think sealing an object from the heap is a bad pattern in general,
> even dev doesn't free it. That was one of the reasons for the sealable
> flag, I hope saying this doesn't be perceived as looking for excuses.

The point you're missing is that adding MAP_SEALABLE reduces
composability. With MAP_SEALABLE, everything that mmaps some part of
the address space that may ever be sealed will need to be modified to
know about MAP_SEALABLE.

Say you did the same thing for mprotect. MAP_PROTECT would control the
mprotectability of the map. You'd stop:

p = malloc(4096);
mprotect(p, 4096, PROT_READ);
free(p);

! But you'd need to change every spot that mmap()'s something to know
about and use MAP_PROTECT: all "producers" of mmap memory would need
to know about the consumers doing mprotect(). So now either all mmap()
callers mindlessly add MAP_PROTECT out of fear the consumers do
mprotect (and you gain nothing from MAP_PROTECT), or the mmap()
callers need to know the consumers call mprotect(), and thus you
introduce a huge layering violation (and you actually lose from having
MAP_PROTECT).

Hopefully you can map the above to MAP_SEALABLE. Or to any other m*()
operation. For example, if chrome runs on an older glibc that does not
know about MAP_SEALABLE, it will not be able to mseal() its own shared
libraries' .text (even if, yes, that should ideally be left to ld.so).

IMO, UNIX API design has historically mostly been "play stupid games,
win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
work. If you close stdout (and don't dup/reopen something to stdout)
and printf(), things will break, and you get to keep both pieces.
There's no O_CLOSEABLE, just as there's no O_DUPABLE.
Liam R. Howlett Feb. 2, 2024, 7:21 p.m. UTC | #28
* Jeff Xu <jeffxu@chromium.org> [240202 12:24]:

...

> > Provide code that uses this feature.

Please do this too :)

> >
> > Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
> > 32 VMAs.
> >
> I will prepare for the benchmark tests.

Thank you, please also include runs of calls that you are modifying for
checking for mseal() as we are adding loops there.

> 
> > Provide code that tests and checks the failure paths.  Failures at the
> > start, middle, and end of the modifications.
> >
> Regarding, "Failures at the start, middle, and end of the modifications."
> 
> With the current implementation, e.g. it checks if the sealing is
> applied before actual modification of VMAs, so partial modifications
> are avoided in mprotect, mremap, munmap.
> 
> There are test cases in the selftests to cover the failure path,
> including the beginning, middle and end of VMAs.
> test_seal_unmapped_start
> test_seal_unmapped_middle
> test_seal_unmapped_end
> test_seal_invalid_input
> test_seal_start_mprotect
> test_seal_end_mprotect
> etc.
> 
> Are those what you are looking for ?

Those are certainly good, but we need more checking in there.  You have
a seal_split test that splits the vma by mseal but you don't check the
flags on the VMAs.

What I'm more concerned about is what happens if you call mseal() on a
range and it can mseal a portion.  Like, what happens to the first vma
in your test_seal_unmapped_middle case?  I see it returns an error, but
is the first VMA mseal()'ed? (no it's not, but test that)

What about the other system calls that will be denied on an mseal() VMA?
Do they still behave the same?  do_mprotect_pkey() will break out of the
loop on the first error it sees - but it has modified some VMAs up to
that point, I believe?  You have changed this to abort before anything
is modified.  This is probably acceptable because it won't affect
existing applications unless they start using mseal(), but that's just
my opinion.

It would be good to state the change in behaviour because it is changing
the fundamental model of changing mprotect/madvise until an issue is
hit.  I think you are covering this by "it blocks X" but it's doing more
than, say, a flag verification.  One could reasonably assume this is
just another flag verification.

> 
> > Document what happens in those failure paths.

I'd like to know how this affects other system calls in the partial
success cases/return error cases.  Some will now return new error codes
and some may change the behaviour.

It may even be okay to allow munmap() to split VMAs at the start/end of
the region and fail to munmap because some VMA in the middle is
mseal()'ed - but maybe not?  I haven't put a whole lot of thought into
it.

Thanks,
Liam
Theo de Raadt Feb. 2, 2024, 7:32 p.m. UTC | #29
> What I'm more concerned about is what happens if you call mseal() on a
> range and it can mseal a portion.  Like, what happens to the first vma
> in your test_seal_unmapped_middle case?  I see it returns an error, but
> is the first VMA mseal()'ed? (no it's not, but test that)

That is correct, Liam.

Unix system calls must be atomic.

They either return an error, and that is a promise they made no changes.

Or they do the work required, and then return success.

In OpenBSD, all mimmutable() aspects were carefully studied to gaurantee
this behaviour.

I am not an expert in the Linux kernel to make the assessment; someone
who is qualified must make that assessment.  Fuzzing with tests is a good
way to judge it simpler.
Jeff Xu Feb. 2, 2024, 8:14 p.m. UTC | #30
On Fri, Feb 2, 2024 at 11:21 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@chromium.org> [240202 12:24]:
>
> ...
>
> > > Provide code that uses this feature.
>
> Please do this too :)
>
Yes. Will do.


> > >
> > > Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and
> > > 32 VMAs.
> > >
> > I will prepare for the benchmark tests.
>
> Thank you, please also include runs of calls that you are modifying for
> checking for mseal() as we are adding loops there.
>
It will includes mmap/mremap/mprotect/munmap

> >
> > > Provide code that tests and checks the failure paths.  Failures at the
> > > start, middle, and end of the modifications.
> > >
> > Regarding, "Failures at the start, middle, and end of the modifications."
> >
> > With the current implementation, e.g. it checks if the sealing is
> > applied before actual modification of VMAs, so partial modifications
> > are avoided in mprotect, mremap, munmap.
> >
> > There are test cases in the selftests to cover the failure path,
> > including the beginning, middle and end of VMAs.
> > test_seal_unmapped_start
> > test_seal_unmapped_middle
> > test_seal_unmapped_end
> > test_seal_invalid_input
> > test_seal_start_mprotect
> > test_seal_end_mprotect
> > etc.
> >
> > Are those what you are looking for ?
>
> Those are certainly good, but we need more checking in there.  You have
> a seal_split test that splits the vma by mseal but you don't check the
> flags on the VMAs.
>
I can add the flag check.

> What I'm more concerned about is what happens if you call mseal() on a
> range and it can mseal a portion.  Like, what happens to the first vma
> in your test_seal_unmapped_middle case?  I see it returns an error, but
> is the first VMA mseal()'ed? (no it's not, but test that)
>
The first VMA is not sealed.
That was covered by test_seal_mprotect_two_vma_with_gap.

> What about the other system calls that will be denied on an mseal() VMA?
The other system call's behavior is kept as is, if the memory is not sealed.

> Do they still behave the same?  do_mprotect_pkey() will break out of the
> loop on the first error it sees - but it has modified some VMAs up to
> that point, I believe?
Yes. The description about do_mprotect_pkey() is correct.

> You have changed this to abort before anything
> is modified.  This is probably acceptable because it won't affect
> existing applications unless they start using mseal(), but that's just
> my opinion.
>
To make sure this, the test was written with sealing=false, those
tests are passed in the main (before applying my patch) to make sure
the test is correct.

> It would be good to state the change in behaviour because it is changing
> the fundamental model of changing mprotect/madvise until an issue is
> hit.  I think you are covering this by "it blocks X" but it's doing more
> than, say, a flag verification.  One could reasonably assume this is
> just another flag verification.
>
Will add more in documentation.

> >
> > > Document what happens in those failure paths.
>
> I'd like to know how this affects other system calls in the partial
> success cases/return error cases.  Some will now return new error codes
> and some may change the behaviour.
>
For the mapping that is not sealed, all remain unchanged, including
the error handling path.
For the mapping that is sealed, EPREM is returned if the sealing check
fails, and all of VMAs remain unchanged.

> It may even be okay to allow munmap() to split VMAs at the start/end of
> the region and fail to munmap because some VMA in the middle is
> mseal()'ed - but maybe not?  I haven't put a whole lot of thought into
> it.
If you are referring to something like below
[unmapped][map1][unmapped][map2][unmapped][map3][unmapped]
and map2 is sealed.

unmap(start of map1,end of map3) will fail.
mmap/mremap/unmap/mprotect on an address range that includes map2 will
fail with EPERM, with map1/map2/map3 unchanged.

Thanks
-Jeff

>
> Thanks,
> Liam
Linus Torvalds Feb. 2, 2024, 8:36 p.m. UTC | #31
On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Unix system calls must be atomic.
>
> They either return an error, and that is a promise they made no changes.

That's actually not true, and never has been.

It's a good thing to aim for, but several errors means "some or all
may have been done".

EFAULT (for various system calls), ENOMEM and other errors are all
things that can happen after some of the system call has already been
done, and the rest failed.

There are lots of examples, but to pick one obvious VM example,
something like mlock() may well return an error after the area has
been successfully locked, but then the population of said pages failed
for some reason.

Of course, implementations can differ, and POSIX sometimes has insane
language that is actively incorrect.

Furthermore, the definition of "atomic" is unclear. For example, POSIX
claims that a "write()" system call is one atomic thing for regular
files, and some people think that means that you see all or nothing.
That's simply not true, and you'll see the write progress in various
indirect ways (look at intermediate file size with 'stat', look at
intermediate contents with 'mmap' etc etc).

So I agree that atomicity is something that people should always
*strive* for, but it's not some kind of final truth or absolute
requirement.

In the specific case of mseal(), I suspect there are very few reasons
ever *not* to be atomic, so in this particular context atomicity is
likely always something that should be guaranteed. But I just wanted
to point out that it's most definitely not a black-and-white issue in
the general case.

             Linus
Jeff Xu Feb. 2, 2024, 8:57 p.m. UTC | #32
On Fri, Feb 2, 2024 at 12:37 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Unix system calls must be atomic.
> >
> > They either return an error, and that is a promise they made no changes.
>
> That's actually not true, and never has been.
>
> It's a good thing to aim for, but several errors means "some or all
> may have been done".
>
> EFAULT (for various system calls), ENOMEM and other errors are all
> things that can happen after some of the system call has already been
> done, and the rest failed.
>
> There are lots of examples, but to pick one obvious VM example,
> something like mlock() may well return an error after the area has
> been successfully locked, but then the population of said pages failed
> for some reason.
>
> Of course, implementations can differ, and POSIX sometimes has insane
> language that is actively incorrect.
>
> Furthermore, the definition of "atomic" is unclear. For example, POSIX
> claims that a "write()" system call is one atomic thing for regular
> files, and some people think that means that you see all or nothing.
> That's simply not true, and you'll see the write progress in various
> indirect ways (look at intermediate file size with 'stat', look at
> intermediate contents with 'mmap' etc etc).
>
> So I agree that atomicity is something that people should always
> *strive* for, but it's not some kind of final truth or absolute
> requirement.
>
> In the specific case of mseal(), I suspect there are very few reasons
> ever *not* to be atomic, so in this particular context atomicity is
> likely always something that should be guaranteed. But I just wanted
> to point out that it's most definitely not a black-and-white issue in
> the general case.
>
Thanks.
At least I got this part done right for mseal() :-)

-Jeff


>              Linus
>
Jeff Xu Feb. 2, 2024, 9:02 p.m. UTC | #33
On Fri, Feb 2, 2024 at 9:05 AM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Another interaction to consider is sigaltstack().
>
> In OpenBSD, sigaltstack() forces MAP_STACK onto the specified
> (pre-allocated) region, because on kernel-entry we require the "sp"
> register to point to a MAP_STACK region (this severely damages ROP pivot
> methods).  Linux does not have MAP_STACK enforcement (yet), but one day
> someone may try to do that work.
>
> This interacted poorly with mimmutable() because some applications
> allocate the memory being provided poorly.  I won't get into the details
> unless pushed, because what we found makes me upset.  Over the years,
> we've upstreamed diffs to applications to resolve all the nasty
> allocation patterns.  I think the software ecosystem is now mostly
> clean.
>
> I suggest someone in Linux look into whether sigaltstack() is a mseal()
> bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the
> correct strategy.
>

Thanks for bringing this up. I will follow up on sigaltstack() in Linux.

> This is our documented strategy:
>
>      On OpenBSD some additional restrictions prevent dangerous address space
>      modifications.  The proposed space at ss_sp is verified to be
>      contiguously mapped for read-write permissions (no execute) and incapable
>      of syscall entry (see msyscall(2)).  If those conditions are met, a page-
>      aligned inner region will be freshly mapped (all zero) with MAP_STACK
>      (see mmap(2)), destroying the pre-existing data in the region.  Once the
>      sigaltstack is disabled, the MAP_STACK attribute remains on the memory,
>      so it is best to deallocate the memory via a method that results in
>      munmap(2).
>
> OK, I better provide the details of what people were doing.
> sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the
> stack, we even found one creating a sigaltstack inside a buffer on a
> pthread stack.  We told everyone to use mmap() and munmap(), with MAP_STACK
> if #ifdef MAP_STACK finds a definition.
>
Liam R. Howlett Feb. 2, 2024, 9:18 p.m. UTC | #34
* Linus Torvalds <torvalds@linux-foundation.org> [240202 15:37]:
> On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Unix system calls must be atomic.
> >
> > They either return an error, and that is a promise they made no changes.
> 
> That's actually not true, and never has been.

...

> 
> In the specific case of mseal(), I suspect there are very few reasons
> ever *not* to be atomic, so in this particular context atomicity is
> likely always something that should be guaranteed. But I just wanted
> to point out that it's most definitely not a black-and-white issue in
> the general case.

There will be a larger performance cost to checking up front without
allowing the partial completion.  I don't expect these to be high, but
it's something to keep in mind if we are okay with the flexibility and
less atomic operation.

Thanks,
Liam
Jeff Xu Feb. 2, 2024, 9:20 p.m. UTC | #35
On Fri, Feb 2, 2024 at 10:52 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote:
> > >
> > > Jeff Xu <jeffxu@chromium.org> wrote:
> > >
> > > > Even without free.
> > > > I personally do not like the heap getting sealed like that.
> > > >
> > > > Component A.
> > > > p=malloc(4096);
> > > > writing something to p.
> > > >
> > > > Compohave nent B:
> > > > mprotect(p,4096, RO)
> > > > mseal(p,4096)
> > > >
> > > > This will split the heap VMA, and prevent the heap from shrinking, if
> > > > this is in a frequent code path, then it might hurt the process's
> > > > memory usage.
> > > >
> > > > The existing code is more likely to use malloc than mmap(), so it is
> > > > easier for dev to seal a piece of data belonging to another component.
> > > > I hope this pattern is not wide-spreading.
> > > >
> > > > The ideal way will be just changing the library A to use mmap.
> > >
> > > I think you are lacking some test programs to see how it actually
> > > behaves; the effect is worse than you think, and the impact is immediately
> > > visible to the programmer, and the lesson is clear:
> > >
> > >         you can only seal objects which you gaurantee never get recycled.
> > >
> > >         Pushing a sealed object back into reuse is a disasterous bug.
> > >
> > >         Noone should call this interface, unless they understand that.
> > >
> > > I'll say again, you don't have a test program for various allocators to
> > > understand how it behaves.  The failure modes described in your docuemnts
> > > are not correct.
> > >
> > I understand what you mean: I will add that part to the document:
> > Try to recycle a sealed memory is disastrous, e.g.
> > p=malloc(4096);
> > mprotect(p,4096,RO)
> > mseal(p,4096)
> > free(p);
> >
> > My point is:
> > I think sealing an object from the heap is a bad pattern in general,
> > even dev doesn't free it. That was one of the reasons for the sealable
> > flag, I hope saying this doesn't be perceived as looking for excuses.
>
> The point you're missing is that adding MAP_SEALABLE reduces
> composability. With MAP_SEALABLE, everything that mmaps some part of
> the address space that may ever be sealed will need to be modified to
> know about MAP_SEALABLE.
>
> Say you did the same thing for mprotect. MAP_PROTECT would control the
> mprotectability of the map. You'd stop:
>
> p = malloc(4096);
> mprotect(p, 4096, PROT_READ);
> free(p);
>
> ! But you'd need to change every spot that mmap()'s something to know
> about and use MAP_PROTECT: all "producers" of mmap memory would need
> to know about the consumers doing mprotect(). So now either all mmap()
> callers mindlessly add MAP_PROTECT out of fear the consumers do
> mprotect (and you gain nothing from MAP_PROTECT), or the mmap()
> callers need to know the consumers call mprotect(), and thus you
> introduce a huge layering violation (and you actually lose from having
> MAP_PROTECT).
>
> Hopefully you can map the above to MAP_SEALABLE. Or to any other m*()
> operation. For example, if chrome runs on an older glibc that does not
> know about MAP_SEALABLE, it will not be able to mseal() its own shared
> libraries' .text (even if, yes, that should ideally be left to ld.so).
>
I think I have heard enough complaints about MAP_SEALABLE from Linux
developers and Linus in the last two days to convince myself that it
is a bad idea :)

For the last time, I was trying to limit the scope of mseal() limited
to two known cases. And MAP_SEALABLE is a reversible decision, a
system ctrl can turn it off, or we can obsolete it in future. (this
was mentioned in the document of V8).

I will rest my case. Obviously from the feedback,  it is loud and
clear that we want to be able to seal all the memory.

> IMO, UNIX API design has historically mostly been "play stupid games,
> win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
> work. If you close stdout (and don't dup/reopen something to stdout)
> and printf(), things will break, and you get to keep both pieces.
> There's no O_CLOSEABLE, just as there's no O_DUPABLE.
>
> --
> Pedro
Linus Torvalds Feb. 2, 2024, 11:36 p.m. UTC | #36
On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> There will be a larger performance cost to checking up front without
> allowing the partial completion.

I suspect that for mseal(), the only half-way common case will be
sealing an area that is entirely contained within one vma.

So the cost will be the vma splitting (if it's not the whole vma), and
very unlikely to be any kind of "walk the vma's to check that they can
all be sealed" loop up-front.

We'll see, but that's my gut feel, at least.

               Linus
Liam R. Howlett Feb. 3, 2024, 4:45 a.m. UTC | #37
* Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]:
> On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > There will be a larger performance cost to checking up front without
> > allowing the partial completion.
> 
> I suspect that for mseal(), the only half-way common case will be
> sealing an area that is entirely contained within one vma.

Agreed.

> 
> So the cost will be the vma splitting (if it's not the whole vma), and
> very unlikely to be any kind of "walk the vma's to check that they can
> all be sealed" loop up-front.

That's the cost of calling mseal(), and I think that will be totally
reasonable.

I'm more concerned with the other calls that do affect more than one vma
that will now have to ensure there is not an mseal'ed vma among the
affected area.

As you pointed out, we don't do atomic updates and so we have to add a
loop at the beginning to check this new special case, which is what this
patch set does today.  That means we're going to be looping through
twice for any call that could fail if one is mseal'ed. This includes
munmap() and mprotect().

The impact will vary based on how many vma's are handled. I'd like some
numbers on this so we can see if it is a concern, which Jeff has agreed
to provide in the future - Thank you, Jeff.

It also means we're modifying the behaviour of those calls so they could
fail before anything changes (regardless of where the failure would
occur), and we could still fail later when another aspect of a vma would
cause a failure as we do today.  We are paying the price for a more
atomic update, but we aren't trying very hard to be atomic with our
updates - we don't have many (virtually no) vma checks before
modifications start.

For instance, we could move the mprotect check for map_deny_write_exec()
to the pre-update loop to make it more atomic in nature.  This one seems
somewhat related to mseal, so it would be better if they were both
checked atomic(ish) together.  Although, I wonder if the user visible
changes would be acceptable and worth the risk.

We will have two classes of updates to vma's: the more atomic view and
the legacy view.  The question of what happens when the two mix, or
where a specific check should go will get (more) confusing.

Thanks,
Liam
David Laight Feb. 4, 2024, 7:39 p.m. UTC | #38
...
> IMO, UNIX API design has historically mostly been "play stupid games,
> win stupid prizes", which is e.g: why things like close(STDOUT_FILENO)
> work. If you close stdout (and don't dup/reopen something to stdout)
> and printf(), things will break, and you get to keep both pieces.

That is pretty much why libraries must never use printf().
(Try telling that to people at work!)

In the days when processes could only have 20 files open
it was a much bigger problem.
You couldn't afford to not use 0, 1 and 2.
A certain daemon ended up using fd 1 as a pipe to another daemon.
Someone accidentally used printf() instead of fprintf() for a trace.
When the 10k stdio buffer filled the text got written to the pipe.
The expected fixed size message had a 32bit 'trailer' size.
Although no defined messages supported trailers the second daemon
synchronously discarded the trailer - with the expected side effect.

Wasn't my bug, and someone else found it, but I'd read the broken
code a few times without seeing the fubar.

Trouble is it all worked for quite a long time...

	David
 

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Suren Baghdasaryan Feb. 5, 2024, 10:13 p.m. UTC | #39
On Fri, Feb 2, 2024 at 8:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]:
> > On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > There will be a larger performance cost to checking up front without
> > > allowing the partial completion.
> >
> > I suspect that for mseal(), the only half-way common case will be
> > sealing an area that is entirely contained within one vma.
>
> Agreed.
>
> >
> > So the cost will be the vma splitting (if it's not the whole vma), and
> > very unlikely to be any kind of "walk the vma's to check that they can
> > all be sealed" loop up-front.
>
> That's the cost of calling mseal(), and I think that will be totally
> reasonable.
>
> I'm more concerned with the other calls that do affect more than one vma
> that will now have to ensure there is not an mseal'ed vma among the
> affected area.
>
> As you pointed out, we don't do atomic updates and so we have to add a
> loop at the beginning to check this new special case, which is what this
> patch set does today.  That means we're going to be looping through
> twice for any call that could fail if one is mseal'ed. This includes
> munmap() and mprotect().
>
> The impact will vary based on how many vma's are handled. I'd like some
> numbers on this so we can see if it is a concern, which Jeff has agreed
> to provide in the future - Thank you, Jeff.

Yes please. The additional walk Liam points to seems to be happening
even if we don't use mseal at all. Android apps often create thousands
of VMAs, so a small regression to a syscall like mprotect might cause
a very visible regression to app launch times (one of the key metrics
for Android). Having performance impact numbers here would be very
helpful.

>
> It also means we're modifying the behaviour of those calls so they could
> fail before anything changes (regardless of where the failure would
> occur), and we could still fail later when another aspect of a vma would
> cause a failure as we do today.  We are paying the price for a more
> atomic update, but we aren't trying very hard to be atomic with our
> updates - we don't have many (virtually no) vma checks before
> modifications start.
>
> For instance, we could move the mprotect check for map_deny_write_exec()
> to the pre-update loop to make it more atomic in nature.  This one seems
> somewhat related to mseal, so it would be better if they were both
> checked atomic(ish) together.  Although, I wonder if the user visible
> changes would be acceptable and worth the risk.
>
> We will have two classes of updates to vma's: the more atomic view and
> the legacy view.  The question of what happens when the two mix, or
> where a specific check should go will get (more) confusing.
>
> Thanks,
> Liam
>