Message ID | 20240131175027.3287009-1-jeffxu@chromium.org (mailing list archive) |
---|---|
Series | Introduce mseal |
Please add me to the Cc list of these patches. * jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]: > From: Jeff Xu <jeffxu@chromium.org> > > This patchset proposes a new mseal() syscall for the Linux kernel. > > In a nutshell, mseal() protects the VMAs of a given virtual memory > range against modifications, such as changes to their permission bits. > > Modern CPUs support memory permissions, such as the read/write (RW) > and no-execute (NX) bits. Linux has supported NX since the release of > kernel version 2.6.8 in August 2004 [1]. The memory permission feature > improves the security stance on memory corruption bugs, as an attacker > cannot simply write to arbitrary memory and point the code to it. The > memory must be marked with the X bit, or else an exception will occur. > Internally, the kernel maintains the memory permissions in a data > structure called VMA (vm_area_struct). mseal() additionally protects > the VMA itself against modifications of the selected seal type. ... The v8 cut Jonathan's email discussion [1] off and instead of replying there, I'm going to add my question here. The best plan to ensure it is a general safety measure for all of linux is to work with the community before it lands upstream. It's much harder to change functionality provided to users after it is upstream. I'm happy to hear google is super excited about sharing this, but so far, the community isn't as excited. It seems Theo has a lot of experience trying to add a feature very close to what you are doing and has real data on how this went [2]. Can we see if there is a solution that is, at least, different enough from what he tried to do for a shot of success? Do we have anyone in the toolchain groups that sees this working well? If this means Stephen needs to do something, can we get that to happen please? I mean, you specifically state that this is a 'very specific requirement' in your cover letter. Does this mean even other browsers have no use for it? 
I am very concerned this feature will land and have to be maintained by the core mm people for the one user it was specifically targeting. Can we also get some benchmarking on the impact of this feature? I believe my answer in v7 removed the worst offender, but since there is no benchmarking we really are guessing (educated or not, hard data would help). We still have an extra loop in madvise, mprotect_pkey, mremap_to (and mreamp syscall?). You also did not clean up the loop you copied from mlock, which I pointed out [3]. Stating that your copy/paste is easier to review is not sufficient to keep unneeded assignments around. [1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/ [2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/ [3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/
On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > Please add me to the Cc list of these patches. Ok. > > * jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]: > > From: Jeff Xu <jeffxu@chromium.org> > > > > This patchset proposes a new mseal() syscall for the Linux kernel. > > > > In a nutshell, mseal() protects the VMAs of a given virtual memory > > range against modifications, such as changes to their permission bits. > > > > Modern CPUs support memory permissions, such as the read/write (RW) > > and no-execute (NX) bits. Linux has supported NX since the release of > > kernel version 2.6.8 in August 2004 [1]. The memory permission feature > > improves the security stance on memory corruption bugs, as an attacker > > cannot simply write to arbitrary memory and point the code to it. The > > memory must be marked with the X bit, or else an exception will occur. > > Internally, the kernel maintains the memory permissions in a data > > structure called VMA (vm_area_struct). mseal() additionally protects > > the VMA itself is against modifications of the selected seal type. > > ... The v8 cut Jonathan's email discussion [1] off and > instead of > replying there, I'm going to add my question here. > > The best plan to ensure it is a general safety measure for all of linux > is to work with the community before it lands upstream. It's much > harder to change functionality provided to users after it is upstream. > I'm happy to hear google is super excited about sharing this, but so > far, the community isn't as excited. > > It seems Theo has a lot of experience trying to add a feature very close > to what you are doing and has real data on how this went [2]. Can we > see if there is a solution that is, at least, different enough from what > he tried to do for a shot of success? Do we have anyone in the > toolchain groups that sees this working well? If this means Stephen > needs to do something, can we get that to happen please? 
> For Theo's input from OpenBSD's perspective; IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same scope on what operations to seal, e.g. considering the progress made on both sides since the beginning of the RFC: - mseal(Linux): dropped "multiple-bit" approach. - mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED). The difference is in mmap(), i.e. - mseal(Linux): support of PROT_SEAL in mmap(). - mseal(Linux): use of MAP_SEALABLE in mmap(). I considered Theo's inputs from OpenBSD's perspective regarding the difference, and I wasn't convinced that Linux should remove these. In my view, those are two different kernels code, and the difference in Linux is not added without reasons (for MAP_SEALABLE, there is a note in the documentation section with details). I would love to hear more from Linux developers on this. > I mean, you specifically state that this is a 'very specific > requirement' in your cover letter. Does this mean even other browsers > have no use for it? > No, I don’t mean “other browsers have no use for it”. About specific requirements from Chrome, that refers to "The lifetime of those mappings are not tied to the lifetime of the process, which is not the case of libc" as in the cover letter. This addition to the cover letter was made in V3, thus, it might be beneficial to provide additional context to help answer the question. This patch series begins with multiple-bit approaches (v1,v2,v3), the rationale for this is that I am uncertain if Chrome's specific needs are common enough for other use cases. Consequently, I am unable to make this decision myself without input from the community. To accommodate this, multiple bits are selected initially due to their adaptability. Since V1, after hearing from the community, Chrome has changed its design (no longer relying on separating out mprotect), and Linus acknowledged the defect of madvise(DONOTNEED) [1]. 
With those inputs, today mseal() has a simple design that: - meet Chrome's specific needs. - meet Libc's needs. - Chrome's specific need doesn't interfere with Libc's. [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/ > I am very concerned this feature will land and have to be maintained by > the core mm people for the one user it was specifically targeting. > See above. This feature is not specifically targeting Chrome. > Can we also get some benchmarking on the impact of this feature? I > believe my answer in v7 removed the worst offender, but since there is > no benchmarking we really are guessing (educated or not, hard data would > help). We still have an extra loop in madvise, mprotect_pkey, mremap_to > (and mreamp syscall?). > Yes. There is an extra loop in mmap(FIXED), munmap(), madvise(DONOTNEED), mremap(), to emulate the VMAs for the given address range. I suspect the impact would be low, but having some hard data would be good. I will see what I can find to assist the perf testing. If you have a specific test suite in mind, I can also try it. > You also did not clean up the loop you copied from mlock, which I > pointed out [3]. Stating that your copy/paste is easier to review is > not sufficient to keep unneeded assignments around. > OK. > [1]. https://lore.kernel.org/linux-mm/87a5ong41h.fsf@meer.lwn.net/ > [2]. https://lore.kernel.org/linux-mm/86181.1705962897@cvs.openbsd.org/ > [3]. https://lore.kernel.org/linux-mm/20240124200628.ti327diy7arb7byb@revolver/
Jeff Xu <jeffxu@chromium.org> wrote:

> I considered Theo's inputs from OpenBSD's perspective regarding the
> difference, and I wasn't convinced that Linux should remove these. In
> my view, those are two different kernels code, and the difference in
> Linux is not added without reasons (for MAP_SEALABLE, there is a note
> in the documentation section with details).

That note is describing a fiction.

> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.

But I'll repeat for others to stop this train wreck:

1. When execve() maps a program's .data section, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE?

   Does the kernel seal the .data section? It cannot, because of RELRO and IFUNCS. Do you know what those are? (like in OpenBSD) the kernel cannot and will *not* seal the .data section, it lets later code do that.

2. When execve() maps a program's .bss section, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE?

   Does the kernel seal the .bss section? It cannot, because of RELRO and IFUNCS. Do you know what those are? (like in OpenBSD) the kernel cannot and will *not* seal the .bss section, it lets later code do that.

In the proposed diff, the kernel does not set MAP_SEALABLE on those regions.

How does a userland program seal the .data and .bss regions?

It cannot. It is too late to set MAP_SEALABLE, because the kernel already decided not to do it. So those regions cannot be sealed.

3. When execve() maps a program's stack, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE?

   In the proposed diff, the kernel does not set MAP_SEALABLE.

   You think you can seal the stack in the kernel?? Sorry to be the bearer of bad news, but glibc has code which on occasion will mprotect the stack executable.

   But if userland decides that mprotect case won't occur -- how does a userland program seal its stack? It is now too late to set MAP_SEALABLE.

   So the stack must remain unsealed.

4. What about the text segment?

5. Do you know what a text-relocation is? They are now rare, but there are still compile/linker stages which will produce them, and there is software which requires that to work. It means userland fixes its own .text, then calls mprotect. The kernel does not know if this will happen.

6. When execve() maps the .text segment, will it set MAP_SEALABLE? If it doesn't set it, userland cannot seal its text after it makes the decision to do so.

You can continue to extrapolate those same points for all other segments of a static binary, all segments of a dynamic binary, all segments of the shared library linker.

And then you can go further, and recognize the logic that will be needed in the shared library linker to *make the same decisions*.

In each case, the *decision* to make a mapping happens in one piece of code, and the decision to use and NOW SEAL THAT MAPPING, happens in a different piece of code.

The only answer to these problems will be to always set MAP_SEALABLE. To go through the entire Linux ecosystem, and change every call to mmap() to use this new MAP_SEALABLE flag, and it will look something like this:

    +#ifndef MAP_SEALABLE
    +#define MAP_SEALABLE 0
    +#endif
    -       ptr = mmap(...., MAP...
    +       ptr = mmap(...., MAP_SEALABLE | MAP...

Every single one of them, and you'll need to do it in the kernel.

If you had spent a second trying to make this work in a second piece of software, you would have realized that the ONLY way this could work is by adding a flag with the opposite meaning:

    MAP_NOTSEALABLE

But nothing will use that. I promise you

> I would love to hear more from Linux developers on this.

I'm not sure you are capable of listening.
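The compatibility shim in that diff can be spelled out as a small, compilable sketch. It assumes MAP_SEALABLE exists only in the proposed patch series, so on any released header it falls back to 0 and the call is an ordinary mmap(); the helper name map_sealable() is hypothetical and is there only to show how each call site would have to change.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: MAP_SEALABLE is defined only by the proposed patches.
 * Everywhere else it becomes 0, which leaves the mmap() flags unchanged. */
#ifndef MAP_SEALABLE
#define MAP_SEALABLE 0
#endif

/* Hypothetical helper: an anonymous mapping that later code may want to seal.
 * The point of the argument above is that the flag must be chosen here,
 * long before any other piece of code decides whether to seal. */
static void *map_sealable(size_t len)
{
    assert(len > 0);
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SEALABLE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

Every mmap() call site that might be sealed later would need the same one-flag change, which is the maintenance cost the message is objecting to.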
I'd like to propose a new flag to the Linux open() system call.

It is

    O_DUPABLE

You mix it with other O_* flags to the open call, everyone is familiar with this, it is very easy to use.

If the O_DUPABLE flag is set, the file descriptor may be cloned with dup(), dup2() or similar call. If not set, those calls will return with -1 EPERM.

I know it goes strongly against the grain of ancient assumptions that file descriptors (just like memory) are fully mutable, and therefore managed with care. But in these trying times, we need protection against file descriptor desecration.

It protects programmers from accidentally making clones of file descriptors and leaking them out of programs, like I dunno, runc. OK, besides this one very specific place that could (maybe) use it today, there is other code which can use this but the margin is too narrow to contain.

The documentation can describe the behaviour as similar to MAP_SEALABLE, so that no one is shocked.

/sarc
> -----Original Message----- > From: Theo de Raadt <deraadt@openbsd.org> > > I would love to hear more from Linux developers on this. > > I'm not sure you are capable of listening. > Theo, It is possible to make your technical points, and even to express frustration that it has been difficult to get them across, without resorting to personal attacks. -- Tim
* Jeff Xu <jeffxu@chromium.org> [240131 20:27]: > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett > <Liam.Howlett@oracle.com> wrote: > > > > Please add me to the Cc list of these patches. > Ok. > > > > * jeffxu@chromium.org <jeffxu@chromium.org> [240131 12:50]: > > > From: Jeff Xu <jeffxu@chromium.org> > > > > > > This patchset proposes a new mseal() syscall for the Linux kernel. > > > > > > In a nutshell, mseal() protects the VMAs of a given virtual memory > > > range against modifications, such as changes to their permission bits. > > > > > > Modern CPUs support memory permissions, such as the read/write (RW) > > > and no-execute (NX) bits. Linux has supported NX since the release of > > > kernel version 2.6.8 in August 2004 [1]. The memory permission feature > > > improves the security stance on memory corruption bugs, as an attacker > > > cannot simply write to arbitrary memory and point the code to it. The > > > memory must be marked with the X bit, or else an exception will occur. > > > Internally, the kernel maintains the memory permissions in a data > > > structure called VMA (vm_area_struct). mseal() additionally protects > > > the VMA itself is against modifications of the selected seal type. > > > > ... The v8 cut Jonathan's email discussion [1] off and > > instead of > > replying there, I'm going to add my question here. > > > > The best plan to ensure it is a general safety measure for all of linux > > is to work with the community before it lands upstream. It's much > > harder to change functionality provided to users after it is upstream. > > I'm happy to hear google is super excited about sharing this, but so > > far, the community isn't as excited. > > > > It seems Theo has a lot of experience trying to add a feature very close > > to what you are doing and has real data on how this went [2]. Can we > > see if there is a solution that is, at least, different enough from what > > he tried to do for a shot of success? 
Do we have anyone in the > > toolchain groups that sees this working well? If this means Stephen > > needs to do something, can we get that to happen please? > > > For Theo's input from OpenBSD's perspective; > IIUC: as today, the mseal-Linux and mimmutable-OpenBSD has the same > scope on what operations to seal, e.g. considering the progress made > on both sides since the beginning of the RFC: > - mseal(Linux): dropped "multiple-bit" approach. > - mimmutable(OpenBSD): Dropped "downgradable"; Added madvise(DONOTNEED). > > The difference is in mmap(), i.e. > - mseal(Linux): support of PROT_SEAL in mmap(). > - mseal(Linux): use of MAP_SEALABLE in mmap(). > > I considered Theo's inputs from OpenBSD's perspective regarding the > difference, and I wasn't convinced that Linux should remove these. In > my view, those are two different kernels code, and the difference in > Linux is not added without reasons (for MAP_SEALABLE, there is a note > in the documentation section with details). > > I would love to hear more from Linux developers on this. Linus said it was really important to get the semantics correct, but you took his (unfinished) list and kept going. I think there are some unanswered questions and that's frustrating some people as you may not be valuing the experience they have in this area. You dropped the RFC from the topic and incremented the version numbering on the patch set. I thought it was customary to restart counting after the RFC was complete? Maybe I'm wrong, but it seemed a bit odd to see that happen. The documentation also implies there are still questions to be answered, so it seems this is still an RFC in some ways? I'd like to talk about the design some more. Having to opt-in to allowing mseal will probably not work well. Initial library mappings happen in one huge chunk then it's cut up into smaller VMAs, at least that's what I see with my maple tree tracing. 
If you opt-in, then the entire library will have to opt-in and so the 'discourage inadvertent sealing' argument is not very strong.

It also makes a somewhat messy tracking of inheritance of the attribute across splitting, MAP_FIXED replacement, vma_move, vma_copy. I think most of this is forced on the user?

It makes your call less flexible, it means you have to hope that the VMA origin was blessed before you decide you want to mseal it.

What if you want to ensure the library mapped by a parent or on launch is mseal'ed?

What about the initial relocated VMA (expand/shrink of VMA)?

Creating something as "non-sealable" is pointless. If you don't want it sealed, then don't mseal() that region.

If your use case doesn't need it, then can we please drop the opt-in behaviour and just have all VMAs treated the same?

If it does need it, can you explain why?

The glibc relocation/fixup will then work. glibc could mseal once it is complete - or an application could bypass glibc support and use the feature itself.

If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the heap/stack concerns. We can either let people shoot their own feet off or try to protect them.

Right now, you seem to be trying to protect them. Keeping with that, I guess we could either get the kernel to mark those VMAs or tell some other way? I'd suggest a range, but people do very strange things with these special VMAs [1]. I don't think you can predict enough crazy actions to make a difference in trying to protect people.

There are far fewer VMAs that should not be allowed to be mseal'ed than should be, and the kernel creates those so it seems logical to only let the kernel opt-out on those ones.

I'd rather just let people shoot themselves and return an error.

I also hope it reduces the complexity of this code while increasing the flexibility of the feature. As stated before, we remove the dependency of needing support from the initial loader.

Merging VMAs

I can see this going Very Bad with brk + mseal. But, again, if someone decides to mseal these VMAs then they should expect Bad Things to happen (or maybe they know what they are doing even in some complex situation?)

vma_merge() can also expand a VMA. I think this is okay as it checks for the same flags, so you will allow VMA expansion of two (or three) vma areas to become one. Is this okay in your model?

>
> > I mean, you specifically state that this is a 'very specific
> > requirement' in your cover letter. Does this mean even other browsers
> > have no use for it?
> >
> No, I don’t mean “other browsers have no use for it”.
>
> About specific requirements from Chrome, that refers to "The lifetime
> of those mappings are not tied to the lifetime of the process, which
> is not the case of libc" as in the cover letter. This addition to the
> cover letter was made in V3, thus, it might be beneficial to provide
> additional context to help answer the question.
>
> This patch series begins with multiple-bit approaches (v1,v2,v3), the
> rationale for this is that I am uncertain if Chrome's specific needs
> are common enough for other use cases. Consequently, I am unable to
> make this decision myself without input from the community. To
> accommodate this, multiple bits are selected initially due to their
> adaptability.
>
> Since V1, after hearing from the community, Chrome has changed its
> design (no longer relying on separating out mprotect), and Linus
> acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs,
> today mseal() has a simple design that:
> - meet Chrome's specific needs.

How many VMAs will chrome have that are mseal'ed? Is this a common operation?

PROT_SEAL seems like an extra flag we could drop. I don't expect we'll be sealing enough VMAs that a handful of extra syscalls would make a difference?

> - meet Libc's needs.

What needs of libc are you referring to? I'm looking through the version changelog and I guess you mean return EPERM?

> - Chrome's specific need doesn't interfere with Libc's.
>
> [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/

Linus said he'd be happier if we made the change in general.

>
> > I am very concerned this feature will land and have to be maintained by
> > the core mm people for the one user it was specifically targeting.
> >
> See above. This feature is not specifically targeting Chrome.
>
> > Can we also get some benchmarking on the impact of this feature? I
> > believe my answer in v7 removed the worst offender, but since there is
> > no benchmarking we really are guessing (educated or not, hard data would
> > help). We still have an extra loop in madvise, mprotect_pkey, mremap_to
> > (and mremap syscall?).
> >
> Yes. There is an extra loop in mmap(FIXED), munmap(),
> madvise(DONOTNEED), mremap(), to emulate the VMAs for the given
> address range. I suspect the impact would be low, but having some hard
> data would be good. I will see what I can find to assist the perf
> testing. If you have a specific test suite in mind, I can also try it.

You should look at mmtests [2]. But since you are adding loops across VMA ranges, you need to test loops across several ranges of VMAs. That is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or some subset of small and large numbers to get an idea of complexity we are adding. My hope is that the looping will be cache-hot in the maple tree and have minimum effect.

In my personal testing, I've seen munmap often do a single VMA, or 3, or more rarely 7 on x86_64. There should be some good starting points in mmtests for the common operations.

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c
[2] https://github.com/gormanm/mmtests

Thanks,
Liam
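The measurement asked for here (one operation spanning 1, 3, 6, 12 or 24 VMAs) can be approximated from userspace with a rough sketch like the one below. It is not taken from mmtests; the function names are hypothetical, and it assumes that giving neighbouring pages different protections is enough to stop the kernel from merging them, so a range of nvmas pages ends up backed by roughly nvmas VMAs.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Map nvmas pages and alternate their protections (RW, R, RW, ...) so that
 * adjacent pages cannot share one VMA. Returns NULL on failure. */
static void *make_split_range(size_t nvmas, size_t page)
{
    char *base = mmap(NULL, nvmas * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    for (size_t i = 1; i < nvmas; i += 2) {
        if (mprotect(base + i * page, page, PROT_READ) != 0) {
            munmap(base, nvmas * page);
            return NULL;
        }
    }
    return base;
}

/* Average wall-clock nanoseconds for one munmap() call that spans the whole
 * split range. Returns -1.0 if any setup or munmap() step fails. */
static double time_munmap_ns(size_t nvmas, int iterations)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && nvmas > 0 && iterations > 0);

    double total = 0.0;
    for (int i = 0; i < iterations; i++) {
        void *base = make_split_range(nvmas, (size_t)page);
        if (base == NULL)
            return -1.0;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        int rc = munmap(base, nvmas * (size_t)page);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (rc != 0)
            return -1.0;

        total += (double)(t1.tv_sec - t0.tv_sec) * 1e9
               + (double)(t1.tv_nsec - t0.tv_nsec);
    }
    return total / iterations;
}
```

Calling time_munmap_ns() for 1, 3, 6, 12 and 24 on a kernel with and without the series applied would give a first, noisy comparison; a proper run would pin the CPU, repeat many times, and cover mprotect(), madvise() and mremap() as well.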
There is another problem with adding PROT_SEAL to the mprotect() call. What are the precise semantics?

If one reviews how mprotect() behaves, it is quickly clear that it is a very sloppy specification. We spent quite a bit of effort making our manual page as clear as possible about the most it guarantees, in the standard, and in all the various Unixes:

    Not all implementations will guarantee protection on a page basis; the
    granularity of protection changes may be as large as an entire region.
    Nor will all implementations guarantee to give exactly the requested
    permissions; more permissions may be granted than requested by prot.
    However, if PROT_WRITE was not specified then the page will not be
    writable.

Anything else is different.

That is the specification in case of PROT_READ, PROT_WRITE, and PROT_EXEC.

What happens if you add additional PROT_* flags?

Does mprotect still behave just as sloppy (as specified)?

Or does it now return an error partway through an operation?

When it returns the error, does it skip doing the work on the remaining region?

Or does it skip doing any protection operation at all? (That means the code has to do two passes over the region; first one checks if it may proceed, second pass performs the change.) I think I've read PROT_SEAL was supposed to try to do things as one pass; is that actually possible without requiring a second pass in the kernel?

To wit, do these two sequences have _exactly_ the same behaviour in all cases that we can think of
    - unmapped sub-regions
    - sealed sub-regions
    - and who knows what else mprotect() may encounter

a)

    mprotect(addr, len, PROT_READ);
    mseal(addr, len, 0);

b)

    mprotect(addr, len, PROT_READ | PROT_SEAL);

Are they the same, or are they different?

Here's what I think: mprotect() behaves quite differently if you add the PROT_SEAL flag, but I can't quite tell precisely what happens because I don't understand the linux vm system enough.

(As an outsider, I have glanced at the new PROT_MTE flag changes; that one seems to just "set a flag where possible", rather than performing an action which could result in an error, and seems to not have this problem).

As an outsider, Linux development is really strange:

Two sub-features are being pushed very hard, and the primary developer doesn't have code which uses either of them. And once it goes in, it cannot be changed.

It's very different from my world, where the absolutely minimal interface was written to apply to a whole operating system plus 10,000+ applications, and then took months of testing before it was approved for inclusion. And if it was subtly wrong, we would be able to change it.
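Sequence (a) from the message above can be written so that the caller always knows which of the two steps failed. This is only a sketch under stated assumptions: it calls mseal through the raw syscall() interface, it assumes the syscall number 462 when the installed headers do not define __NR_mseal (a value from outside this thread), and the enum and function names are hypothetical. Sequence (b) is not shown because PROT_SEAL exists only in the proposed patches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 462 is the mseal syscall number on kernels that provide it.
 * Older headers do not define __NR_mseal at all. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

enum seq_result { SEQ_OK, SEQ_MPROTECT_FAILED, SEQ_MSEAL_FAILED };

/* Sequence (a): change the protection, then seal, as two separate calls.
 * On failure errno still holds the value set by the step that failed. */
static enum seq_result protect_then_seal(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    if (mprotect(addr, len, PROT_READ) == -1)
        return SEQ_MPROTECT_FAILED;

    /* Fails with ENOSYS on kernels that have no mseal syscall. */
    if (syscall(__NR_mseal, addr, len, 0UL) == -1)
        return SEQ_MSEAL_FAILED;

    return SEQ_OK;
}
```

With two calls the failing step is unambiguous. A single combined call would have to report both kinds of failure through one errno value, which is the ambiguity the message is asking about.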
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > I would love to hear more from Linux developers on this.
>
> Linus said it was really important to get the semantics correct, but you
> took his (unfinished) list and kept going. I think there are some
> unanswered questions and that's frustrating some people as you may not
> be valuing the experience they have in this area.
>
Perhaps you didn't follow the discussions closely during the RFCs, so I'd like to clarify the timeline:

- Dec.12: RFC V3 was out for comments: [1]
  This version added MAP_SEALABLE and sealing type in mmap(). The sealing type in mmap() was suggested by Pedro Falcato during V1. [2] MAP_SEALABLE is new to V3, and I added an open discussion in the cover letter.

- Dec.14: Linus made a set of recommendations based on V3 [3]. This is where Linus mentioned the semantics. Quoted below:
  "Particularly for new system calls with fairly specialized use, I think it's very important that the semantics are sensible on a conceptual level, and that we do not add system calls that are based on "random implementation issue of the day"."

- Jan.4: I sent out V4 of that patch for comments [5]
  This version implements all of Linus's recommendations made in V3. In V3, I didn't receive comments about MAP_SEALABLE, so I kept that as an open discussion item in V4 and specifically mentioned it in the first sentence of the V4 cover letter:
  "This is V4 of the patch, the patch has improved significantly since V1, thanks to diverse inputs, a few discussions remain, please read those in the open discussion section of v4 of change history."

- Jan.4: Linus gave a comment on V4: [6]
  Quoted below:
  "Other than that, this seems all reasonable to me now."
  To me, this means Linus is OK with the general signatures of the APIs.

- Jan.9: During comments for V5 [7], Kees suggested dropping RFC from subsequent versions, given Linus's general approval on the v4.

[1] https://lore.kernel.org/all/80897.1705769947@cvs.openbsd.org/T/#mbf4749d465b80a575e1eda3c6f0c66d995abfc39
[2] https://lore.kernel.org/lkml/CAKbZUD2A+=bp_sd+Q0Yif7NJqMu8p__eb4yguq0agEcmLH8SDQ@mail.gmail.com/
[3] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
[4] https://lore.kernel.org/all/CABi2SkUTdF6PHrudHTZZ0oWK-oU+T-5+7Eqnei4yCj2fsW2jHg@mail.gmail.com/#t
[5] https://lore.kernel.org/lkml/796b6877-0548-4d2a-a484-ba4156104a20@infradead.org/T/#mb5c8bfe234759589cadf0bcee10eaa7e07b2301a
[6] https://lore.kernel.org/lkml/CAHk-=wiy0nHG9+3rXzQa=W8gM8F6-MhsHrs_ZqWaHtjmPK4=FA@mail.gmail.com/
[7] https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/T/#m657fffd96ffff91902da53dc9dbc1bb093fe367c

> You dropped the RFC from the topic and incremented the version numbering
> on the patch set. I thought it was customary to restart counting after
> the RFC was complete? Maybe I'm wrong, but it seemed a bit odd to see
> that happen. The documentation also implies there are still questions
> to be answered, so it seems this is still an RFC in some ways?
>
The RFC has been dropped since V6.

That said, I'm open to feedback from Linux developers. I will respond to the rest of your email in separate emails.

Best Regards.
-Jeff
> To me, this means Linus is OK with the general signatures of the APIs.
Linus, you are in for a shock when the proposal doesn't work for glibc
and all the applications!
On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus, you are in for a shock when the proposal doesn't work for glibc
> and all the applications!

Heh. I've enjoyed seeing your argumentative style that made you so famous back in the days. Maybe it's always been there, but I haven't seen the BSD people in so long that I'd forgotten all about it.

That said, famously argumentative or not, I think Theo is right, and I do think the MAP_SEALABLE bit is nonsensical.

If somebody wants to mseal() a memory region, why would they need to express that ahead of time?

So the part I think is sane is the mseal() system call itself, in that it allows *potential* future expansion of the semantics.

But hopefully said future expansion isn't even needed, and all users want the base experience, which is why I think PROT_SEAL (both to mmap and to mprotect) makes sense as an alternative form.

So yes, to my mind

    mprotect(addr, len, PROT_READ);
    mseal(addr, len, 0);

should basically give identical results to

    mprotect(addr, len, PROT_READ | PROT_SEAL);

and using PROT_SEAL at mmap() time is similarly the same obvious notion of "map this, and then seal that mapping".

The reason for having "mseal()" as a separate call at all from the PROT_SEAL bit is that it does allow possible future expansion (while PROT_SEAL is just a single bit, and it won't change semantics) but also so that you can do whatever prep-work in stages if you want to, and then just go "now we seal it all".

Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > So yes, to my mind > > mprotect(addr, len, PROT_READ); > mseal(addr, len, 0); > > should basically give identical results to > > mprotect(addr, len, PROT_READ | PROT_SEAL); > > and using PROT_SEAL at mmap() time is similarly the same obvious > notion of "map this, and then seal that mapping". I think that isn't easy to do. Let's expand it to show error checking. if (mprotect(addr, len, PROT_READ) == -1) react to the errno value if (mseal(addr, len, 0) == -1) react to the errno value and if (mprotect(addr, len, PROT_READ | PROT_SEAL) == -1) react to the errno value For current mprotect(), the errno values are mostly related to range issues with the parameters. After sealing a region, mprotect() also has the new errno EPERM. But what is the return value supposed to be from "PROT_READ | PROT_SEAL" over various sub-region types? Say I have a region 3 pages long. One page is unmapped, one page is regular, and one page is sealed. Re-arrange those 3 pages in all 6 permutations. Try them all. Does the returned errno change, based upon the order? Does it do part of the operation, or all of the operation? If the sealed page is first, the regular page is second, and the unmapped page is 3rd, does it return an error or return 0? Does it change the permission on the 3rd page? If it returns an error, has it changed any permissions? I don't think the diff follows the principle of if an error is returned --> we know nothing was changed. if success is returned --> we know all the requests were satisfied > The reason for having "mseal()" as a separate call at all from the > PROT_SEAL bit is that it does allow possible future expansion (while > PROT_SEAL is just a single bit, and it won't change semantics) but > also so that you can do whatever prep-work in stages if you want to, > and then just go "now we seal it all". 
How about you add a basic mseal() that is maximally compatible with mimmutable(), and then we can all talk about whether PROT_SEAL makes sense once there are applications that demand it and can prove they need it?
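The "error returned means nothing was changed" question can be probed from userspace without any sealing support. The following is a minimal sketch, not part of the patchset: it maps three pages, unmaps the middle one, and calls plain mprotect() over the whole range. The helper names `page_is_writable` and `mprotect_across_hole` are hypothetical. Whether the first page has already been made read-only when the call fails is exactly the kind of behaviour that should be tested on the kernel in question rather than assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical probe: returns 1 if the page at addr is writable, 0 if not,
 * and -1 if the probe could not run. It asks the kernel to copy one zero
 * byte into the page, so an unwritable page yields a failed read() rather
 * than a SIGSEGV in this process.
 */
int page_is_writable(void *addr)
{
    int fd = open("/dev/zero", O_RDONLY);
    if (fd == -1)
        return -1;
    ssize_t n = read(fd, addr, 1);
    close(fd);
    return n == 1;
}

/*
 * Hypothetical experiment: map three pages, unmap the middle one, then ask
 * mprotect() to make the whole range read-only. Returns the errno reported
 * by mprotect() (0 on success, -1 if the setup failed) and stores in
 * *first_still_writable whether page 0 kept its old permission.
 */
int mprotect_across_hole(int *first_still_writable)
{
    assert(first_still_writable != NULL);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, 3 * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    if (munmap(base + page, page) == -1) {
        munmap(base, 3 * page);
        return -1;
    }

    int err = 0;
    if (mprotect(base, 3 * page, PROT_READ) == -1)
        err = errno;

    /* Inspect the state left behind by the (possibly failed) call. */
    *first_still_writable = page_is_writable(base);

    munmap(base, page);
    munmap(base + 2 * page, page);
    return err;
}
```

Calling `mprotect_across_hole(&w)` and printing both the returned errno and `w` shows whether a failed call left the first page untouched. Repeating the experiment with the pages in a different order, or with a sealed page in place of the hole, would cover the permutations asked about above.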
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> and using PROT_SEAL at mmap() time is similarly the same obvious
> notion of "map this, and then seal that mapping".

The usual way is:

    ptr = mmap(NULL, len, PROT_READ|PROT_WRITE, ...)
    initialize region between ptr, ptr+len
    mprotect(ptr, len, PROT_READ)
    mseal(ptr, len, 0);

Our source tree contains one place where a locking happens very close to a mmap(). It is the shared-library-linker 'hints file'; this is a file that gets mapped PROT_READ and then we lock it. It feels like that could be one operation? It can't be.

    addr = (void *)mmap(0, hsize, PROT_READ, MAP_PRIVATE, hfd, 0);
    if (_dl_mmap_error(addr))
        goto bad_hints;
    hheader = (struct hints_header *)addr;
    if (HH_BADMAG(*hheader) || hheader->hh_ehints > hsize)
        goto bad_hints;
    /* couple more error checks */
    mimmutable(addr, hsize);
    close(hfd);
    return (0);
bad_hints:
    munmap(addr, hsize);
    ...

See the problem? It unmaps it if the contents are broken. So even that case cannot use something like "PROT_SEAL".

These are not hypotheticals. I'm grepping an entire Unix kernel and userland source tree, and I know what 100,000+ applications do. I found a piece of code that could almost use it, but upon inspection it can't, and it is obvious why: it is the best idiom to allow a programmer to insert an inspection operation between two distinct operations, and especially critical if the 2nd operation cannot be reversed.

No one needs PROT_SEAL as a shortcut operation in mmap() or mprotect(). Throwing around ideas without proving their use in practice is very unscientific.
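The staged idiom described above (map, initialize, inspect, drop write permission, and only then seal) can be sketched for Linux as follows. This is an illustration under stated assumptions, not code from the patchset: `seal_range`, `has_hint_magic` and `publish_readonly` are hypothetical names, and the syscall number 462 is an assumption about the number mseal() was later given. On a kernel without the syscall the seal step simply fails and the mapping stays unsealed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 462 is the syscall number mseal() received upstream. */
#ifndef __NR_mseal
#define __NR_mseal 462
#endif

/* Hypothetical wrapper: returns 0 on success, -1 with errno set. */
int seal_range(void *addr, size_t len)
{
    return (int)syscall(__NR_mseal, addr, len, 0UL);
}

/* Hypothetical validator standing in for the hints-file header checks. */
int has_hint_magic(const void *buf, size_t len)
{
    return len >= 4 && memcmp(buf, "HINT", 4) == 0;
}

/*
 * Hypothetical helper: copies src into a fresh mapping, inspects it, makes
 * it read-only, and only then tries to seal it. Returns NULL (and leaves
 * nothing mapped) if any reversible step fails. *sealed is set to 1 only
 * when the irreversible seal step succeeded.
 */
void *publish_readonly(const void *src, size_t len,
                       int (*is_valid)(const void *, size_t), int *sealed)
{
    assert(is_valid != NULL && sealed != NULL);
    *sealed = 0;

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t maplen = (len + page - 1) / page * page;
    void *p = mmap(NULL, maplen, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    memcpy(p, src, len);

    /* Inspection point: everything up to here can still be undone. */
    if (!is_valid(p, len) || mprotect(p, maplen, PROT_READ) == -1) {
        munmap(p, maplen);
        return NULL;
    }

    /* Irreversible step goes last, after all checks have passed. */
    if (seal_range(p, maplen) == 0)
        *sealed = 1;
    return p;
}
```

The point of the ordering is that the `munmap()` on the error path is only legal because nothing has been sealed yet. A combined map-and-seal flag would remove that inspection point.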
On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote: > As an outsider, Linux development is really strange: > > Two sub-features are being pushed very hard, and the primary developer > doesn't have code which uses either of them. And once it goes in, it > cannot be changed. > > It's very different from my world, where the absolutely minimal > interface was written to apply to a whole operating system plus 10,000+ > applications, and then took months of testing before it was approved for > inclusion. And if it was subtly wrong, we would be able to change it. No, it's this "feature" submission that is strange to think that we don't need that. We do need, and will require, an actual working userspace something to use it, otherwise as you say, there's no way to actually know if it works properly or not and we can't change it once we accept it. So along those lines, Jeff, do you have a pointer to the Chrome patches, or glibc patches, that use this new interface that proves that it actually works? Those would be great to see to at least verify it's been tested in a real-world situation and actually works for your use case. thanks, greg k-h
On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]: > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett > > <Liam.Howlett@oracle.com> wrote: > > > > > Having to opt-in to allowing mseal will probably not work well. I'm leaving the opt-in discussion in Linus's thread. > Initial library mappings happen in one huge chunk then it's cut up into > smaller VMAs, at least that's what I see with my maple tree tracing. If > you opt-in, then the entire library will have to opt-in and so the > 'discourage inadvertent sealing' argument is not very strong. > Regarding "The initial library mappings happen in one huge chunk then it is cut up into smaller VMAs", this is not a problem. As an example of ELF loading (fs/binfmt_elf.c), there are just a few places to pass in what type of memory to be allocated, e.g. MAP_PRIVATE, MAP_FIXED_NOREPLACE, and we can add MAP_SEALABLE at those places. If glibc does additional splitting on the memory range, by using mprotect(), then the MAP_SEALABLE is automatically applied after splitting. If glibc uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE). > It also makes a somewhat messy tracking of inheritance of the attribute > across splitting, MAP_FIXED replacement, vma_move, vma_copy. I think > most of this is forced on the user? > The inheritance is the same as other VMA flags. > It makes your call less flexible, it means you have to hope that the VMA > origin was blessed before you decide you want to mseal it. > > What if you want to ensure the library mapped by a parent or on launch > is mseal'ed? > > What about the initial relocated VMA (expand/shrink of VMA)? > > Creating something as "non-sealable" is pointless. If you don't want it > sealed, then don't mseal() that region. > > If your use case doesn't need it, then can we please drop the opt-in > behaviour and just have all VMAs treated the same? > > If it does need it, can you explain why?
> > The glibc relocation/fixup will then work. glibc could mseal once it is > complete - or an application could bypass glibc support and use the > feature itself. Yes. That is the idea. > > If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the > heap/stack concerns. We can either let people shoot their own feet off > or try to protect them. > > Right now, you seem to be trying to protect them. Keeping with that, I > guess we could either get the kernel to mark those VMAs or tell some > other way? I'd suggest a range, but people do very strange things with > these special VMAs [1]. I don't think you can predict enough crazy > actions to make a difference in trying to protect people. > > There are far fewer VMAs that should not be allowed to be mseal'ed than > should be, and the kernel creates those so it seems logical to only let > the kernel opt-out on those ones. > > I'd rather just let people shoot themselves and return an error. > > I also hope it reduces the complexity of this code while increasing the > flexibility of the feature. As stated before, we remove the dependency > of needing support from the initial loader. > > Merging VMAs > I can see this going Very Bad with brk + mseal. But, again, if someone > decides to mseal these VMAs then they should expect Bad Things to > happen (or maybe they know what they are doing even in some complex > situation?) > > vma_merge() can also expand a VMA. I think this is okay as it checks > for the same flags, so you will allow VMA expansion of two (or three) > vma areas to become one. Is this okay in your model? > > > > > > I mean, you specifically state that this is a 'very specific > > > requirement' in your cover letter. Does this mean even other browsers > > > have no use for it? > > > > > No, I don’t mean “other browsers have no use for it”. 
> > > > About specific requirements from Chrome, that refers to "The lifetime > > of those mappings are not tied to the lifetime of the process, which > > is not the case of libc" as in the cover letter. This addition to the > > cover letter was made in V3, thus, it might be beneficial to provide > > additional context to help answer the question. > > > > This patch series begins with multiple-bit approaches (v1,v2,v3), the > > rationale for this is that I am uncertain if Chrome's specific needs > > are common enough for other use cases. Consequently, I am unable to > > make this decision myself without input from the community. To > > accommodate this, multiple bits are selected initially due to their > > adaptability. > > > > Since V1, after hearing from the community, Chrome has changed its > > design (no longer relying on separating out mprotect), and Linus > > acknowledged the defect of madvise(DONOTNEED) [1]. With those inputs, > > today mseal() has a simple design that: > > - meet Chrome's specific needs. > > How many VMAs will chrome have that are mseal'ed? Is this a common > operation? > > PROT_SEAL seems like an extra flag we could drop. I don't expect we'll > be sealing enough VMAs that a hand full of extra syscalls would make a > difference? > > > - meet Libc's needs. > > What needs of libc are you referring to? I'm looking through the > version changelog and I guess you mean return EPERM? > I meant libc's sealing RO part of the elf binary, those memory's lifetime are associated with the lifetime of the process. > > - Chrome's specific need doesn't interfere with Libc's. > > > > [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/ > > Linus said he'd be happier if we made the change in general. > > > > > > I am very concerned this feature will land and have to be maintained by > > > the core mm people for the one user it was specifically targeting. > > > > > See above. 
This feature is not specifically targeting Chrome. > > > > > Can we also get some benchmarking on the impact of this feature? I > > > believe my answer in v7 removed the worst offender, but since there is > > > no benchmarking we really are guessing (educated or not, hard data would > > > help). We still have an extra loop in madvise, mprotect_pkey, mremap_to > > > (and mreamp syscall?). > > > > > Yes. There is an extra loop in mmap(FIXED), munmap(), > > madvise(DONOTNEED), mremap(), to emulate the VMAs for the given > > address range. I suspect the impact would be low, but having some hard > > data would be good. I will see what I can find to assist the perf > > testing. If you have a specific test suite in mind, I can also try it. > > You should look at mmtests [2]. But since you are adding loops across > VMA ranges, you need to test loops across several ranges of VMAs. That > is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or > some subset of small and large numbers to get an idea of complexity we > are adding. My hope is that the looping will be cache-hot in the maple > tree and have minimum effect. > > In my personal testing, I've seen munmap often do a single VMA, or 3, or > more rare 7 on x86_64. There should be some good starting points in > mmtests for the common operations. > Thanks. Will do. > [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c > [2] https://github.com/gormanm/mmtests > > Thanks, > Liam
On Thu, Feb 1, 2024 at 3:15 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote: > > > > Linus, you are in for a shock when the proposal doesn't work for glibc > > and all the applications! > > Heh. I've enjoyed seeing your argumentative style that made you so > famous back in the days. Maybe it's always been there, but I haven't > seen the BSD people in so long that I'd forgotten all about it. > > That said, famously argumentative or not, I think Theo is right, and I > do think the MAP_SEALABLE bit is nonsensical. > > If somebody wants to mseal() a memory region, why would they need to > express that ahead of time? > I like to look at things from the point of view of average Linux userspace developers, they might not have the same level of expertise as the other folks on this email list or they might not have time and mileage for those details. To me, the most important thing is to deliver a feature that's easy to use and works well. I don't want users to mess things up, so if I'm the one giving them the tools, I'm going to make sure they have all the information they need and that there are safeguards in place. e.g. considering the following user case: 1> a security sensitive data is allocated from heap, using malloc, from the software component A, and filled with information. 2> software component B then uses mprotect to change it to RO, and seal it using mseal(). Yes. we could choose to allow it. But there are complications: 1> Is this the right pattern ? why don't component A already seal it if they think it is important ? 2> Why heap, why not mmap() a new memory mapping for that security data ? 3> free() will not respect the situation of whether the memory is sealed or not. How would a new developer know they probably shall never free the sealed memory ? 
4> brk-shrink will never be able to pass the VMA that gets splited out by mseal(), there are memory footprint implications to the process. 5> what if the security sensitive data happens to be the first VMA or last VMA of the heap, will sealing the first VMA/last VMA cause any issue there ? since they might carry important VMA flags ? ( I don't know enough about brk.) 6> If we ever support sealing the heap for its entirety (make it not executable), and still want to support other brk behaviors, such as shrink/grow, would that conflict with current mseal(), if we allow it on heap from beginning ? Questions like that, without clear answers, to me it is premature to already let developers start using mseal() for heap. And even if we have all the answers for heap, how about stack, or other types of virtual memory ? Again, I don't have enough knowledge to get a complete list that shouldn't be sealed, the input from Theo is none should I worry about. However it is clearly not none to me, besides heap mentioned, there is also aio/shm. So MAP_SEALABLE is a conservative approach to limit the scope to *** two known use cases *** that I want to work on (libc and chrome) and give time needed to answer those questions. It is like a claim: only those marked by MAP_SEALABLE support the sealing at this point of time. And MAP_SEALABLE is reversible, e.g. a sysctl could be added to make all memory sealable in the future, or we could obsoleted it entirely when time comes, an application that already passes MAP_SEALABLE can be treated as noop. However, if all memory were allowed to be sealable from the beginning, reversing that decision would be hard. After those considerations, if MAP_SEALABLE is still not preferred by you. Then I have the following options for you to choose: 1. MAP_NOT_SEALABLE in the mmap(). And I will use them for the heap/aio/shm case. 
This basically says Linux does not officially support sealing on those, until we support them, we discourage the sealing on those mappings. 2. make MAP_NOT_SEALABLE only a kernel visible flag. So application space won't be able to use it. 3. open for all, and list as much as details in the documentation. If we choose this route, I would like to have more discussion on the heap/stack, at least the Linux developers will learn from those discussions. > So the part I think is sane is the mseal() system call itself, in that > it allows *potential* future expansion of the semantics. > > But hopefully said future expansion isn't even needed, and all users > want the base experience, which is why I think PROT_SEAL (both to mmap > and to mprotect) makes sense as an alternative form. > > So yes, to my mind > > mprotect(addr, len, PROT_READ); > mseal(addr, len, 0); > > should basically give identical results to > > mprotect(addr, len, PROT_READ | PROT_SEAL); > > and using PROT_SEAL at mmap() time is similarly the same obvious > notion of "map this, and then seal that mapping". > > The reason for having "mseal()" as a separate call at all from the > PROT_SEAL bit is that it does allow possible future expansion (while > PROT_SEAL is just a single bit, and it won't change semantics) but > also so that you can do whatever prep-work in stages if you want to, > and then just go "now we seal it all". > To clarify: do you mean to have the following ? mmap(PROT_READ|PROT_SEAL) mseal(addr,len,0) mprotect(addr,len,PROT_READ|PROT_SEAL) ? I have to think about the mprotect() case. For mmap(PROT_READ|PROT_SEAL), I might have a use case already: fs/binfmt_elf.c if (current->personality & MMAP_PAGE_ZERO) { /* Why this, you ask??? Well SVr4 maps page 0 as read-only, and some applications "depend" upon this behavior. Since we do not have the power to recompile these, we emulate the SVr4 behavior. Sigh. 
*/ error = vm_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_EXEC, <-- add PROT_SEAL MAP_FIXED | MAP_PRIVATE, 0); } I don't see the benefit of RWX page 0, which might make a null pointers error to become executable for some code. Best Regards, -Jeff > Linus
On Thu, Feb 1, 2024 at 5:06 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote: > > As an outsider, Linux development is really strange: > > > > Two sub-features are being pushed very hard, and the primary developer > > doesn't have code which uses either of them. And once it goes in, it > > cannot be changed. > > > > It's very different from my world, where the absolutely minimal > > interface was written to apply to a whole operating system plus 10,000+ > > applications, and then took months of testing before it was approved for > > inclusion. And if it was subtly wrong, we would be able to change it. > > No, it's this "feature" submission that is strange to think that we > don't need that. We do need, and will require, an actual working > userspace something to use it, otherwise as you say, there's no way to > actually know if it works properly or not and we can't change it once we > accept it. > > So along those lines, Jeff, do you have a pointer to the Chrome patches, > or glibc patches, that use this new interface that proves that it > actually works? Those would be great to see to at least verify it's > been tested in a real-world situation and actually works for your use > case. > The MAP_SEALABLE is raised because of other concerns not related to libc. The patch Stephan developed was based on V1 of the patch, IIRC, which is really ancient, and it is not based on MAP_SEALABLE, which is a more recent development entirely from me. I don't see unresolvable problems with glibc though, E.g. For the elf case (binfmt_elf.c), there are two places I need to add MAP_SEALABLE, then the memory to user space is marked with sealable. There might be cases where glibc needs to add MAP_SEALABLE it uses mmap(FIXED) to split the memory. 
If the decision of MAP_SEALABLE depends on the glibc case being able to use it, we can develop such a patch, but it will take a while, say a few weeks to months, due to vacation, work load, etc. Best Regards, -Jeff > thanks, > > greg k-h
On Thu, 1 Feb 2024 at 19:24, Jeff Xu <jeffxu@chromium.org> wrote: > > The patch Stephan developed was based on V1 of the patch, IIRC, which > is really ancient, and it is not based on MAP_SEALABLE, which is a > more recent development entirely from me. So the problem with this whole patch series from the very beginning was that it was very specialized, and COMPLETELY OVER-ENGINEERED. It got simpler at one point. And then you started adding these features that have absolutely no reason for them. Again. It's frustrating. And it's not making it more likely to be ever merged. Linus
On Thu, Feb 1, 2024 at 7:29 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Thu, 1 Feb 2024 at 19:24, Jeff Xu <jeffxu@chromium.org> wrote: > > > > The patch Stephan developed was based on V1 of the patch, IIRC, which > > is really ancient, and it is not based on MAP_SEALABLE, which is a > > more recent development entirely from me. > > So the problem with this whole patch series from the very beginning > was that it was very specialized, and COMPLETELY OVER-ENGINEERED. > > It got simpler at one point. And then you started adding these > features that have absolutely no reason for them. Again. > > It's frustrating. And it's not making it more likely to be ever merged. > I'm sorry for over-thinking. Remove the MAP_SEALABLE it is then. Keep with mseal(addr,len,0) only ? -Jeff >
Jeff Xu <jeffxu@google.com> wrote:

> To me, the most important thing is to deliver a feature that's easy to
> use and works well. I don't want users to mess things up, so if I'm
> the one giving them the tools, I'm going to make sure they have all
> the information they need and that there are safeguards in place.
>
> e.g. considering the following user case:
> 1> a security sensitive data is allocated from heap, using malloc,
> from the software component A, and filled with information.
> 2> software component B then uses mprotect to change it to RO, and
> seal it using mseal().

    p = malloc(80);
    mprotect(p & ~4095, 4096, PROT_NONE);
    free(p);

Will you save such a developer also? No.

Since the same problem you describe already exists with mprotect() what does mseal() even have to do with your proposal?

What about this?

    p = malloc(80);
    munmap(p & ~4095, 4096);
    free(p);

And since it is not sealed, how about madvise operations on a proper non-malloc memory allocation? Well, the process smashes its own memory. And why is it not sealed? You make it harder to seal memory!

How about this?

    p = malloc(80);
    bzero(p, 100000);

Yes it is a buffer overflow. But this is all the same class of software problem:

Memory belongs to processes, which belongs to the program, which is coded by the programmer, who has to learn to be careful and handle the memory correctly.

mseal() / mimmutable() add *no new expectation* to a careful programmer, because they expected to only use it on memory that they *promise will never be de-allocated or re-permissioned*.

What you are proposing is not a "mitigation", it entirely cripples the proposed subsystem because you are afraid of it; because you have cloned a memory subsystem primitive you don't fully understand; and this is because you've not seen a complete operating system using it.

When was the last time you developed outside of Chrome?

This is systems programming. The kernel supports all the programs, not just the one holy program from god.
On Thu, Feb 1, 2024 at 8:05 PM Theo de Raadt <deraadt@openbsd.org> wrote: > > Jeff Xu <jeffxu@google.com> wrote: > > > To me, the most important thing is to deliver a feature that's easy to > > use and works well. I don't want users to mess things up, so if I'm > > the one giving them the tools, I'm going to make sure they have all > > the information they need and that there are safeguards in place. > > > > e.g. considering the following user case: > > 1> a security sensitive data is allocated from heap, using malloc, > > from the software component A, and filled with information. > > 2> software component B then uses mprotect to change it to RO, and > > seal it using mseal(). > > p = malloc(80); > mprotect(p & ~4095, 4096, PROT_NONE); > free(p); > > Will you save such a developer also? No. > > Since the same problem you describe already exists with mprotect() what > does mseal() even have to do with your proposal? > > What about this? > > p = malloc(80); > munmap(p & ~4095, 4096); > free(p); > > And since it is not sealed, how about madvise operations on a proper > non-malloc memory allocation? Well, the process smashes it's own > memory. And why is it not sealed? You make it harder to seal memory! > > How about this? > > p = malloc(80); > bzero(p, 100000; > > Yes it is a buffer overflow. But this is all the same class of software > problem: > > Memory belongs to processes, which belongs to the program, which is coded > by the programmer, who has to learn to be careful and handle the memory correctly. > > mseal() / mimmutable() add *no new expectation* to a careful programmer, > because they expected to only use it on memory that they *promise will never > be de-allocated or re-permissioned*. 
> > What you are proposing is not a "mitigation", it entirely cripples the > proposed subsystem because you are afraid of it; because you have cloned a > memory subsystem primitive you don't fully understand; and this is because > you've not seen a complete operating system using it. > > When was the last time you developed outside of Chrome? > > This is systems programming. The kernel supports all the programs, not > just the one holy program from god. > Even without free. I personally do not like the heap getting sealed like that. Component A. p=malloc(4096); writing something to p. Component B: mprotect(p,4096, RO) mseal(p,4096) This will split the heap VMA, and prevent the heap from shrinking, if this is in a frequent code path, then it might hurt the process's memory usage. The existing code is more likely to use malloc than mmap(), so it is easier for dev to seal a piece of data belonging to another component. I hope this pattern is not wide-spreading. The ideal way will be just changing the library A to use mmap.
Jeff Xu <jeffxu@chromium.org> wrote:

> Even without free.
> I personally do not like the heap getting sealed like that.
>
> Component A.
> p=malloc(4096);
> writing something to p.
>
> Component B:
> mprotect(p,4096, RO)
> mseal(p,4096)
>
> This will split the heap VMA, and prevent the heap from shrinking, if
> this is in a frequent code path, then it might hurt the process's
> memory usage.
>
> The existing code is more likely to use malloc than mmap(), so it is
> easier for dev to seal a piece of data belonging to another component.
> I hope this pattern is not wide-spreading.
>
> The ideal way will be just changing the library A to use mmap.

I think you are lacking some test programs to see how it actually behaves; the effect is worse than you think, and the impact is immediately visible to the programmer, and the lesson is clear:

you can only seal objects which you guarantee never get recycled.

Pushing a sealed object back into reuse is a disastrous bug.

No one should call this interface, unless they understand that.

I'll say again, you don't have a test program for various allocators to understand how it behaves. The failure modes described in your documents are not correct.
* Jeff Xu <jeffxu@google.com> [240201 22:15]: > On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]: > > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett > > > <Liam.Howlett@oracle.com> wrote: > > > > > > > > Having to opt-in to allowing mseal will probably not work well. > I'm leaving the opt-in discussion in Linus's thread. > > > Initial library mappings happen in one huge chunk then it's cut up into > > smaller VMAs, at least that's what I see with my maple tree tracing. If > > you opt-in, then the entire library will have to opt-in and so the > > 'discourage inadvertent sealing' argument is not very strong. > > > Regarding "The initial library mappings happen in one huge chunk then > it is cut up into smaller VMAs", this is not a problem. > > As an example of ELF loading (fs/binfmt_elf.c), there are just a few > places to pass in what type of memory to be allocated, e.g. > MAP_PRIVATE, MAP_FIXED_NOREPLACE, and we can add MAP_SEALABLE at those > places. > If glibc does additional splitting on the memory range, by using > mprotect(), then the MAP_SEALABLE is automatically applied after > splitting. > If glibc uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE). You are adding a flag that requires a new glibc. When I try to point out how this is unnecessary and excessive, you tell me it's fine and probably not a whole lot of work. This isn't working with developers, you are dismissing the developers who are trying to help you. Can you please: Provide code that uses this feature. Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and 32 VMAs. Provide code that tests and checks the failure paths. Failures at the start, middle, and end of the modifications. Document what happens in those failure paths. And, most importantly: keep an open mind and allow your opinion to change when presented with new information. All of these things are to help you.
We need to know what needs fixing so you can be successful. Thanks, Liam
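A benchmark of the kind requested above needs a way to build an address range made of a known number of VMAs and to time one operation across it. The following is a rough sketch of such a harness, not a result and not code from the series: `make_split_range` and `avg_madvise_ns` are hypothetical names, and madvise(MADV_DONTNEED) is picked only because it is one of the calls the thread says gains an extra loop. Run on an unpatched kernel it gives a baseline; the same program on a patched kernel would give the comparison.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: maps nvmas pages and gives every second page a
 * different protection so that neighbouring pages cannot be merged into a
 * single VMA. Returns the base address, or MAP_FAILED on error.
 */
void *make_split_range(size_t nvmas, size_t *len_out)
{
    assert(len_out != NULL);

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = nvmas * page;
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    for (size_t i = 1; i < nvmas; i += 2) {
        if (mprotect(base + i * page, page, PROT_READ) == -1) {
            munmap(base, len);
            return MAP_FAILED;
        }
    }
    *len_out = len;
    return base;
}

/*
 * Hypothetical measurement: average wall-clock nanoseconds for one
 * madvise(MADV_DONTNEED) call spanning nvmas VMAs. Returns a negative
 * value if the setup or any call fails.
 */
double avg_madvise_ns(size_t nvmas, int iterations)
{
    if (iterations <= 0)
        return -1.0;

    size_t len = 0;
    void *base = make_split_range(nvmas, &len);
    if (base == MAP_FAILED)
        return -1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        if (madvise(base, len, MADV_DONTNEED) == -1) {
            munmap(base, len);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    munmap(base, len);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9 +
                (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / iterations;
}
```

A driver would call `avg_madvise_ns(n, iterations)` for n in 1, 2, 4, 8, 16 and 32 and report the averages side by side. For real numbers the mmtests suite mentioned earlier in the thread is the better starting point; this sketch only shows how the VMA count can be controlled.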
On Thu, Feb 01, 2024 at 07:24:02PM -0800, Jeff Xu wrote: > On Thu, Feb 1, 2024 at 5:06 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > On Thu, Feb 01, 2024 at 03:24:40PM -0700, Theo de Raadt wrote: > > > As an outsider, Linux development is really strange: > > > > > > Two sub-features are being pushed very hard, and the primary developer > > > doesn't have code which uses either of them. And once it goes in, it > > > cannot be changed. > > > > > > It's very different from my world, where the absolutely minimal > > > interface was written to apply to a whole operating system plus 10,000+ > > > applications, and then took months of testing before it was approved for > > > inclusion. And if it was subtly wrong, we would be able to change it. > > > > No, it's this "feature" submission that is strange to think that we > > don't need that. We do need, and will require, an actual working > > userspace something to use it, otherwise as you say, there's no way to > > actually know if it works properly or not and we can't change it once we > > accept it. > > > > So along those lines, Jeff, do you have a pointer to the Chrome patches, > > or glibc patches, that use this new interface that proves that it > > actually works? Those would be great to see to at least verify it's > > been tested in a real-world situation and actually works for your use > > case. > > > The MAP_SEALABLE is raised because of other concerns not related to libc. > > The patch Stephan developed was based on V1 of the patch, IIRC, which > is really ancient, and it is not based on MAP_SEALABLE, which is a > more recent development entirely from me. > > I don't see unresolvable problems with glibc though, E.g. For the > elf case (binfmt_elf.c), there are two places I need to add > MAP_SEALABLE, then the memory to user space is marked with sealable. > There might be cases where glibc needs to add MAP_SEALABLE it uses > mmap(FIXED) to split the memory. 
> > If the decision of MAP_SEALABLE depends on the glibc case being able to > use it, we can develop such a patch, but it will take a while, say a > few weeks to months, due to vacation, work load, etc. There's no rush here, and no deadlines in kernel development. If you don't have a working userspace user for your new feature(s), there is no way we can accept the changes to the kernel (and hint, you don't want us to either...) good luck! greg k-h
Another interaction to consider is sigaltstack().

In OpenBSD, sigaltstack() forces MAP_STACK onto the specified (pre-allocated) region, because on kernel-entry we require the "sp" register to point to a MAP_STACK region (this severely damages ROP pivot methods). Linux does not have MAP_STACK enforcement (yet), but one day someone may try to do that work.

This interacted poorly with mimmutable() because some applications allocate the memory being provided poorly. I won't get into the details unless pushed, because what we found makes me upset. Over the years, we've upstreamed diffs to applications to resolve all the nasty allocation patterns. I think the software ecosystem is now mostly clean.

I suggest someone in Linux look into whether sigaltstack() is an mseal() bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the correct strategy.

This is our documented strategy:

    On OpenBSD some additional restrictions prevent dangerous address space
    modifications. The proposed space at ss_sp is verified to be
    contiguously mapped for read-write permissions (no execute) and
    incapable of syscall entry (see msyscall(2)). If those conditions are
    met, a page-aligned inner region will be freshly mapped (all zero) with
    MAP_STACK (see mmap(2)), destroying the pre-existing data in the region.
    Once the sigaltstack is disabled, the MAP_STACK attribute remains on the
    memory, so it is best to deallocate the memory via a method that results
    in munmap(2).

OK, I better provide the details of what people were doing: sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the stack; we even found one creating a sigaltstack inside a buffer on a pthread stack. We told everyone to use mmap() and munmap(), with MAP_STACK if #ifdef MAP_STACK finds a definition.
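The advice in the last paragraph (allocate the alternate signal stack with mmap(), add MAP_STACK when the platform defines it, and release it with munmap()) might look like the sketch below. The function names `install_alt_stack` and `remove_alt_stack` and the 64 KiB minimum size are illustrative choices, not taken from any of the applications mentioned.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>

/* Use MAP_STACK only where the platform provides it. */
#ifdef MAP_STACK
#define ALT_STACK_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK)
#else
#define ALT_STACK_FLAGS (MAP_PRIVATE | MAP_ANONYMOUS)
#endif

/*
 * Hypothetical helper: gives the calling thread an alternate signal stack
 * that lives in its own mapping (not in .data, .bss, malloc memory, or a
 * buffer on another stack). On success returns 0 and fills *out so the
 * caller can later release the stack.
 */
int install_alt_stack(stack_t *out)
{
    assert(out != NULL);

    size_t len = (size_t)SIGSTKSZ;
    if (len < 65536)
        len = 65536;            /* illustrative lower bound */

    void *sp = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    ALT_STACK_FLAGS, -1, 0);
    if (sp == MAP_FAILED)
        return -1;

    out->ss_sp = sp;
    out->ss_size = len;
    out->ss_flags = 0;
    if (sigaltstack(out, NULL) == -1) {
        munmap(sp, len);
        return -1;
    }
    return 0;
}

/*
 * Hypothetical counterpart: disables the alternate stack first, then
 * returns its memory with munmap() so the mapping is never recycled by an
 * allocator.
 */
int remove_alt_stack(stack_t *st)
{
    assert(st != NULL);

    stack_t off = { .ss_flags = SS_DISABLE };
    if (sigaltstack(&off, NULL) == -1)
        return -1;
    return munmap(st->ss_sp, st->ss_size);
}
```

Because the stack has a mapping of its own, nothing else shares its pages, which matters if the region ever becomes subject to MAP_STACK enforcement or sealing.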
On Fri, Feb 2, 2024 at 7:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Jeff Xu <jeffxu@google.com> [240201 22:15]: > > On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]: > > > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett > > > > <Liam.Howlett@oracle.com> wrote: > > > > > > > > > > > Having to opt-in to allowing mseal will probably not work well. > > I'm leaving the opt-in discussion in Linus's thread. > > > > > Initial library mappings happen in one huge chunk then it's cut up into > > > smaller VMAs, at least that's what I see with my maple tree tracing. If > > > you opt-in, then the entire library will have to opt-in and so the > > > 'discourage inadvertent sealing' argument is not very strong. > > > > > Regarding "The initial library mappings happen in one huge chunk then > > it is cut up into smaller VMAs", this is not a problem. > > > > As an example of ELF loading (fs/binfmt_elf.c), there are just a few > > places to pass in what type of memory to be allocated, e.g. > > MAP_PRIVATE, MAP_FIXED_NOREPLACE, and we can add MAP_SEALABLE at those > > places. > > If glibc does additional splitting on the memory range, by using > > mprotect(), then the MAP_SEALABLE is automatically applied after > > splitting. > > If glibc uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE). > > You are adding a flag that requires a new glibc. When I try to point > out how this is unnecessary and excessive, you tell me it's fine and > probably not a whole lot of work. > > This isn't working with developers, you are dismissing the developers > who are trying to help you. > > Can you please: > > Provide code that uses this feature. > > Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and > 32 VMAs. > I will prepare for the benchmark tests. > Provide code that tests and checks the failure paths.
Failures at the > start, middle, and end of the modifications. > Regarding "Failures at the start, middle, and end of the modifications": the current implementation checks whether sealing applies before any VMA is actually modified, so partial modifications are avoided in mprotect, mremap, and munmap. There are test cases in the selftests to cover the failure path, including the beginning, middle and end of VMAs: test_seal_unmapped_start, test_seal_unmapped_middle, test_seal_unmapped_end, test_seal_invalid_input, test_seal_start_mprotect, test_seal_end_mprotect, etc. Are those what you are looking for? > Document what happens in those failure paths. > > And, most importantly: keep an open mind and allow your opinion to > change when presented with new information. > > All of these things are to help you. We need to know what needs fixing > so you can be successful. > Thanks for the feedback. I sincerely hope for more help like this so that this syscall can be useful. Thanks. Best Regards, -Jeff > > Thanks, > Liam
On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote: > > Jeff Xu <jeffxu@chromium.org> wrote: > > > Even without free. > > I personally do not like the heap getting sealed like that. > > > > Component A. > > p=malloc(4096); > > writing something to p. > > > > Component B: > > mprotect(p,4096, RO) > > mseal(p,4096) > > > > This will split the heap VMA, and prevent the heap from shrinking, if > > this is in a frequent code path, then it might hurt the process's > > memory usage. > > > > The existing code is more likely to use malloc than mmap(), so it is > > easier for dev to seal a piece of data belonging to another component. > > I hope this pattern is not widespread. > > > > The ideal way will be just changing the library A to use mmap. > > I think you are lacking some test programs to see how it actually > behaves; the effect is worse than you think, and the impact is immediately > visible to the programmer, and the lesson is clear: > > you can only seal objects which you guarantee never get recycled. > > Pushing a sealed object back into reuse is a disastrous bug. > > No one should call this interface, unless they understand that. > > I'll say again, you don't have a test program for various allocators to > understand how it behaves. The failure modes described in your documents > are not correct. > I understand what you mean. I will add that part to the document: Trying to recycle sealed memory is disastrous, e.g. p=malloc(4096); mprotect(p,4096,RO) mseal(p,4096) free(p); My point is: I think sealing an object from the heap is a bad pattern in general, even if the dev doesn't free it. That was one of the reasons for the sealable flag. I hope saying this isn't perceived as looking for excuses. >
On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote: > > On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote: > > > > Jeff Xu <jeffxu@chromium.org> wrote: > > > > > Even without free. > > > I personally do not like the heap getting sealed like that. > > > > > > Component A. > > > p=malloc(4096); > > > writing something to p. > > > > > > Component B: > > > mprotect(p,4096, RO) > > > mseal(p,4096) > > > > > > This will split the heap VMA, and prevent the heap from shrinking, if > > > this is in a frequent code path, then it might hurt the process's > > > memory usage. > > > > > > The existing code is more likely to use malloc than mmap(), so it is > > > easier for dev to seal a piece of data belonging to another component. > > > I hope this pattern is not widespread. > > > > > > The ideal way will be just changing the library A to use mmap. > > > > I think you are lacking some test programs to see how it actually > > behaves; the effect is worse than you think, and the impact is immediately > > visible to the programmer, and the lesson is clear: > > > > you can only seal objects which you guarantee never get recycled. > > > > Pushing a sealed object back into reuse is a disastrous bug. > > > > No one should call this interface, unless they understand that. > > > > I'll say again, you don't have a test program for various allocators to > > understand how it behaves. The failure modes described in your documents > > are not correct. > > > I understand what you mean. I will add that part to the document: > Trying to recycle sealed memory is disastrous, e.g. > p=malloc(4096); > mprotect(p,4096,RO) > mseal(p,4096) > free(p); > > My point is: > I think sealing an object from the heap is a bad pattern in general, > even if the dev doesn't free it. That was one of the reasons for the sealable > flag. I hope saying this isn't perceived as looking for excuses. The point you're missing is that adding MAP_SEALABLE reduces composability. 
With MAP_SEALABLE, everything that mmaps some part of the address space that may ever be sealed will need to be modified to know about MAP_SEALABLE. Say you did the same thing for mprotect. MAP_PROTECT would control the mprotectability of the map. You'd stop: p = malloc(4096); mprotect(p, 4096, PROT_READ); free(p); ! But you'd need to change every spot that mmap()'s something to know about and use MAP_PROTECT: all "producers" of mmap memory would need to know about the consumers doing mprotect(). So now either all mmap() callers mindlessly add MAP_PROTECT out of fear the consumers do mprotect (and you gain nothing from MAP_PROTECT), or the mmap() callers need to know the consumers call mprotect(), and thus you introduce a huge layering violation (and you actually lose from having MAP_PROTECT). Hopefully you can map the above to MAP_SEALABLE. Or to any other m*() operation. For example, if chrome runs on an older glibc that does not know about MAP_SEALABLE, it will not be able to mseal() its own shared libraries' .text (even if, yes, that should ideally be left to ld.so). IMO, UNIX API design has historically mostly been "play stupid games, win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) work. If you close stdout (and don't dup/reopen something to stdout) and printf(), things will break, and you get to keep both pieces. There's no O_CLOSEABLE, just as there's no O_DUPABLE.
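The composability argument can be sketched with today's mprotect(), which needs no opt-in flag at mmap() time. The two function names below are hypothetical stand-ins for unrelated components; the point is only that the producer does not have to know what the consumer will later do with the mapping.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* "Producer": hands out one page and knows nothing about its consumers. */
void *producer_alloc(size_t *len_out)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *len_out = len;
    return p;
}

/*
 * "Consumer": tightens permissions on memory it did not map itself.
 * This works because mmap() has no MAP_PROTECT-style opt-in.
 */
int consumer_make_readonly(void *p, size_t len)
{
    return mprotect(p, len, PROT_READ);
}
```

With a MAP_PROTECT-style flag (or MAP_SEALABLE for mseal()), the call in consumer_make_readonly() would fail unless producer_alloc() had been changed to pass the flag, which is the layering problem described above.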
* Jeff Xu <jeffxu@chromium.org> [240202 12:24]: ... > > Provide code that uses this feature. Please do this too :) > > > > Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and > > 32 VMAs. > > > I will prepare for the benchmark tests. Thank you, please also include runs of calls that you are modifying for checking for mseal() as we are adding loops there. > > > Provide code that tests and checks the failure paths. Failures at the > > start, middle, and end of the modifications. > > > Regarding, "Failures at the start, middle, and end of the modifications." > > With the current implementation, e.g. it checks if the sealing is > applied before actual modification of VMAs, so partial modifications > are avoided in mprotect, mremap, munmap. > > There are test cases in the selftests to cover the failure path, > including the beginning, middle and end of VMAs. > test_seal_unmapped_start > test_seal_unmapped_middle > test_seal_unmapped_end > test_seal_invalid_input > test_seal_start_mprotect > test_seal_end_mprotect > etc. > > Are those what you are looking for ? Those are certainly good, but we need more checking in there. You have a seal_split test that splits the vma by mseal but you don't check the flags on the VMAs. What I'm more concerned about is what happens if you call mseal() on a range and it can mseal a portion. Like, what happens to the first vma in your test_seal_unmapped_middle case? I see it returns an error, but is the first VMA mseal()'ed? (no it's not, but test that) What about the other system calls that will be denied on an mseal() VMA? Do they still behave the same? do_mprotect_pkey() will break out of the loop on the first error it sees - but it has modified some VMAs up to that point, I believe? You have changed this to abort before anything is modified. This is probably acceptable because it won't affect existing applications unless they start using mseal(), but that's just my opinion. 
It would be good to state the change in behaviour because it is changing the fundamental model of changing mprotect/madvise until an issue is hit. I think you are covering this by "it blocks X" but it's doing more than, say, a flag verification. One could reasonably assume this is just another flag verification. > > > Document what happens in those failure paths. I'd like to know how this affects other system calls in the partial success cases/return error cases. Some will now return new error codes and some may change the behaviour. It may even be okay to allow munmap() to split VMAs at the start/end of the region and fail to munmap because some VMA in the middle is mseal()'ed - but maybe not? I haven't put a whole lot of thought into it. Thanks, Liam
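The partial-completion behaviour of mprotect() mentioned above can be observed from user space. The following is a rough sketch for Linux, assuming /dev/zero is available; the struct and function names are made up for this illustration. It maps three pages, unmaps the middle one, calls mprotect() across the whole range, and then probes whether the first page is still writable.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct gap_result {
    int mprotect_errno;      /* errno from mprotect() over the gap, 0 on success */
    int first_page_writable; /* 1 writable, 0 read-only, -1 could not tell */
};

/*
 * Probe writability without faulting: ask the kernel to read() into the
 * page. It fails with EFAULT if the destination is not writable.
 */
static int page_is_writable(void *page)
{
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, page, 1);
    int saved = errno;
    close(fd);
    if (n == 1)
        return 1;
    return (saved == EFAULT) ? 0 : -1;
}

struct gap_result mprotect_across_gap(void)
{
    struct gap_result r = { 0, -1 };
    size_t pg = (size_t)sysconf(_SC_PAGESIZE);
    char *base = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        r.mprotect_errno = errno;
        return r;
    }

    munmap(base + pg, pg); /* layout is now [page0][hole][page2] */

    if (mprotect(base, 3 * pg, PROT_READ) != 0)
        r.mprotect_errno = errno;
    r.first_page_writable = page_is_writable(base);

    munmap(base, pg);
    munmap(base + 2 * pg, pg);
    return r;
}
```

If the kernel modifies VMAs as it walks the range, as described for do_mprotect_pkey() above, the call reports an error while first_page_writable comes back as 0, meaning the first VMA was changed anyway. The exact outcome depends on the kernel version, so the second field should be treated as an observation rather than a guarantee.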
> What I'm more concerned about is what happens if you call mseal() on a > range and it can mseal a portion. Like, what happens to the first vma > in your test_seal_unmapped_middle case? I see it returns an error, but > is the first VMA mseal()'ed? (no it's not, but test that) That is correct, Liam. Unix system calls must be atomic. They either return an error, and that is a promise they made no changes. Or they do the work required, and then return success. In OpenBSD, all mimmutable() aspects were carefully studied to guarantee this behaviour. I am not enough of an expert in the Linux kernel to make the assessment; someone who is qualified must make that assessment. Fuzzing with tests is a good, simpler way to judge it.
On Fri, Feb 2, 2024 at 11:21 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Jeff Xu <jeffxu@chromium.org> [240202 12:24]: > > ... > > > > Provide code that uses this feature. > > Please do this too :) > Yes. Will do. > > > > > > Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and > > > 32 VMAs. > > > > > I will prepare for the benchmark tests. > > Thank you, please also include runs of calls that you are modifying for > checking for mseal() as we are adding loops there. > It will include mmap/mremap/mprotect/munmap. > > > > > Provide code that tests and checks the failure paths. Failures at the > > > start, middle, and end of the modifications. > > > > > Regarding, "Failures at the start, middle, and end of the modifications." > > > > With the current implementation, e.g. it checks if the sealing is > > applied before actual modification of VMAs, so partial modifications > > are avoided in mprotect, mremap, munmap. > > > > There are test cases in the selftests to cover the failure path, > > including the beginning, middle and end of VMAs. > > test_seal_unmapped_start > > test_seal_unmapped_middle > > test_seal_unmapped_end > > test_seal_invalid_input > > test_seal_start_mprotect > > test_seal_end_mprotect > > etc. > > > > Are those what you are looking for ? > > Those are certainly good, but we need more checking in there. You have > a seal_split test that splits the vma by mseal but you don't check the > flags on the VMAs. > I can add the flag check. > What I'm more concerned about is what happens if you call mseal() on a > range and it can mseal a portion. Like, what happens to the first vma > in your test_seal_unmapped_middle case? I see it returns an error, but > is the first VMA mseal()'ed? (no it's not, but test that) > The first VMA is not sealed. That was covered by test_seal_mprotect_two_vma_with_gap. > What about the other system calls that will be denied on an mseal() VMA? 
The other system calls' behavior is kept as is if the memory is not sealed. > Do they still behave the same? do_mprotect_pkey() will break out of the > loop on the first error it sees - but it has modified some VMAs up to > that point, I believe? Yes. The description of do_mprotect_pkey() is correct. > You have changed this to abort before anything > is modified. This is probably acceptable because it won't affect > existing applications unless they start using mseal(), but that's just > my opinion. > To verify this, the tests were also written to run with sealing=false; those tests pass on main (before applying my patch), which confirms that the tests are correct. > It would be good to state the change in behaviour because it is changing > the fundamental model of changing mprotect/madvise until an issue is > hit. I think you are covering this by "it blocks X" but it's doing more > than, say, a flag verification. One could reasonably assume this is > just another flag verification. > Will add more in the documentation. > > > > > Document what happens in those failure paths. > > I'd like to know how this affects other system calls in the partial > success cases/return error cases. Some will now return new error codes > and some may change the behaviour. > For a mapping that is not sealed, everything remains unchanged, including the error handling path. For a mapping that is sealed, EPERM is returned if the sealing check fails, and all of the VMAs remain unchanged. > It may even be okay to allow munmap() to split VMAs at the start/end of > the region and fail to munmap because some VMA in the middle is > mseal()'ed - but maybe not? I haven't put a whole lot of thought into > it. If you are referring to something like the layout below, [unmapped][map1][unmapped][map2][unmapped][map3][unmapped], where map2 is sealed: munmap(start of map1, end of map3) will fail. mmap/mremap/munmap/mprotect on an address range that includes map2 will fail with EPERM, with map1/map2/map3 unchanged. 
Thanks -Jeff > > Thanks, > Liam
On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote: > > Unix system calls must be atomic. > > They either return an error, and that is a promise they made no changes. That's actually not true, and never has been. It's a good thing to aim for, but several errors means "some or all may have been done". EFAULT (for various system calls), ENOMEM and other errors are all things that can happen after some of the system call has already been done, and the rest failed. There are lots of examples, but to pick one obvious VM example, something like mlock() may well return an error after the area has been successfully locked, but then the population of said pages failed for some reason. Of course, implementations can differ, and POSIX sometimes has insane language that is actively incorrect. Furthermore, the definition of "atomic" is unclear. For example, POSIX claims that a "write()" system call is one atomic thing for regular files, and some people think that means that you see all or nothing. That's simply not true, and you'll see the write progress in various indirect ways (look at intermediate file size with 'stat', look at intermediate contents with 'mmap' etc etc). So I agree that atomicity is something that people should always *strive* for, but it's not some kind of final truth or absolute requirement. In the specific case of mseal(), I suspect there are very few reasons ever *not* to be atomic, so in this particular context atomicity is likely always something that should be guaranteed. But I just wanted to point out that it's most definitely not a black-and-white issue in the general case. Linus
On Fri, Feb 2, 2024 at 12:37 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote: > > > > Unix system calls must be atomic. > > > > They either return an error, and that is a promise they made no changes. > > That's actually not true, and never has been. > > It's a good thing to aim for, but several errors means "some or all > may have been done". > > EFAULT (for various system calls), ENOMEM and other errors are all > things that can happen after some of the system call has already been > done, and the rest failed. > > There are lots of examples, but to pick one obvious VM example, > something like mlock() may well return an error after the area has > been successfully locked, but then the population of said pages failed > for some reason. > > Of course, implementations can differ, and POSIX sometimes has insane > language that is actively incorrect. > > Furthermore, the definition of "atomic" is unclear. For example, POSIX > claims that a "write()" system call is one atomic thing for regular > files, and some people think that means that you see all or nothing. > That's simply not true, and you'll see the write progress in various > indirect ways (look at intermediate file size with 'stat', look at > intermediate contents with 'mmap' etc etc). > > So I agree that atomicity is something that people should always > *strive* for, but it's not some kind of final truth or absolute > requirement. > > In the specific case of mseal(), I suspect there are very few reasons > ever *not* to be atomic, so in this particular context atomicity is > likely always something that should be guaranteed. But I just wanted > to point out that it's most definitely not a black-and-white issue in > the general case. > Thanks. At least I got this part done right for mseal() :-) -Jeff > Linus >
On Fri, Feb 2, 2024 at 9:05 AM Theo de Raadt <deraadt@openbsd.org> wrote: > > Another interaction to consider is sigaltstack(). > > In OpenBSD, sigaltstack() forces MAP_STACK onto the specified > (pre-allocated) region, because on kernel-entry we require the "sp" > register to point to a MAP_STACK region (this severely damages ROP pivot > methods). Linux does not have MAP_STACK enforcement (yet), but one day > someone may try to do that work. > > This interacted poorly with mimmutable() because some applications > allocate the memory being provided poorly. I won't get into the details > unless pushed, because what we found makes me upset. Over the years, > we've upstreamed diffs to applications to resolve all the nasty > allocation patterns. I think the software ecosystem is now mostly > clean. > > I suggest someone in Linux look into whether sigaltstack() is a mseal() > bypass, perhaps somewhat similar to madvise MADV_FREE, and consider the > correct strategy. > Thanks for bringing this up. I will follow up on sigaltstack() in Linux. > This is our documented strategy: > > On OpenBSD some additional restrictions prevent dangerous address space > modifications. The proposed space at ss_sp is verified to be > contiguously mapped for read-write permissions (no execute) and incapable > of syscall entry (see msyscall(2)). If those conditions are met, a page- > aligned inner region will be freshly mapped (all zero) with MAP_STACK > (see mmap(2)), destroying the pre-existing data in the region. Once the > sigaltstack is disabled, the MAP_STACK attribute remains on the memory, > so it is best to deallocate the memory via a method that results in > munmap(2). > > OK, I better provide the details of what people were doing. > sigaltstacks() in .data, in .bss, using malloc(), on a buffer on the > stack, we even found one creating a sigaltstack inside a buffer on a > pthread stack. 
We told everyone to use mmap() and munmap(), with MAP_STACK > if #ifdef MAP_STACK finds a definition. >
* Linus Torvalds <torvalds@linux-foundation.org> [240202 15:37]: > On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote: > > > > Unix system calls must be atomic. > > > > They either return an error, and that is a promise they made no changes. > > That's actually not true, and never has been. ... > > In the specific case of mseal(), I suspect there are very few reasons > ever *not* to be atomic, so in this particular context atomicity is > likely always something that should be guaranteed. But I just wanted > to point out that it's most definitely not a black-and-white issue in > the general case. There will be a larger performance cost to checking up front without allowing the partial completion. I don't expect these to be high, but it's something to keep in mind if we are okay with the flexibility and less atomic operation. Thanks, Liam
On Fri, Feb 2, 2024 at 10:52 AM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote: > > > > On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote: > > > > > > Jeff Xu <jeffxu@chromium.org> wrote: > > > > > > > Even without free. > > > > I personally do not like the heap getting sealed like that. > > > > > > > > Component A. > > > > p=malloc(4096); > > > > writing something to p. > > > > > > > > Component B: > > > > mprotect(p,4096, RO) > > > > mseal(p,4096) > > > > > > > > This will split the heap VMA, and prevent the heap from shrinking, if > > > > this is in a frequent code path, then it might hurt the process's > > > > memory usage. > > > > > > > > The existing code is more likely to use malloc than mmap(), so it is > > > > easier for dev to seal a piece of data belonging to another component. > > > > I hope this pattern is not widespread. > > > > > > > > The ideal way will be just changing the library A to use mmap. > > > > > > I think you are lacking some test programs to see how it actually > > > behaves; the effect is worse than you think, and the impact is immediately > > > visible to the programmer, and the lesson is clear: > > > > > > you can only seal objects which you guarantee never get recycled. > > > > > > Pushing a sealed object back into reuse is a disastrous bug. > > > > > > No one should call this interface, unless they understand that. > > > > > > I'll say again, you don't have a test program for various allocators to > > > understand how it behaves. The failure modes described in your documents > > > are not correct. > > > > > I understand what you mean. I will add that part to the document: > > Trying to recycle sealed memory is disastrous, e.g. > > p=malloc(4096); > > mprotect(p,4096,RO) > > mseal(p,4096) > > free(p); > > > > My point is: > > I think sealing an object from the heap is a bad pattern in general, > > even if the dev doesn't free it. 
That was one of the reasons for the sealable > > flag. I hope saying this isn't perceived as looking for excuses. > > The point you're missing is that adding MAP_SEALABLE reduces > composability. With MAP_SEALABLE, everything that mmaps some part of > the address space that may ever be sealed will need to be modified to > know about MAP_SEALABLE. > > Say you did the same thing for mprotect. MAP_PROTECT would control the > mprotectability of the map. You'd stop: > > p = malloc(4096); > mprotect(p, 4096, PROT_READ); > free(p); > > ! But you'd need to change every spot that mmap()'s something to know > about and use MAP_PROTECT: all "producers" of mmap memory would need > to know about the consumers doing mprotect(). So now either all mmap() > callers mindlessly add MAP_PROTECT out of fear the consumers do > mprotect (and you gain nothing from MAP_PROTECT), or the mmap() > callers need to know the consumers call mprotect(), and thus you > introduce a huge layering violation (and you actually lose from having > MAP_PROTECT). > > Hopefully you can map the above to MAP_SEALABLE. Or to any other m*() > operation. For example, if chrome runs on an older glibc that does not > know about MAP_SEALABLE, it will not be able to mseal() its own shared > libraries' .text (even if, yes, that should ideally be left to ld.so). > I think I have heard enough complaints about MAP_SEALABLE from Linux developers and Linus in the last two days to convince myself that it is a bad idea :) For the last time: I was trying to limit the scope of mseal() to two known cases. And MAP_SEALABLE is a reversible decision; a sysctl can turn it off, or we can obsolete it in the future (this was mentioned in the document of V8). I will rest my case. Obviously from the feedback, it is loud and clear that we want to be able to seal all the memory. 
> IMO, UNIX API design has historically mostly been "play stupid games, > win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) > work. If you close stdout (and don't dup/reopen something to stdout) > and printf(), things will break, and you get to keep both pieces. > There's no O_CLOSEABLE, just as there's no O_DUPABLE. > > -- > Pedro
On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > There will be a larger performance cost to checking up front without > allowing the partial completion. I suspect that for mseal(), the only half-way common case will be sealing an area that is entirely contained within one vma. So the cost will be the vma splitting (if it's not the whole vma), and very unlikely to be any kind of "walk the vma's to check that they can all be sealed" loop up-front. We'll see, but that's my gut feel, at least. Linus
* Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]: > On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > There will be a larger performance cost to checking up front without > > allowing the partial completion. > > I suspect that for mseal(), the only half-way common case will be > sealing an area that is entirely contained within one vma. Agreed. > > So the cost will be the vma splitting (if it's not the whole vma), and > very unlikely to be any kind of "walk the vma's to check that they can > all be sealed" loop up-front. That's the cost of calling mseal(), and I think that will be totally reasonable. I'm more concerned with the other calls that do affect more than one vma that will now have to ensure there is not an mseal'ed vma among the affected area. As you pointed out, we don't do atomic updates and so we have to add a loop at the beginning to check this new special case, which is what this patch set does today. That means we're going to be looping through twice for any call that could fail if one is mseal'ed. This includes munmap() and mprotect(). The impact will vary based on how many vma's are handled. I'd like some numbers on this so we can see if it is a concern, which Jeff has agreed to provide in the future - Thank you, Jeff. It also means we're modifying the behaviour of those calls so they could fail before anything changes (regardless of where the failure would occur), and we could still fail later when another aspect of a vma would cause a failure as we do today. We are paying the price for a more atomic update, but we aren't trying very hard to be atomic with our updates - we don't have many (virtually no) vma checks before modifications start. For instance, we could move the mprotect check for map_deny_write_exec() to the pre-update loop to make it more atomic in nature. This one seems somewhat related to mseal, so it would be better if they were both checked atomic(ish) together. 
Although, I wonder if the user visible changes would be acceptable and worth the risk. We will have two classes of updates to vma's: the more atomic view and the legacy view. The question of what happens when the two mix, or where a specific check should go will get (more) confusing. Thanks, Liam
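The difference between the two update styles discussed above can be shown with a toy model. This is not kernel code: struct toy_vma and both functions are invented for illustration, and the deny field stands in for any per-VMA condition that makes the update fail (a seal, in the patch set).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VMA; not the kernel's vm_area_struct. */
struct toy_vma {
    bool deny; /* update must be refused for this entry */
    int prot;  /* current protection bits */
};

/* Legacy view: modify while walking, stop at the first error. */
int toy_update_legacy(struct toy_vma *v, size_t n, int prot)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i].deny)
            return -1; /* entries before i have already been changed */
        v[i].prot = prot;
    }
    return 0;
}

/* More atomic view: one extra walk to check, then a walk to modify. */
int toy_update_checked(struct toy_vma *v, size_t n, int prot)
{
    for (size_t i = 0; i < n; i++)
        if (v[i].deny)
            return -1; /* nothing has been changed */
    for (size_t i = 0; i < n; i++)
        v[i].prot = prot;
    return 0;
}
```

The checked variant always walks the range twice on success, which is the extra cost being asked about; in exchange, a failure leaves every entry untouched. Any check that is not moved into the first loop can still fail halfway through the second one, which is the mixed behaviour the message above worries about.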
... > IMO, UNIX API design has historically mostly been "play stupid games, > win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) > work. If you close stdout (and don't dup/reopen something to stdout) > and printf(), things will break, and you get to keep both pieces. That is pretty much why libraries must never use printf(). (Try telling that to people at work!) In the days when processes could only have 20 files open it was a much bigger problem. You couldn't afford to not use 0, 1 and 2. A certain daemon ended up using fd 1 as a pipe to another daemon. Someone accidentally used printf() instead of fprintf() for a trace. When the 10k stdio buffer filled the text got written to the pipe. The expected fixed size message had a 32bit 'trailer' size. Although no defined messages supported trailers the second daemon synchronously discarded the trailer - with the expected side effect. Wasn't my bug, and someone else found it, but I'd read the broken code a few times without seeing the fubar. Trouble is it all worked for quite a long time... David
On Fri, Feb 2, 2024 at 8:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]: > > On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > There will be a larger performance cost to checking up front without > > > allowing the partial completion. > > > > I suspect that for mseal(), the only half-way common case will be > > sealing an area that is entirely contained within one vma. > > Agreed. > > > > > So the cost will be the vma splitting (if it's not the whole vma), and > > very unlikely to be any kind of "walk the vma's to check that they can > > all be sealed" loop up-front. > > That's the cost of calling mseal(), and I think that will be totally > reasonable. > > I'm more concerned with the other calls that do affect more than one vma > that will now have to ensure there is not an mseal'ed vma among the > affected area. > > As you pointed out, we don't do atomic updates and so we have to add a > loop at the beginning to check this new special case, which is what this > patch set does today. That means we're going to be looping through > twice for any call that could fail if one is mseal'ed. This includes > munmap() and mprotect(). > > The impact will vary based on how many vma's are handled. I'd like some > numbers on this so we can see if it is a concern, which Jeff has agreed > to provide in the future - Thank you, Jeff. Yes please. The additional walk Liam points to seems to be happening even if we don't use mseal at all. Android apps often create thousands of VMAs, so a small regression to a syscall like mprotect might cause a very visible regression to app launch times (one of the key metrics for Android). Having performance impact numbers here would be very helpful. 
> > It also means we're modifying the behaviour of those calls so they could > fail before anything changes (regardless of where the failure would > occur), and we could still fail later when another aspect of a vma would > cause a failure as we do today. We are paying the price for a more > atomic update, but we aren't trying very hard to be atomic with our > updates - we don't have many (virtually no) vma checks before > modifications start. > > For instance, we could move the mprotect check for map_deny_write_exec() > to the pre-update loop to make it more atomic in nature. This one seems > somewhat related to mseal, so it would be better if they were both > checked atomic(ish) together. Although, I wonder if the user visible > changes would be acceptable and worth the risk. > > We will have two classes of updates to vma's: the more atomic view and > the legacy view. The question of what happens when the two mix, or > where a specific check should go will get (more) confusing. > > Thanks, > Liam >
From: Jeff Xu <jeffxu@chromium.org>

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case.

Two system calls are involved in sealing the map: mmap() and mseal().

The new mseal() is a syscall on 64-bit CPUs, with the following signature:

int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks the following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size,
   via munmap() and mremap(). These can leave an empty space, which
   can then be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location,
   via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
   risks to sealed VMAs. It is included anyway because the use case is
   unclear. In any case, users can rely on merging to expand a sealed
   VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for
   anonymous memory, when users don't have write permission to the
   memory. Those behaviors can alter region contents by discarding
   pages, effectively a memset(0) for anonymous memory.

In addition, mmap() has two related changes.

The PROT_SEAL bit in the prot field of mmap(). When present, it marks
the map as sealed from creation.

The MAP_SEALABLE bit in the flags field of mmap(). When present, it
marks the map as sealable. A map created without MAP_SEALABLE will not
support sealing, i.e. mseal() will fail.

Applications that don't care about sealing will expect their behavior
to be unchanged. Those that need sealing support opt in by adding
MAP_SEALABLE in mmap().

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings is
tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are
managed by different allocators.
The memory is mapped RW- and RWX respectively but write access to it
is restricted using pkeys (or in the future ARM permission overlay
extensions). The lifetime of those mappings is not tied to the
lifetime of the process; therefore, while the memory is sealed, the
allocators still need to free or discard the unused memory, for
example with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example, if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.

Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
  destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Pedro Falcato: suggesting sealing in the mmap().
Theo de Raadt: sharing the experiences and insight gained from
  implementing mimmutable() in OpenBSD.

Change history:
===============
V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address)
- Update mseal.rst to add note for MAP_SEALABLE.

V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.org/T/

V6:
- Drop RFC from subject, Given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024)
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/

V5:
- fix build issue in mseal-Wire-up-mseal-syscall
  (Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r

V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/

V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
  MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
  both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
  destructive operations of madvise. (Suggested by Jann Horn and
  Stephen Röttger)
- Make sealed VMAs mergeable.
  (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.org/

v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/

v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/

----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/

Jeff Xu (4):
  mseal: Wire up mseal syscall
  mseal: add mseal syscall selftest
  mm/mseal memory sealing
  mseal:add documentation

 Documentation/userspace-api/index.rst       |    1 +
 Documentation/userspace-api/mseal.rst       |  215 ++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/mman-common.h      |    8 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 kernel/sys_ni.c                             |    1 +
 mm/Makefile                                 |    4 +
 mm/internal.h                               |   48 +
 mm/madvise.c                                |   12 +
 mm/mmap.c                                   |   35 +-
 mm/mprotect.c                               |   10 +
 mm/mremap.c                                 |   31 +
 mm/mseal.c                                  |  343 ++++
 tools/testing/selftests/mm/.gitignore       |    1 +
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 2024 +++++++++++++++++++
 33 files changed, 2756 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/userspace-api/mseal.rst
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c
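
As a footnote to the cover letter's point that destructive madvise() behaviors are "effectively a memset(0) for anonymous memory", the following is a minimal user-space sketch, assuming a Linux system and a private anonymous mapping. It does not call mseal() at all; it only shows the existing madvise(MADV_DONTNEED) behavior that the sealing check is meant to gate. The function name dontneed_zeroes_anon_page() is made up for this illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one private anonymous page with 0xAA, discard it with
 * MADV_DONTNEED, then read it back.
 * Returns 1 if the page now reads as all zeros, 0 if any byte
 * survived, and -1 if mmap() or madvise() failed.
 */
int dontneed_zeroes_anon_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);
	size_t len = (size_t)page;

	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xAA, len);

	/* Discard the page; the mapping itself stays in place. */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* The next access faults in a fresh zero-filled page. */
	int zeroed = 1;
	for (size_t i = 0; i < len; i++) {
		if (p[i] != 0) {
			zeroed = 0;
			break;
		}
	}

	munmap(p, len);
	return zeroed;
}
```

Note that the discard succeeds without any write to the page by the caller after the memset(), which is why the series checks write permission before allowing it on sealed anonymous memory.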