Message ID | 20190424203408.GA11386@beast (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] binfmt_elf: Update READ_IMPLIES_EXEC logic for modern CPUs |
On Wed, Apr 24, 2019 at 01:34:08PM -0700, Kees Cook wrote:
> The READ_IMPLIES_EXEC work-around was designed for old CPUs lacking NX
> (to have the visible permission flags on memory regions reflect reality:
> they are all executable), and for old toolchains that lacked the ELF
> PT_GNU_STACK marking (under the assumption that toolchains that couldn't
> even specify memory protection flags may have it wrong for all memory
> regions).
>
> This logic is sensible, but was implemented in a way that equated having
> a PT_GNU_STACK marked executable as being as "broken" as lacking the
> PT_GNU_STACK marking entirely. This is not a reasonable assumption
> for CPUs that have had NX support from the start (or very close to
> the start). This confusion has led to situations where modern 64-bit
> programs with explicitly marked executable stack are forced into the
> READ_IMPLIES_EXEC state when no such thing is needed. (And leads to
> unexpected failures when mmap()ing regions of device driver memory that
> wish to disallow VM_EXEC[1].)
>
> To fix this, elf_read_implies_exec() is adjusted on arm64 (where NX has
> always existed and toolchains have implemented PT_GNU_STACK for a while),
> and x86 is adjusted to handle this combination of possible outcomes:
>
>              CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
> ELF:              |           |                  |                  |
> ------------------------------|------------------|------------------|
> missing GNU_STACK | needs RIE | needs RIE        | no RIE           |
> GNU_STACK == RWX  | needs RIE | no RIE: stack X  | no RIE: stack X  |
> GNU_STACK == RW   | needs RIE | no RIE: stack NX | no RIE: stack NX |
>
> This has the effect of making binfmt_elf's EXSTACK_DEFAULT actually take
> on the correct architecture default of being non-executable on arm64 and
> x86_64, and being executable on ia32.
>
> [1] https://lkml.kernel.org/r/20190418055759.GA3155@mellanox.com
>
> Suggested-by: Hector Marco-Gisbert <hecmargi@upv.es>
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
> v2: adjust arm64 to avoid is_compat_task() (marc.w.gonzalez)
> ---
>  arch/arm64/include/asm/elf.h |  8 +++++++-
>  arch/x86/include/asm/elf.h   | 24 +++++++++++++++++++++---
>  2 files changed, 28 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
> index 6adc1a90e7e6..f1bb4b388b8f 100644
> --- a/arch/arm64/include/asm/elf.h
> +++ b/arch/arm64/include/asm/elf.h
> @@ -107,7 +107,13 @@
>   */
>  #define elf_check_arch(x) ((x)->e_machine == EM_AARCH64)
>
> -#define elf_read_implies_exec(ex,stk) (stk != EXSTACK_DISABLE_X)
> +/*
> + * 64-bit processes should not automatically gain READ_IMPLIES_EXEC. Only
> + * 32-bit processes without PT_GNU_STACK should trigger READ_IMPLIES_EXEC
> + * out of an abundance of caution against ancient toolchains not knowing
> + * how to mark memory protection flags correctly.
> + */
> +#define compat_elf_read_implies_exec(ex, stk) (stk == EXSTACK_DEFAULT)

Don't you need to hack fs/compat_binfmt_elf.c to pick this up, or am I
missing some trick? Should just be something like below.

Will

--->8

diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
index 15f6e96b3bd9..694bc3ee77eb 100644
--- a/fs/compat_binfmt_elf.c
+++ b/fs/compat_binfmt_elf.c
@@ -116,6 +116,11 @@
 #define arch_setup_additional_pages compat_arch_setup_additional_pages
 #endif

+#ifdef compat_elf_read_implies_exec
+#undef elf_read_implies_exec
+#define elf_read_implies_exec compat_elf_read_implies_exec
+#endif
+
 /*
  * Rename a few of the symbols that binfmt_elf.c will define.
  * These are all local so the names don't really matter, but it
On Wed, Apr 24, 2019 at 1:51 PM Will Deacon <will.deacon@arm.com> wrote:
> Don't you need to hack fs/compat_binfmt_elf.c to pick this up, or am I
> missing some trick? Should just be something like below.
>
> Will
>
> --->8
>
> diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
> index 15f6e96b3bd9..694bc3ee77eb 100644
> --- a/fs/compat_binfmt_elf.c
> +++ b/fs/compat_binfmt_elf.c
> @@ -116,6 +116,11 @@
>  #define arch_setup_additional_pages compat_arch_setup_additional_pages
>  #endif
>
> +#ifdef compat_elf_read_implies_exec
> +#undef elf_read_implies_exec
> +#define elf_read_implies_exec compat_elf_read_implies_exec
> +#endif
> +
>  /*
>   * Rename a few of the symbols that binfmt_elf.c will define.
>   * These are all local so the names don't really matter, but it

Argh. I thought I already saw stuff like this somewhere, but I think I
must have been looking at some other compat silliness. I'll fix this
and split up the series...
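The override discussed above is plain preprocessor redefinition: once the compat file swaps the macro name, every later call site expands to the compat variant. A minimal stand-alone sketch of that pattern follows. It is not kernel code: the EXSTACK_* values are assumed to mirror the kernel's constants, the native definition as a constant 0 is an assumption standing in for whatever the 64-bit loader uses, and `native_rie()` and `compat_rie()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's EXSTACK_* constants. */
#define EXSTACK_DEFAULT   0 /* no PT_GNU_STACK header present */
#define EXSTACK_DISABLE_X 1 /* PT_GNU_STACK present, stack not executable */
#define EXSTACK_ENABLE_X  2 /* PT_GNU_STACK present, stack executable */

/* Assumption: the native (64-bit) loader never requests READ_IMPLIES_EXEC. */
#define elf_read_implies_exec(ex, stk) 0

/* From the quoted arm64 hunk: only a missing PT_GNU_STACK triggers RIE. */
#define compat_elf_read_implies_exec(ex, stk) ((stk) == EXSTACK_DEFAULT)

/* Hypothetical helper: the call site as the native loader would see it. */
bool native_rie(int stk)
{
    assert(stk >= EXSTACK_DEFAULT && stk <= EXSTACK_ENABLE_X);
    return elf_read_implies_exec(0, stk);
}

/* The override pattern from the proposed fs/compat_binfmt_elf.c hunk. */
#ifdef compat_elf_read_implies_exec
#undef elf_read_implies_exec
#define elf_read_implies_exec compat_elf_read_implies_exec
#endif

/* Hypothetical helper: same call site, now expanding to the compat macro. */
bool compat_rie(int stk)
{
    assert(stk >= EXSTACK_DEFAULT && stk <= EXSTACK_ENABLE_X);
    return elf_read_implies_exec(0, stk);
}
```

Without the `#ifdef` block, `compat_rie()` would silently expand to the native definition, which is the gap the reply points out.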
On Wed, Apr 24, 2019 at 1:54 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Wed, Apr 24, 2019 at 1:51 PM Will Deacon <will.deacon@arm.com> wrote:
> > Don't you need to hack fs/compat_binfmt_elf.c to pick this up, or am I
> > missing some trick? Should just be something like below.
> >
> > Will
> >
> > --->8
> >
> > diff --git a/fs/compat_binfmt_elf.c b/fs/compat_binfmt_elf.c
> > index 15f6e96b3bd9..694bc3ee77eb 100644
> > --- a/fs/compat_binfmt_elf.c
> > +++ b/fs/compat_binfmt_elf.c
> > @@ -116,6 +116,11 @@
> >  #define arch_setup_additional_pages compat_arch_setup_additional_pages
> >  #endif
> >
> > +#ifdef compat_elf_read_implies_exec
> > +#undef elf_read_implies_exec
> > +#define elf_read_implies_exec compat_elf_read_implies_exec
> > +#endif
> > +
> >  /*
> >   * Rename a few of the symbols that binfmt_elf.c will define.
> >   * These are all local so the names don't really matter, but it
>
> Argh. I thought I already saw stuff like this somewhere, but I think I
> must have been looking at some other compat silliness. I'll fix this
> and split up the series...

Andrew, can you please drop this patch from -mm for now? I'll pursue
these changes separately through x86 and arm64 trees.

Thanks!
* Kees Cook <keescook@chromium.org> wrote:

> The READ_IMPLIES_EXEC work-around was designed for old CPUs lacking NX
> (to have the visible permission flags on memory regions reflect reality:
> they are all executable), and for old toolchains that lacked the ELF
> PT_GNU_STACK marking (under the assumption that toolchains that couldn't
> even specify memory protection flags may have it wrong for all memory
> regions).
>
> This logic is sensible, but was implemented in a way that equated having
> a PT_GNU_STACK marked executable as being as "broken" as lacking the
> PT_GNU_STACK marking entirely. This is not a reasonable assumption
> for CPUs that have had NX support from the start (or very close to
> the start). This confusion has led to situations where modern 64-bit
> programs with explicitly marked executable stack are forced into the
> READ_IMPLIES_EXEC state when no such thing is needed. (And leads to
> unexpected failures when mmap()ing regions of device driver memory that
> wish to disallow VM_EXEC[1].)
>
> To fix this, elf_read_implies_exec() is adjusted on arm64 (where NX has
> always existed and toolchains have implemented PT_GNU_STACK for a while),
> and x86 is adjusted to handle this combination of possible outcomes:
>
>              CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
> ELF:              |           |                  |                  |
> ------------------------------|------------------|------------------|
> missing GNU_STACK | needs RIE | needs RIE        | no RIE           |
> GNU_STACK == RWX  | needs RIE | no RIE: stack X  | no RIE: stack X  |
> GNU_STACK == RW   | needs RIE | no RIE: stack NX | no RIE: stack NX |
>
> This has the effect of making binfmt_elf's EXSTACK_DEFAULT actually take
> on the correct architecture default of being non-executable on arm64 and
> x86_64, and being executable on ia32.

Just to make clear, is the change from the old behavior, in essence:

               CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
  ELF:              |           |                  |                  |
  ------------------------------|------------------|------------------|
  missing GNU_STACK | exec-all  | exec-all         | exec-none        |
- GNU_STACK == RWX  | exec-all  | exec-all         | exec-all         |
+ GNU_STACK == RWX  | exec-all  | exec-stack       | exec-stack       |
  GNU_STACK == RW   | exec-all  | exec-none        | exec-none        |

correct?

Also note that in addition to marking the changes clearly, I also edited
the table to be less confusing and more assertive:

  'exec-all'  : all user mappings are executable
  'exec-none' : only PROT_EXEC user mappings are executable
  'exec-stack': only the stack and PROT_EXEC user mappings are executable

Is this correct? (Please double check the edited table.)

In particular, what is the policy for write-only and exec-only mappings,
what does read-implies-exec do for them?

Also, it would be nice to define it precisely what 'stack' means in this
context: it's only the ELF loader defined process stack - other stacks
such as any thread stacks, signal stacks or alt-stacks depend on the C
library - or does the kernel policy extend there too?

I.e. it would be nice to clarify all this, because it's still rather
confusing and ambiguous right now.

Thanks,

	Ingo
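One way to read the table above is as a lookup from (CPU, ELF marking) to a policy. The sketch below encodes the table exactly as stated in this message, with the "-" row as the old behavior and the "+" row as the proposed one. The enum and function names are hypothetical, chosen only for illustration, and the "lacks NX" column is taken at face value here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the table's rows and cell values. */
enum gnu_stack { GNU_STACK_MISSING, GNU_STACK_RWX, GNU_STACK_RW };
enum exec_policy { EXEC_ALL, EXEC_STACK, EXEC_NONE };

/* Proposed behavior: the unmarked rows plus the "+" row of the table. */
enum exec_policy proposed_policy(bool cpu_has_nx, bool is_x86_64,
                                 enum gnu_stack gs)
{
    if (!cpu_has_nx)
        return EXEC_ALL;                 /* "lacks NX" column */

    switch (gs) {
    case GNU_STACK_MISSING:
        return is_x86_64 ? EXEC_NONE : EXEC_ALL;
    case GNU_STACK_RWX:
        return EXEC_STACK;
    case GNU_STACK_RW:
        return EXEC_NONE;
    }
    assert(!"unknown gnu_stack value");
    return EXEC_NONE;
}

/* Old behavior: identical except that GNU_STACK == RWX meant exec-all. */
enum exec_policy old_policy(bool cpu_has_nx, bool is_x86_64,
                            enum gnu_stack gs)
{
    if (gs == GNU_STACK_RWX)
        return EXEC_ALL;                 /* the "-" row */
    return proposed_policy(cpu_has_nx, is_x86_64, gs);
}
```

Comparing `old_policy()` and `proposed_policy()` over all inputs shows that only the two "has NX" cells of the GNU_STACK == RWX row differ, which matches the single "-"/"+" pair in the table.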
On Wed, Apr 24, 2019 at 10:42 PM Ingo Molnar <mingo@kernel.org> wrote:
> Just to make clear, is the change from the old behavior, in essence:
>
>
>                CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
>   ELF:              |           |                  |                  |
>   ------------------------------|------------------|------------------|
>   missing GNU_STACK | exec-all  | exec-all         | exec-none        |
> - GNU_STACK == RWX  | exec-all  | exec-all         | exec-all         |
> + GNU_STACK == RWX  | exec-all  | exec-stack       | exec-stack       |
>   GNU_STACK == RW   | exec-all  | exec-none        | exec-none        |
> [...]
>   'exec-all'  : all user mappings are executable

For extreme clarity, this should be:

  'exec-all'  : all PROT_READ user mappings are executable, except when
                backed by files on a noexec-filesystem.

>   'exec-none' : only PROT_EXEC user mappings are executable
>   'exec-stack': only the stack and PROT_EXEC user mappings are executable

Thanks for helping clarify this. I spent last evening trying to figure
out a better way to explain/illustrate this series; my prior patch
combines too many things into a single change. One thing I noticed is
the "lacks NX" column is wrong: for "lack NX", our current state is
"don't care". If we _add_ RIE for the "lacks NX" case unconditionally,
we may cause unexpected problems[1]. More on this below...

But yes, your above diff for "has NX" is roughly correct. I'll walk
through each piece I'm thinking about. Here is the current state:

               CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
  ELF:              |            |                  |                |
  -------------------------------|------------------|----------------|
  missing GNU_STACK | exec-all   | exec-all         | exec-all       |
  GNU_STACK == RWX  | exec-all   | exec-all         | exec-all       |
  GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |

*this column has no architecture effect: NX markings are ignored by
hardware, but may have behavioral effects when "wants X" collides with
"cannot be X" constraints in memory permission flags, as in [1].


I want to make three changes, listed in increasing risk levels.

First, I want to split "missing GNU_STACK" and "GNU_STACK == RWX",
which is currently causing unexpected behavior for driver mmap
regions[1], etc:

               CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
  ELF:              |            |                  |                |
  -------------------------------|------------------|----------------|
  missing GNU_STACK | exec-all   | exec-all         | exec-all       |
- GNU_STACK == RWX  | exec-all   | exec-all         | exec-all       |
+ GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
  GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |

AFAICT, this has the least risk. I'm not aware of any situation where
GNU_STACK==RWX is supposed to mean MORE than that. As Jann researched,
even thread stacks will be treated correctly[2]. The risk would be
discovering some use-case where a program was executing memory that it
had not explicitly marked as executable. For ELFs marked with
GNU_STACK, this seems unlikely (I hope).


Second, I want to split the behavior of "missing GNU_STACK" between
ia32 and x86_64. The reasonable(?) default for x86_64 memory is for it
to be NX. For the very rare x86_64 systems that do not have NX, this
shouldn't change anything because they still fall into the "don't
care" column. It would look like this:

               CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
  ELF:              |            |                  |                |
  -------------------------------|------------------|----------------|
- missing GNU_STACK | exec-all   | exec-all         | exec-all       |
+ missing GNU_STACK | exec-all   | exec-all         | exec-none      |
  GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
  GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |

This carries some risk that there are ancient x86_64 binaries that
still behave like their even more ancient ia32 counterparts, and
expect to be able to execute any memory. I would _hope_ this is rare,
but I have no way to actually know if things like this exist in the
real world.


Third, I want to have the "lacks NX" column actually reflect reality.
Right now on such a system, memory permissions will show "not
executable" but there is actually no architectural checking for these
permissions. I think the true nature of such a system should be
reflected in the reported permissions. It would look like this:

               CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
  ELF:              |            |                  |                |
  -------------------------------|------------------|----------------|
  missing GNU_STACK | exec-all   | exec-all         | exec-none      |
- GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
- GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
+ GNU_STACK == RWX  | exec-all   | exec-stack       | exec-stack     |
+ GNU_STACK == RW   | exec-all   | exec-none        | exec-none      |

This carries the largest risk because it effectively enables
READ_IMPLIES_EXEC on all processes for such systems. I worry this
might trip as-yet-unseen problems like in [1], for only cosmetic
improvements.

My intention was to split up the series and likely not even bother
with the third change, since it feels like too high a risk to me. What
do you think?

> In particular, what is the policy for write-only and exec-only mappings,
> what does read-implies-exec do for them?

First it manifests here, which is used for stack and brk:

#define VM_DATA_DEFAULT_FLAGS \
        (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0 ) | \
         VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)

above is used in do_brk_flags(), and is picked up by
VM_STACK_DEFAULT_FLAGS, visible in VM_STACK_FLAGS for
setup_arg_pages()'s stack creation.

READ_IMPLIES_EXEC itself is checked directly in mmap, with noexec
checks that also clear VM_MAYEXEC:

        if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
                if (!(file && path_noexec(&file->f_path)))
                        prot |= PROT_EXEC;
...
                        if (path_noexec(&file->f_path)) {
                                if (vm_flags & VM_EXEC)
                                        return -EPERM;
                                vm_flags &= ~VM_MAYEXEC;

The above is where we discussed adding some kind of check for device
driver memory mapping in [1] (or getting distros to mount /dev noexec,
which seems to break other things...), but I'd rather just fix
READ_IMPLIES_EXEC.

Write-only would ignore READ_IMPLIES_EXEC, but mprotect() rechecks it
if PROT_READ gets added later:

        const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
                                (prot & PROT_READ);
...
                /* Does the application expect PROT_READ to imply PROT_EXEC */
                if (rier && (vma->vm_flags & VM_MAYEXEC))
                        prot |= PROT_EXEC;

> Also, it would be nice to define it precisely what 'stack' means in this
> context: it's only the ELF loader defined process stack - other stacks
> such as any thread stacks, signal stacks or alt-stacks depend on the C
> library - or does the kernel policy extend there too?

Correct: this is only the ELF loader stack. Thread stacks are (and
always have been) on their own. But as Jann found in [2], they should
be unchanged by anything here.

> I.e. it would be nice to clarify all this, because it's still rather
> confusing and ambiguous right now.

Agreed. I've been trying to pick it apart too, hopefully this helps.

-Kees

[1] https://lkml.kernel.org/r/20190418055759.GA3155@mellanox.com
[2] https://lore.kernel.org/patchwork/patch/464875/
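The mmap() and mprotect() fragments quoted in the message above can be modelled in user space to see how READ_IMPLIES_EXEC treats read-only, write-only and noexec-backed mappings. This is a simplified sketch under stated assumptions, not the kernel's code: `model_mmap_prot()` and `model_mprotect_prot()` are hypothetical names, the personality bit and the noexec mount state are passed in as plain booleans, and VM_MAYEXEC is reduced to a single flag.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE, PROT_EXEC */

/*
 * Hypothetical model of the mmap() path: returns the effective protection
 * bits, or -EPERM when an executable mapping is requested for a file that
 * lives on a noexec mount.
 */
int model_mmap_prot(int prot, bool read_implies_exec,
                    bool has_file, bool file_noexec)
{
    assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);

    /* PROT_READ implies PROT_EXEC unless the backing file is noexec. */
    if ((prot & PROT_READ) && read_implies_exec)
        if (!(has_file && file_noexec))
            prot |= PROT_EXEC;

    /* An explicit PROT_EXEC request on a noexec file is refused. */
    if (has_file && file_noexec && (prot & PROT_EXEC))
        return -EPERM;

    return prot;
}

/*
 * Hypothetical model of the mprotect() recheck: PROT_EXEC is added only
 * when PROT_READ is being requested and the mapping may become executable.
 */
int model_mprotect_prot(int prot, bool read_implies_exec, bool vma_may_exec)
{
    const bool rier = read_implies_exec && (prot & PROT_READ);

    if (rier && vma_may_exec)
        prot |= PROT_EXEC;
    return prot;
}
```

In this model a write-only request passes through unchanged, and it only gains PROT_EXEC if a later `model_mprotect_prot()` call adds PROT_READ, which mirrors the description of the write-only case.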
* Kees Cook <keescook@chromium.org> wrote:

> On Wed, Apr 24, 2019 at 10:42 PM Ingo Molnar <mingo@kernel.org> wrote:
> > Just to make clear, is the change from the old behavior, in essence:
> >
> >
> >                CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
> >   ELF:              |           |                  |                  |
> >   ------------------------------|------------------|------------------|
> >   missing GNU_STACK | exec-all  | exec-all         | exec-none        |
> > - GNU_STACK == RWX  | exec-all  | exec-all         | exec-all         |
> > + GNU_STACK == RWX  | exec-all  | exec-stack       | exec-stack       |
> >   GNU_STACK == RW   | exec-all  | exec-none        | exec-none        |
> > [...]
> >   'exec-all'  : all user mappings are executable
>
> For extreme clarity, this should be:
>
>   'exec-all'  : all PROT_READ user mappings are executable, except when
>                 backed by files on a noexec-filesystem.
>
> >   'exec-none' : only PROT_EXEC user mappings are executable
> >   'exec-stack': only the stack and PROT_EXEC user mappings are executable
>
> Thanks for helping clarify this. I spent last evening trying to figure
> out a better way to explain/illustrate this series; my prior patch
> combines too many things into a single change. One thing I noticed is
> the "lacks NX" column is wrong: for "lack NX", our current state is
> "don't care". If we _add_ RIE for the "lacks NX" case unconditionally,
> we may cause unexpected problems[1]. More on this below...

So what does RIE in the !NX case do to regular RAM (with the exception of
device memory, see below), does it actively reject or modify actual
mmap() calls and introduce behavioral changes, or is it mostly just the
/proc reporting of permission bits?

If it's just reporting, with no (intended) behavioral side effects, then
is there really a true difference?

> But yes, your above diff for "has NX" is roughly correct. I'll walk
> through each piece I'm thinking about. Here is the current state:
>
>                CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
>   ELF:              |            |                  |                |
>   -------------------------------|------------------|----------------|
>   missing GNU_STACK | exec-all   | exec-all         | exec-all       |
>   GNU_STACK == RWX  | exec-all   | exec-all         | exec-all       |
>   GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
>
> *this column has no architecture effect: NX markings are ignored by
> hardware, but may have behavioral effects when "wants X" collides with
> "cannot be X" constraints in memory permission flags, as in [1].

So [1] appears to be device driver mapping a BAR that isn't intended to
be executable:

  https://lore.kernel.org/netdev/20190418055759.GA3155@mellanox.com/

and the question is, do we reject this at the device driver mmap() level
already, right?

I suspect the best behavior is to reject as early as possible, so I agree
with your change here - even though !NX systems tend to become less and
less relevant these days.

( User-space can still work around it in practice by not using PROT_EXEC
  and sending CPU execution there - with undefined/undesirable outcomes,
  but that's user-space getting what they are asking for. )

> I want to make three changes, listed in increasing risk levels.
>
> First, I want to split "missing GNU_STACK" and "GNU_STACK == RWX",
> which is currently causing unexpected behavior for driver mmap
> regions[1], etc:
>
>                CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
>   ELF:              |            |                  |                |
>   -------------------------------|------------------|----------------|
>   missing GNU_STACK | exec-all   | exec-all         | exec-all       |
> - GNU_STACK == RWX  | exec-all   | exec-all         | exec-all       |
> + GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
>   GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
>
> AFAICT, this has the least risk. I'm not aware of any situation where
> GNU_STACK==RWX is supposed to mean MORE than that. As Jann researched,
> even thread stacks will be treated correctly[2]. The risk would be
> discovering some use-case where a program was executing memory that it
> had not explicitly marked as executable. For ELFs marked with
> GNU_STACK, this seems unlikely (I hope).

Ack: and this actively increases security for GNU_STACK=RWX executables,
as it modifies exec-all to exec-stack, which narrows executability in a
real way, and is enforced by NX CPUs both in 64-bit and 32-bit apps.

While obviously the executable stack is a gaping hole in the typical
case, not all attacks can utilize an executable stack and they might be
able to utilize other W+X regions such as the heap or some data mmap()
area, right?

BTW, do we have any compat variations with the table, i.e. tasks running
on a 32-bit kernel versus running in 32-bit mode on a 64-bit kernel? I.e.
should there be another column for compat, or is compat behavior always
the same as 32-bit kernel behavior?

> Second, I want to split the behavior of "missing GNU_STACK" between
> ia32 and x86_64. The reasonable(?) default for x86_64 memory is for it
> to be NX. For the very rare x86_64 systems that do not have NX, this
> shouldn't change anything because they still fall into the "don't
> care" column. It would look like this:
>
>                CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
>   ELF:              |            |                  |                |
>   -------------------------------|------------------|----------------|
> - missing GNU_STACK | exec-all   | exec-all         | exec-all       |
> + missing GNU_STACK | exec-all   | exec-all         | exec-none      |
>   GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
>   GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
>
> This carries some risk that there are ancient x86_64 binaries that
> still behave like their even more ancient ia32 counterparts, and
> expect to be able to execute any memory. I would _hope_ this is rare,
> but I have no way to actually know if things like this exist in the
> real world.

Ack: this too actively restricts executability which is the right
direction to go. (Absent reported regressions.)

> Third, I want to have the "lacks NX" column actually reflect reality.
> Right now on such a system, memory permissions will show "not
> executable" but there is actually no architectural checking for these
> permissions. I think the true nature of such a system should be
> reflected in the reported permissions. It would look like this:
>
>                CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
>   ELF:              |            |                  |                |
>   -------------------------------|------------------|----------------|
>   missing GNU_STACK | exec-all   | exec-all         | exec-none      |
> - GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
> - GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
> + GNU_STACK == RWX  | exec-all   | exec-stack       | exec-stack     |
> + GNU_STACK == RW   | exec-all   | exec-none        | exec-none      |
>
> This carries the largest risk because it effectively enables
> READ_IMPLIES_EXEC on all processes for such systems. I worry this
> might trip as-yet-unseen problems like in [1], for only cosmetic
> improvements.
>
> My intention was to split up the series and likely not even bother
> with the third change, since it feels like too high a risk to me. What
> do you think?

So what's the benefit of this third phase, more transparency because
reported permissions and API behavior will match reality?

Can we do something else perhaps and do the phase 3 change for *RAM*
backed vmas, but be more restrictive for device mappings, allowing [1]
to be handled better?

I suspect there will be a complexity threshold where it's better to
default to the simpler approach though, especially since !NX gets so
little attention and testing these days. So I'd be fine with phase 3 too.

I'd definitely suggest making this 3 separate patches, so any regressions
can be tracked back to the specific change that triggers it.

Thanks,

	Ingo
On Thu, Apr 25, 2019 at 10:07:25PM +0200, Ingo Molnar wrote:

> > But yes, your above diff for "has NX" is roughly correct. I'll walk
> > through each piece I'm thinking about. Here is the current state:
> >
> >                CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
> >   ELF:              |            |                  |                |
> >   missing GNU_STACK | exec-all   | exec-all         | exec-all       |
> >   GNU_STACK == RWX  | exec-all   | exec-all         | exec-all       |
> >   GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
> >
> > *this column has no architecture effect: NX markings are ignored by
> > hardware, but may have behavioral effects when "wants X" collides with
> > "cannot be X" constraints in memory permission flags, as in [1].
>
> So [1] appears to be device driver mapping a BAR that isn't intended to
> be executable:
>
>   https://lore.kernel.org/netdev/20190418055759.GA3155@mellanox.com/
>
> and the question is, do we reject this at the device driver mmap() level
> already, right?

No, we wanted to reject it at the driver mmap() level, but if an
executable is marked with GNU_STACK=RWX then the core mm code always
calls the driver with VM_EXEC (even though the mmap isn't a stack), and
the driver becomes incompatible with userspace using GNU_STACK=RWX
(i.e. some Fortran programs, apparently).

> I suspect the best behavior is to reject as early as possible, so I agree
> with your change here - even though !NX systems tend to become less and
> less relevant these days.

I suggested the idea of adding a flag in either the struct file or
the file_operations flag that says mmap is never to be executable for
this file, with the idea that most/all cdev users would set it.

Does that seem reasonable?

Jason
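The flag idea in the message above can be illustrated with a small user-space model. Everything here is hypothetical: no such flag is named in the thread, so `file_never_exec` simply stands in for "this file's mmap is never to be executable", and the driver's choice of -EINVAL as its rejection code is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE, PROT_EXEC */

/*
 * Hypothetical model of the core mm step: READ_IMPLIES_EXEC adds PROT_EXEC
 * to readable mappings unless the file has opted out via the proposed flag.
 */
int model_rie_prot(int prot, bool read_implies_exec, bool file_never_exec)
{
    assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);

    if ((prot & PROT_READ) && read_implies_exec && !file_never_exec)
        prot |= PROT_EXEC;
    return prot;
}

/*
 * Hypothetical driver-side check: refuse any executable mapping of the
 * device memory. The error code is an assumption.
 */
int model_driver_mmap(int prot)
{
    return (prot & PROT_EXEC) ? -EINVAL : 0;
}
```

Without the opt-out, a plain PROT_READ request from a READ_IMPLIES_EXEC process reaches the driver with PROT_EXEC set and is rejected; with it, the same request succeeds.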
Hello Kees, all, Sorry for the delayed response, I haven't had time to see this until now. On 25/04/2019 17:51, Kees Cook wrote: > On Wed, Apr 24, 2019 at 10:42 PM Ingo Molnar <mingo@kernel.org> wrote: >> Just to make clear, is the change from the old behavior, in essence: >> >> >> CPU: | lacks NX | has NX, ia32 | has NX, x86_64 | >> ELF: | | | | >> ------------------------------|------------------|------------------| >> missing GNU_STACK | exec-all | exec-all | exec-none | >> - GNU_STACK == RWX | exec-all | exec-all | exec-all | >> + GNU_STACK == RWX | exec-all | exec-stack | exec-stack | >> GNU_STACK == RW | exec-all | exec-none | exec-none | >> [...] >> 'exec-all' : all user mappings are executable > For extreme clarity, this should be: > > 'exec-all' : all PROT_READ user mappings are executable, except when > backed by files on a noexec-filesystem. > >> 'exec-none' : only PROT_EXEC user mappings are executable >> 'exec-stack': only the stack and PROT_EXEC user mappings are executable > Thanks for helping clarify this. I spent last evening trying to figure > out a better way to explain/illustrate this series; my prior patch > combines too many things into a single change. One thing I noticed is > the "lacks NX" column is wrong: for "lack NX", our current state is > "don't care". If we _add_ RIE for the "lacks NX" case unconditionally, > we may cause unexpected problems[1]. More on this below... > > But yes, your above diff for "has NX" is roughly correct. I'll walk > through each piece I'm thinking about. 
Here is the current state: > > CPU: | lacks NX* | has NX, ia32 | has NX, x86_64 | > ELF: | | | | > -------------------------------|------------------|----------------| > missing GNU_STACK | exec-all | exec-all | exec-all | > GNU_STACK == RWX | exec-all | exec-all | exec-all | > GNU_STACK == RW | exec-none | exec-none | exec-none | > > *this column has no architecture effect: NX markings are ignored by > hardware, but may have behavioral effects when "wants X" collides with > "cannot be X" constraints in memory permission flags, as in [1]. > > > I want to make three changes, listed in increasing risk levels. > > First, I want to split "missing GNU_STACK" and "GNU_STACK == RWX", > which is currently causing expected behavior for driver mmap > regions[1], etc: > > CPU: | lacks NX* | has NX, ia32 | has NX, x86_64 | > ELF: | | | | > -------------------------------|------------------|----------------| > missing GNU_STACK | exec-all | exec-all | exec-all | > - GNU_STACK == RWX | exec-all | exec-all | exec-all | > + GNU_STACK == RWX | exec-stack | exec-stack | exec-stack | > GNU_STACK == RW | exec-none | exec-none | exec-none | > > AFAICT, this has the least risk. I'm not aware of any situation where > GNU_STACK==RWX is supposed to mean MORE than that. As Jann researched, > even thread stacks will be treated correctly[2]. The risk would be > discovering some use-case where a program was executing memory that it > had not explicitly marked as executable. For ELFs marked with > GNU_STACK, this seems unlikely (I hope). I agree that "missing GNU_STACK" is not the same than GNU_STACK==RWX and this should be handled differently. There is a clear security benefit if we don't assume that GNU_STACK==RWX means more than that. My initial patch intended to prevent that on modern 64-bit programs where explicitly marked executable stack, they are forced to have the READ_IMPLIES_EXEC state when no such thing is needed. 
READ_IMPLIES_EXEC can also be set via personality(), so the unlikely
applications that execute memory they have not explicitly marked as
executable could just use the READ_IMPLIES_EXEC personality, right?
Adding a flag that prevents the core mm from calling the driver with
VM_EXEC can prevent [1].

So I'm completely fine with the "first" change.

> Second, I want to split the behavior of "missing GNU_STACK" between
> ia32 and x86_64. The reasonable(?) default for x86_64 memory is for it
> to be NX. For the very rare x86_64 systems that do not have NX, this
> shouldn't change anything because they still fall into the "don't
> care" column. It would look like this:
>
>                CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
>   ELF:              |            |                  |                |
>   -------------------------------|------------------|----------------|
> - missing GNU_STACK | exec-all   | exec-all         | exec-all       |
> + missing GNU_STACK | exec-all   | exec-all         | exec-none      |
>   GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
>   GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
>
> This carries some risk that there are ancient x86_64 binaries that
> still behave like their even more ancient ia32 counterparts, and
> expect to be able to execute any memory. I would _hope_ this is rare,
> but I have no way to actually know if things like this exist in the
> real world.

This "second" change only affects "missing GNU_STACK" programs, so both
the benefits and the risks apply only to ancient applications. This is
not a big deal, and I would go ahead and apply this "second" change.

Maybe I'm missing something, but why can't we use personalities for
ancient x86_64 binaries that expect to execute any memory? Again, we can
add a flag that prevents the core mm from calling the driver with
VM_EXEC.

> Third, I want to have the "lacks NX" column actually reflect reality.
> Right now on such a system, memory permissions will show "not
> executable" but there is actually no architectural checking for these
> permissions. I think the true nature of such a system should be
> reflected in the reported permissions. It would look like this:
>
>                CPU: | lacks NX*  | has NX, ia32     | has NX, x86_64 |
>   ELF:              |            |                  |                |
>   -------------------------------|------------------|----------------|
>   missing GNU_STACK | exec-all   | exec-all         | exec-none      |
> - GNU_STACK == RWX  | exec-stack | exec-stack       | exec-stack     |
> - GNU_STACK == RW   | exec-none  | exec-none        | exec-none      |
> + GNU_STACK == RWX  | exec-all   | exec-stack       | exec-stack     |
> + GNU_STACK == RW   | exec-all   | exec-none        | exec-none      |
>
> This carries the largest risk because it effectively enables
> READ_IMPLIES_EXEC on all processes for such systems. I worry this
> might trip as-yet-unseen problems like in [1], for only cosmetic
> improvements.

Also, as you pointed out, if there are file-backed mappings on noexec
filesystems, should we then remove the "x" to reflect reality?

If we want to reflect reality, there are other things we are missing.
For example, on i386 a write-only memory region can be read. So, if we
have a "write-only" memory region, should we expect "rw-" on systems
with NX and "rwx" on systems that lack NX? There are probably other
situations I'm not considering here.

I'm not sure about the unseen issues this could introduce, but if we
want to reflect reality, why shouldn't we do the same for other
permissions? I am not sure it is worth it just for cosmetic reasons.

> My intention was to split up the series and likely not even bother
> with the third change, since it feels like too high a risk to me. What
> do you think?
>
>> In particular, what is the policy for write-only and exec-only mappings,
>> what does read-implies-exec do for them?
> First it manifests here, which is used for stack and brk:
>
> #define VM_DATA_DEFAULT_FLAGS \
>         (((current->personality & READ_IMPLIES_EXEC) ? VM_EXEC : 0 ) | \
>          VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
>
> above is used in do_brk_flags(), and is picked up by
> VM_STACK_DEFAULT_FLAGS, visible in VM_STACK_FLAGS for
> setup_arg_pages()'s stack creation.
>
> READ_IMPLIES_EXEC itself is checked directly in mmap, with noexec
> checks that also clear VM_MAYEXEC:
>
>         if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
>                 if (!(file && path_noexec(&file->f_path)))
>                         prot |= PROT_EXEC;
> ...
>                         if (path_noexec(&file->f_path)) {
>                                 if (vm_flags & VM_EXEC)
>                                         return -EPERM;
>                                 vm_flags &= ~VM_MAYEXEC;
>
> The above is where we discussed adding some kind of check for device
> driver memory mapping in [1] (or getting distros to mount /dev noexec,
> which seems to break other things...), but I'd rather just fix
> READ_IMPLIES_EXEC.
>
> Write-only would ignore READ_IMPLIES_EXEC, but mprotect() rechecks it
> if PROT_READ gets added later:
>
>         const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
>                                 (prot & PROT_READ);
> ...
>                 /* Does the application expect PROT_READ to imply PROT_EXEC */
>                 if (rier && (vma->vm_flags & VM_MAYEXEC))
>                         prot |= PROT_EXEC;
>
>> Also, it would be nice to define it precisely what 'stack' means in this
>> context: it's only the ELF loader defined process stack - other stacks
>> such as any thread stacks, signal stacks or alt-stacks depend on the C
>> library - or does the kernel policy extend there too?
> Correct: this is only the ELF loader stack. Thread stacks are (and
> always have been) on their own. But as Jann found in [2], they should
> be unchanged by anything here.
>
>> I.e. it would be nice to clarify all this, because it's still rather
>> confusing and ambiguous right now.
> Agreed. I've been trying to pick it apart too, hopefully this helps.
>
> -Kees
>
> [1] https://lkml.kernel.org/r/20190418055759.GA3155@mellanox.com
> [2] https://lore.kernel.org/patchwork/patch/464875/
>

Anyway, thank you for handling this. I would also like to see this fixed.

Hector.
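
The mmap() and mprotect() checks quoted in the message above can be modeled in user space roughly as follows. This is only a sketch, not kernel code: `model_mmap_prot()` and `model_mprotect_prot()` are hypothetical names, and the boolean parameters stand in for `current->personality & READ_IMPLIES_EXEC`, `file != NULL`, `path_noexec(&file->f_path)`, and `vma->vm_flags & VM_MAYEXEC`.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE, PROT_EXEC */

/*
 * Model of the mmap() path: PROT_READ gains PROT_EXEC when the
 * READ_IMPLIES_EXEC personality is set, unless the mapping is backed
 * by a file that lives on a noexec mount.
 */
static int model_mmap_prot(int prot, bool read_implies_exec,
                           bool file_backed, bool noexec_mount)
{
    /* Only the three basic protection bits are modeled here. */
    assert((prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC)) == 0);

    if ((prot & PROT_READ) && read_implies_exec)
        if (!(file_backed && noexec_mount))
            prot |= PROT_EXEC;
    return prot;
}

/*
 * Model of the mprotect() recheck: if PROT_READ is requested later,
 * PROT_EXEC is added only when the mapping may still become executable.
 */
static int model_mprotect_prot(int prot, bool read_implies_exec,
                               bool vma_may_exec)
{
    const bool rier = read_implies_exec && (prot & PROT_READ);

    if (rier && vma_may_exec)
        prot |= PROT_EXEC;
    return prot;
}
```

In this model a write-only request such as `model_mmap_prot(PROT_WRITE, true, false, false)` comes back unchanged, which matches the statement that write-only mappings ignore READ_IMPLIES_EXEC until PROT_READ is added.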
diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 6adc1a90e7e6..f1bb4b388b8f 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -107,7 +107,13 @@
  */
 #define elf_check_arch(x)		((x)->e_machine == EM_AARCH64)

-#define elf_read_implies_exec(ex,stk)	(stk != EXSTACK_DISABLE_X)
+/*
+ * 64-bit processes should not automatically gain READ_IMPLIES_EXEC. Only
+ * 32-bit processes without PT_GNU_STACK should trigger READ_IMPLIES_EXEC
+ * out of an abundance of caution against ancient toolchains not knowing
+ * how to mark memory protection flags correctly.
+ */
+#define compat_elf_read_implies_exec(ex, stk)	(stk == EXSTACK_DEFAULT)

 #define CORE_DUMP_USE_REGSET
 #define ELF_EXEC_PAGESIZE	PAGE_SIZE
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 69c0f892e310..5e65f1dcefc9 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -280,10 +280,28 @@ extern u32 elf_hwcap2;

 /*
  * An executable for which elf_read_implies_exec() returns TRUE will
- * have the READ_IMPLIES_EXEC personality flag set automatically.
+ * have the READ_IMPLIES_EXEC personality flag set automatically. This
+ * is needed either to show the truth about a memory mapping (i.e. CPUs
+ * that lack NX have all memory implicitly executable, so this makes
+ * sure that the visible permissions reflect reality), or to deal with
+ * old toolchains on new CPUs. Old binaries entirely lacking a GNU_STACK
+ * indicate they were likely built with a toolchain that has no idea about
+ * memory permissions, and so we must default to the lowest reasonable
+ * common denominator for the architecture: on ia32 we assume all memory
+ * to be executable by default, and on x86_64 we assume all memory to be
+ * non-executable by default.
+ *
+ *              CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
+ * ELF:              |           |                  |                  |
+ * ------------------------------|------------------|------------------|
+ * missing GNU_STACK | needs RIE | needs RIE        | no RIE           |
+ * GNU_STACK == RWX  | needs RIE | no RIE: stack X  | no RIE: stack X  |
+ * GNU_STACK == RW   | needs RIE | no RIE: stack NX | no RIE: stack NX |
+ *
  */
-#define elf_read_implies_exec(ex, executable_stack)	\
-	(executable_stack != EXSTACK_DISABLE_X)
+#define elf_read_implies_exec(ex, stk)			\
+	(!(__supported_pte_mask & _PAGE_NX) ||		\
+	 (mmap_is_ia32() && stk == EXSTACK_DEFAULT))

 struct task_struct;
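
The truth table in the x86 comment above can be checked against the new macro with a small user-space model. This is only a sketch under stated assumptions: `model_read_implies_exec()`, `cpu_has_nx`, and `is_ia32` are hypothetical stand-ins for the kernel's `elf_read_implies_exec()`, `__supported_pte_mask & _PAGE_NX`, and `mmap_is_ia32()`, and the `EXSTACK_*` values are assumed to mirror include/linux/binfmts.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's EXSTACK_* constants. */
enum exstack {
    EXSTACK_DEFAULT,    /* no PT_GNU_STACK header present */
    EXSTACK_DISABLE_X,  /* PT_GNU_STACK present, stack is RW */
    EXSTACK_ENABLE_X    /* PT_GNU_STACK present, stack is RWX */
};

/*
 * User-space model of the patched x86 macro:
 *   - a CPU without NX always needs READ_IMPLIES_EXEC (RIE);
 *   - with NX, only ia32 binaries that lack PT_GNU_STACK need RIE.
 */
static bool model_read_implies_exec(bool cpu_has_nx, bool is_ia32,
                                    enum exstack stk)
{
    assert(stk == EXSTACK_DEFAULT || stk == EXSTACK_DISABLE_X ||
           stk == EXSTACK_ENABLE_X);

    return !cpu_has_nx || (is_ia32 && stk == EXSTACK_DEFAULT);
}
```

For example, `model_read_implies_exec(true, false, EXSTACK_DEFAULT)` evaluates to false, which corresponds to the "has NX, x86_64" / "missing GNU_STACK" cell of the table.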
The READ_IMPLIES_EXEC work-around was designed for old CPUs lacking NX
(to have the visible permission flags on memory regions reflect reality:
they are all executable), and for old toolchains that lacked the ELF
PT_GNU_STACK marking (under the assumption that toolchains that couldn't
even specify memory protection flags may have it wrong for all memory
regions).

This logic is sensible, but was implemented in a way that equated having
a PT_GNU_STACK marked executable as being as "broken" as lacking the
PT_GNU_STACK marking entirely. This is not a reasonable assumption
for CPUs that have had NX support from the start (or very close to
the start). This confusion has led to situations where modern 64-bit
programs with explicitly marked executable stack are forced into the
READ_IMPLIES_EXEC state when no such thing is needed. (And leads to
unexpected failures when mmap()ing regions of device driver memory that
wish to disallow VM_EXEC[1].)

To fix this, elf_read_implies_exec() is adjusted on arm64 (where NX has
always existed and toolchains have implemented PT_GNU_STACK for a while),
and x86 is adjusted to handle this combination of possible outcomes:

              CPU: | lacks NX  | has NX, ia32     | has NX, x86_64   |
 ELF:              |           |                  |                  |
 ------------------------------|------------------|------------------|
 missing GNU_STACK | needs RIE | needs RIE        | no RIE           |
 GNU_STACK == RWX  | needs RIE | no RIE: stack X  | no RIE: stack X  |
 GNU_STACK == RW   | needs RIE | no RIE: stack NX | no RIE: stack NX |

This has the effect of making binfmt_elf's EXSTACK_DEFAULT actually take
on the correct architecture default of being non-executable on arm64 and
x86_64, and being executable on ia32.

[1] https://lkml.kernel.org/r/20190418055759.GA3155@mellanox.com

Suggested-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Kees Cook <keescook@chromium.org>
---
v2: adjust arm64 to avoid is_compat_task() (marc.w.gonzalez)
---
 arch/arm64/include/asm/elf.h |  8 +++++++-
 arch/x86/include/asm/elf.h   | 24 +++++++++++++++++++++---
 2 files changed, 28 insertions(+), 4 deletions(-)
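
The three ELF rows of the table above ("missing GNU_STACK", "GNU_STACK == RWX", "GNU_STACK == RW") correspond to the three EXSTACK_* states that the ELF loader derives from the program headers. The following is a minimal user-space sketch of that classification, assuming the loader treats the absence of a PT_GNU_STACK header as EXSTACK_DEFAULT and otherwise looks only at the PF_X flag; `classify_gnu_stack()` is a hypothetical helper, not a kernel function, and the enum values are assumed to mirror the kernel's constants.

```c
#include <assert.h>
#include <elf.h>      /* Elf64_Phdr, PT_GNU_STACK, PF_X */
#include <stddef.h>

/* Assumed to mirror the kernel's EXSTACK_* constants. */
enum exstack { EXSTACK_DEFAULT, EXSTACK_DISABLE_X, EXSTACK_ENABLE_X };

/*
 * Walk the program headers and report how the binary marks its stack.
 * If several PT_GNU_STACK headers exist, the last one wins in this model.
 */
static enum exstack classify_gnu_stack(const Elf64_Phdr *phdrs, size_t count)
{
    enum exstack result = EXSTACK_DEFAULT;  /* "missing GNU_STACK" */

    assert(phdrs != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        if (phdrs[i].p_type != PT_GNU_STACK)
            continue;
        if (phdrs[i].p_flags & PF_X)
            result = EXSTACK_ENABLE_X;      /* "GNU_STACK == RWX" */
        else
            result = EXSTACK_DISABLE_X;     /* "GNU_STACK == RW" */
    }
    return result;
}
```

A binary whose headers contain no PT_GNU_STACK entry is classified as EXSTACK_DEFAULT here, which is the case whose meaning the patch changes on x86_64 and arm64.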