Message ID | cover.1604393169.git.szabolcs.nagy@arm.com (mailing list archive) |
---|---|
Series | aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831] |
On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:

> Re-mmap executable segments instead of mprotecting them in case mprotect is seccomp filtered.

> For the kernel mapped main executable we don't have the fd for re-mmap so linux needs to be updated to add BTI. (In the presence of seccomp filters for mprotect(PROT_EXEC) the libc cannot change BTI protection at runtime based on user space policy so it is better if the kernel maps BTI compatible binaries with PROT_BTI by default.)

Given that there were still some ongoing discussions on a more robust kernel interface here, and there seem to be a few concerns with this series, should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being? That seems safer than rolling out things that set ABI quickly; a big part of the reason we went with having the dynamic linker enable PROT_BTI in the first place was to give us more flexibility to handle any unforeseen consequences of enabling BTI that we run into. We are going to have similar issues with other features like MTE, so we need to make sure that whatever we're doing works with them too.

Also updated to Will's current e-mail address - Will, do you have thoughts on what we should do here?
Hi,

On 11/3/20 11:34 AM, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
>
>> Re-mmap executable segments instead of mprotecting them in case mprotect is seccomp filtered.
>
>> For the kernel mapped main executable we don't have the fd for re-mmap so linux needs to be updated to add BTI. (In the presence of seccomp filters for mprotect(PROT_EXEC) the libc cannot change BTI protection at runtime based on user space policy so it is better if the kernel maps BTI compatible binaries with PROT_BTI by default.)
>
> Given that there were still some ongoing discussions on a more robust kernel interface here and there seem to be a few concerns with this series should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being? That seems safer than rolling out things that set ABI quickly, a big part of the

So, that's a bigger hammer than I think is needed, and it punishes !BTI machines. I'm going to suggest that if we need to carry a temp patch, it's more like the glibc patch I mentioned in the Fedora defect. That patch simply logs a message on the mprotect failures rather than aborting. It's fairly non-intrusive.

That leaves seccomp functional, and BTI generally functional except when seccomp is restricting it. I've also been asked that if a patch like that is needed, it's (temporarily?) merged to the glibc trunk, rather than just being carried by the distros.

Thanks,

> reason we went with having the dynamic linker enable PROT_BTI in the first place was to give us more flexibility to handle any unforseen consequences of enabling BTI that we run into. We are going to have similar issues with other features like MTE so we need to make sure that whatever we're doing works with them too.
>
> Also updated to Will's current e-mail address - Will, do you have thoughts on what we should do here?
The 11/03/2020 23:41, Jeremy Linton wrote:
> On 11/3/20 11:34 AM, Mark Brown wrote:
> > On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> >
> > > Re-mmap executable segments instead of mprotecting them in case mprotect is seccomp filtered.
> >
> > > For the kernel mapped main executable we don't have the fd for re-mmap so linux needs to be updated to add BTI. (In the presence of seccomp filters for mprotect(PROT_EXEC) the libc cannot change BTI protection at runtime based on user space policy so it is better if the kernel maps BTI compatible binaries with PROT_BTI by default.)
> >
> > Given that there were still some ongoing discussions on a more robust kernel interface here and there seem to be a few concerns with this series should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being? That seems safer than rolling out things that set ABI quickly, a big part of the
>
> So, that's a bigger hammer than I think is needed and punishes !BTI machines. I'm going to suggest that if we need to carry a temp patch its more like the glibc patch I mentioned in the Fedora defect. That patch simply logs a message, on the mprotect failures rather than aborting. Its fairly non-intrusive.
>
> That leaves seccomp functional, and BTI generally functional except when seccomp is restricting it. I've also been asked that if a patch like that is needed, its (temporary?) merged to the glibc trunk, rather than just being carried by the distro's.

note that changing mprotect into mmap in glibc works even if the kernel or systemd decides to do things differently: currently the only wart is that on the main exe we have to use mprotect and silently ignore the failures. (but e.g. some return to libc attacks are reliably prevented since libc.so is guaranteed to have PROT_BTI this way.)

the patchset is a bit more intrusive than i hoped for, so if we expect to get a new syscall then simply ignoring mprotect failures may be a better temporary solution. (and i still need to do benchmarks to see if the cost of re-mmap is not much more than the cost of mprotect.)

changing the seccomp filter in systemd does not help if there are other seccomp filters elsewhere doing the same. i think we will have to avoid using mprotect(PROT_EXEC) or at least ignore the failure.

> > reason we went with having the dynamic linker enable PROT_BTI in the first place was to give us more flexibility to handle any unforseen consequences of enabling BTI that we run into. We are going to have similar issues with other features like MTE so we need to make sure that whatever we're doing works with them too.
> >
> > Also updated to Will's current e-mail address - Will, do you have thoughts on what we should do here?
On 3.11.2020 19.34, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
>
>> Re-mmap executable segments instead of mprotecting them in case mprotect is seccomp filtered.
>
>> For the kernel mapped main executable we don't have the fd for re-mmap so linux needs to be updated to add BTI. (In the presence of seccomp filters for mprotect(PROT_EXEC) the libc cannot change BTI protection at runtime based on user space policy so it is better if the kernel maps BTI compatible binaries with PROT_BTI by default.)
>
> Given that there were still some ongoing discussions on a more robust kernel interface here and there seem to be a few concerns with this series should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being?

Filtering mprotect() and mmap() with seccomp also protects BTI, since without it the attacker could remove PROT_BTI from existing pages, or map new pages without BTI. This would be possible even with SARA or SELinux execmem protections enabled, since they don't care about PROT_BTI.

-Topi
On Tue, Nov 03, 2020 at 05:34:38PM +0000, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
>
> > Re-mmap executable segments instead of mprotecting them in case mprotect is seccomp filtered.
>
> > For the kernel mapped main executable we don't have the fd for re-mmap so linux needs to be updated to add BTI. (In the presence of seccomp filters for mprotect(PROT_EXEC) the libc cannot change BTI protection at runtime based on user space policy so it is better if the kernel maps BTI compatible binaries with PROT_BTI by default.)
>
> Given that there were still some ongoing discussions on a more robust kernel interface here and there seem to be a few concerns with this series should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being? That seems safer than rolling out things that set ABI quickly, a big part of the reason we went with having the dynamic linker enable PROT_BTI in the first place was to give us more flexibility to handle any unforseen consequences of enabling BTI that we run into. We are going to have similar issues with other features like MTE so we need to make sure that whatever we're doing works with them too.
>
> Also updated to Will's current e-mail address - Will, do you have thoughts on what we should do here?

Changing the kernel to map the main executable with PROT_BTI by default is a user-visible change in behaviour and not without risk, so if we're going to do that then it needs to be opt-in because the current behaviour has been there since 5.8. I suppose we could shoe-horn in a cmdline option for 5.10 (which will be the first LTS with BTI) but it would be better to put up with the current ABI if possible.

Is there real value in this seccomp filter if it only looks at mprotect(), or was it just implemented because it's easy to do and sounds like a good idea?

Will
* Will Deacon:

> Is there real value in this seccomp filter if it only looks at mprotect(), or was it just implemented because it's easy to do and sounds like a good idea?

It seems bogus to me. Everyone will just create alias mappings instead, just like they did for the similar SELinux feature. See “Example code to avoid execmem violations” in:

<https://www.akkadia.org/drepper/selinux-mem.html>

As you can see, this reference implementation creates a PROT_WRITE mapping aliased to a PROT_EXEC mapping, so it actually reduces security compared to something that generates the code in an anonymous mapping and calls mprotect to make it executable.

Furthermore, it requires unusual cache flushing code on some AArch64 implementations (a requirement that is not shared by any other Linux architecture to which libffi has been ported), resulting in hard-to-track-down real-world bugs.

Thanks,
Florian
On 4.11.2020 11.29, Florian Weimer wrote:
> * Will Deacon:
>
>> Is there real value in this seccomp filter if it only looks at mprotect(), or was it just implemented because it's easy to do and sounds like a good idea?
>
> It seems bogus to me. Everyone will just create alias mappings instead, just like they did for the similar SELinux feature. See “Example code to avoid execmem violations” in:
>
> <https://www.akkadia.org/drepper/selinux-mem.html>

Also note "But this is very dangerous: programs should never use memory regions which are writable and executable at the same time. Assuming that it is really necessary to generate executable code while the program runs the method employed should be reconsidered."

> As you can see, this reference implementation creates a PROT_WRITE mapping aliased to a PROT_EXEC mapping, so it actually reduces security compared to something that generates the code in an anonymous mapping and calls mprotect to make it executable.

Drepper's methods to avoid SELinux protections are indeed the two ways (that I'm aware of) to also avoid the seccomp filter: by using memfd_create() and by using a file system which is writable and executable to the process to create a new executable file. Both methods can be eliminated for many system services: memfd_create() with seccomp, and filesystem W&X with mount namespaces.

If a service legitimately needs executable and writable mappings (due to JIT, trampolines etc), it's easy to disable the filter whenever really needed with "MemoryDenyWriteExecute=no" (which is the default) in case of systemd or a TE rule like "allow type_t self:process { execmem };" for SELinux. But this shouldn't be the default case, since there are many services which don't need W&X.

I'd also question what is the value of BTI if it can be easily circumvented by removing PROT_BTI with mprotect()?

-Topi
On Tue, Nov 03, 2020 at 11:41:42PM -0600, Jeremy Linton wrote:
> On 11/3/20 11:34 AM, Mark Brown wrote:

> > Given that there were still some ongoing discussions on a more robust kernel interface here and there seem to be a few concerns with this series should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being? That seems safer than rolling out things that set ABI quickly, a big part of the

> So, that's a bigger hammer than I think is needed and punishes !BTI machines. I'm going to suggest that if we need to carry a temp patch its more like the glibc patch I mentioned in the Fedora defect. That patch simply logs a message, on the mprotect failures rather than aborting. Its fairly non-intrusive.

> That leaves seccomp functional, and BTI generally functional except when seccomp is restricting it. I've also been asked that if a patch like that is needed, its (temporary?) merged to the glibc trunk, rather than just being carried by the distro's.

The effect on pre-BTI hardware is an issue; another option would be for systemd to disable this seccomp usage, but only after checking for BTI support in the system rather than doing so purely based on the architecture.
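One way a service manager could check for BTI support in the system, as opposed to checking only the architecture, is to look at the arm64 hwcaps in the auxiliary vector. The following is a minimal sketch under stated assumptions and is not systemd code: the function names are hypothetical, and `ARM64_HWCAP2_BTI` is a local name for the arm64 `HWCAP2_BTI` bit (bit 17).

```c
#include <stdbool.h>
#include <sys/auxv.h>

/* Local name for the arm64 HWCAP2_BTI bit.  The bit only has this
   meaning on arm64; other architectures use AT_HWCAP2 differently.  */
#define ARM64_HWCAP2_BTI (1UL << 17)

/* Pure helper: does this arm64 AT_HWCAP2 value advertise BTI?  */
bool
hwcap2_has_bti (unsigned long hwcap2)
{
  return (hwcap2 & ARM64_HWCAP2_BTI) != 0;
}

/* Hypothetical runtime check.  On anything other than arm64 the answer
   is always "no", so a filter on mprotect(PROT_EXEC) could stay on.  */
bool
system_supports_bti (void)
{
#ifdef __aarch64__
  return hwcap2_has_bti (getauxval (AT_HWCAP2));
#else
  return false;
#endif
}
```

This only answers whether the hardware and kernel expose BTI. It says nothing about whether a given service binary is BTI compatible, which is the limitation raised in the replies.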
On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> On 4.11.2020 11.29, Florian Weimer wrote:
> > * Will Deacon:
> >
> > > Is there real value in this seccomp filter if it only looks at mprotect(), or was it just implemented because it's easy to do and sounds like a good idea?
> >
> > It seems bogus to me. Everyone will just create alias mappings instead, just like they did for the similar SELinux feature. See “Example code to avoid execmem violations” in:
> >
> > <https://www.akkadia.org/drepper/selinux-mem.html>

[...]

> > As you can see, this reference implementation creates a PROT_WRITE mapping aliased to a PROT_EXEC mapping, so it actually reduces security compared to something that generates the code in an anonymous mapping and calls mprotect to make it executable.

[...]

> If a service legitimately needs executable and writable mappings (due to JIT, trampolines etc), it's easy to disable the filter whenever really needed with "MemoryDenyWriteExecute=no" (which is the default) in case of systemd or a TE rule like "allow type_t self:process { execmem };" for SELinux. But this shouldn't be the default case, since there are many services which don't need W&X.

I think Drepper's point is that separate X and W mappings, with enough randomisation, would be more secure than allowing W&X at the same address (but, of course, less secure than not having W at all, though that's not always possible).

> I'd also question what is the value of BTI if it can be easily circumvented by removing PROT_BTI with mprotect()?

Well, BTI is a protection against JOP attacks. The assumption here is that an attacker cannot invoke mprotect() to disable PROT_BTI. If it can, it's probably not worth bothering with a subsequent JOP attack, it can already call functions directly.

I see MDWX not as a way of detecting attacks but rather plugging inadvertent security holes in certain programs. On arm64, such hardening currently gets in the way of another hardening feature, BTI.
On Wed, Nov 04, 2020 at 08:57:05AM +0000, Szabolcs Nagy wrote:
> The 11/03/2020 23:41, Jeremy Linton wrote:
> > On 11/3/20 11:34 AM, Mark Brown wrote:
> > > On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> > >
> > > > Re-mmap executable segments instead of mprotecting them in case mprotect is seccomp filtered.
> > >
> > > > For the kernel mapped main executable we don't have the fd for re-mmap so linux needs to be updated to add BTI. (In the presence of seccomp filters for mprotect(PROT_EXEC) the libc cannot change BTI protection at runtime based on user space policy so it is better if the kernel maps BTI compatible binaries with PROT_BTI by default.)
> > >
> > > Given that there were still some ongoing discussions on a more robust kernel interface here and there seem to be a few concerns with this series should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being? That seems safer than rolling out things that set ABI quickly, a big part of the
> >
> > So, that's a bigger hammer than I think is needed and punishes !BTI machines. I'm going to suggest that if we need to carry a temp patch its more like the glibc patch I mentioned in the Fedora defect. That patch simply logs a message, on the mprotect failures rather than aborting. Its fairly non-intrusive.
> >
> > That leaves seccomp functional, and BTI generally functional except when seccomp is restricting it. I've also been asked that if a patch like that is needed, its (temporary?) merged to the glibc trunk, rather than just being carried by the distro's.
>
> note that changing mprotect into mmap in glibc works even if the kernel or systemd decides to do things differently: currently the only wart is that on the main exe we have to use mprotect and silently ignore the failures.

Can the dynamic loader mmap() the main exe again while munmap'ing the original one? (sorry if it was already discussed)
* Catalin Marinas:

> Can the dynamic loader mmap() the main exe again while munmap'ing the original one? (sorry if it was already discussed)

No, we don't have a descriptor for that. /proc may not be mounted, and using the path stored there has a race condition anyway.

Thanks,
Florian
On 4.11.2020 16.35, Catalin Marinas wrote:
> On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
>> On 4.11.2020 11.29, Florian Weimer wrote:
>>> * Will Deacon:
>>>
>>>> Is there real value in this seccomp filter if it only looks at mprotect(), or was it just implemented because it's easy to do and sounds like a good idea?
>>>
>>> It seems bogus to me. Everyone will just create alias mappings instead, just like they did for the similar SELinux feature. See “Example code to avoid execmem violations” in:
>>>
>>> <https://www.akkadia.org/drepper/selinux-mem.html>
> [...]
>>> As you can see, this reference implementation creates a PROT_WRITE mapping aliased to a PROT_EXEC mapping, so it actually reduces security compared to something that generates the code in an anonymous mapping and calls mprotect to make it executable.
> [...]
>> If a service legitimately needs executable and writable mappings (due to JIT, trampolines etc), it's easy to disable the filter whenever really needed with "MemoryDenyWriteExecute=no" (which is the default) in case of systemd or a TE rule like "allow type_t self:process { execmem };" for SELinux. But this shouldn't be the default case, since there are many services which don't need W&X.
>
> I think Drepper's point is that separate X and W mappings, with enough randomisation, would be more secure than allowing W&X at the same address (but, of course, less secure than not having W at all, though that's not always possible).
>
>> I'd also question what is the value of BTI if it can be easily circumvented by removing PROT_BTI with mprotect()?
>
> Well, BTI is a protection against JOP attacks. The assumption here is that an attacker cannot invoke mprotect() to disable PROT_BTI. If it can, it's probably not worth bothering with a subsequent JOP attack, it can already call functions directly.

I suppose that the target for the attacker is to eventually perform system calls rather than looping forever in JOP/ROP gadgets.

> I see MDWX not as a way of detecting attacks but rather plugging inadvertent security holes in certain programs. On arm64, such hardening currently gets in the way of another hardening feature, BTI.

I don't think it has to get in the way at all. Why wouldn't something simple like this work:

diff --git a/elf/dl-load.c b/elf/dl-load.c
index 646c5dca40..12a74d15e8 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -1170,8 +1170,13 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	    c->prot |= PROT_READ;
 	  if (ph->p_flags & PF_W)
 	    c->prot |= PROT_WRITE;
-	  if (ph->p_flags & PF_X)
+	  if (ph->p_flags & PF_X) {
 	    c->prot |= PROT_EXEC;
+#ifdef PROT_BTI
+	    if (GLRO(dl_bti) & 1)
+	      c->prot |= PROT_BTI;
+#endif
+	  }
 #endif
 	  break;

diff --git a/elf/dl-support.c b/elf/dl-support.c
index 7704c101c5..22c7cc7b81 100644
--- a/elf/dl-support.c
+++ b/elf/dl-support.c
@@ -222,7 +222,7 @@ __rtld_lock_define_initialized_recursive (, _dl_load_write_lock)

 #ifdef HAVE_AUX_VECTOR
-int _dl_clktck;
+int _dl_clktck, _dl_bti;

 void
 _dl_aux_init (ElfW(auxv_t) *av)
@@ -294,6 +294,11 @@ _dl_aux_init (ElfW(auxv_t) *av)
       case AT_RANDOM:
 	_dl_random = (void *) av->a_un.a_val;
 	break;
+#ifdef PROT_BTI
+      case AT_BTI:
+	_dl_bti = (void *) av->a_un.a_val;
+	break;
+#endif
       DL_PLATFORM_AUXV
       }
   if (seen == 0xf)

Kernel sets the aux vector to indicate that BTI should be enabled for all segments and main exe is already protected.

-Topi
On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> On 4.11.2020 11.29, Florian Weimer wrote:
> > * Will Deacon:
> >
> > > Is there real value in this seccomp filter if it only looks at mprotect(), or was it just implemented because it's easy to do and sounds like a good idea?
> >
> > It seems bogus to me. Everyone will just create alias mappings instead, just like they did for the similar SELinux feature. See “Example code to avoid execmem violations” in:
> >
> > <https://www.akkadia.org/drepper/selinux-mem.html>
>
> Also note "But this is very dangerous: programs should never use memory regions which are writable and executable at the same time. Assuming that it is really necessary to generate executable code while the program runs the method employed should be reconsidered."

Sure, and to be clear we're not trying to violate the "at the same time" property. We do not want to permit simultaneous PROT_WRITE and PROT_EXEC at any instant in time. What we're asking is to not block changing permissions to PROT_EXEC in the absence of PROT_WRITE.

I think that the goal of preventing WRITE -> EXEC transitions for some memory is sane, but I think the existing kernel primitives available to systemd don't allow us to do that in a robust way because we don't have all the relevant state tracked and accessible, and the existing approach gets in the way of doing the right thing for other mitigations.

Consequently I think it would be better going forward to add a more robust (kernel) mechanism for enforcement that can distinguish WRITE->EXEC from EXEC->EXEC+BTI, and e.g. can be used to forbid aliasing mappings with differing W/X permissions. Then userspace could eventually transition over to that and get /stronger/ protection while permitting the BTI case we'd like to work now.

> If a service legitimately needs executable and writable mappings (due to JIT, trampolines etc), it's easy to disable the filter whenever really needed with "MemoryDenyWriteExecute=no" (which is the default) in case of systemd or a TE rule like "allow type_t self:process { execmem };" for SELinux. But this shouldn't be the default case, since there are many services which don't need W&X.
>
> I'd also question what is the value of BTI if it can be easily circumvented by removing PROT_BTI with mprotect()?

I agree that turning BTI off is a concern, and to that end I'd like to add an enforcement mechanism whereby we could prevent that (ideally the same mechanism by which we could prevent WRITE -> EXEC transitions).

But, as with all things, it's a matter of degree. MDWE and BTI are both hurdles to an adversary, but neither is absolute and there are approaches to bypass either. By the time someone's issuing mprotect() with an arbitrary VA and/or prot, they are liable to have been able to do the same with mmap() and circumvent MDWE.

I'd really like to not have BTI silently disabled in order to work with MDWE, because the risk is that it gets silently disabled elsewhere. The risk of changing the kernel to enable BTI for a binary is not well known, since we don't control other people's libraries that might end up not being compatible somehow with that. The risk of disabling a portion of the MDWE protections seems to be the least out of the options we have available, as unfortunate as it seems, and I think we can come up with a better MDWE approach going forward.

Thanks,
Mark.
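The difference between the existing filter and the kind of mechanism asked for above can be illustrated with two predicates. This is only a sketch and not kernel or systemd code: both function names are hypothetical, the stateless one mirrors the documented behaviour of systemd's MemoryDenyWriteExecute= for mprotect (reject any request with PROT_EXEC set), and the stateful one assumes the enforcer can see the mapping's current protection, which a seccomp filter cannot. A real mechanism would also need to track whether a mapping was ever writable and to handle aliased mappings; that is left out here.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Assumption: arm64 value of PROT_BTI, defined locally in case the
   headers in use do not provide it.  */
#ifndef PROT_BTI
# define PROT_BTI 0x10
#endif

/* What a stateless filter can see: only the requested protection.
   Any mprotect with PROT_EXEC is denied, including EXEC -> EXEC+BTI.  */
bool
stateless_denies_mprotect (int new_prot)
{
  return (new_prot & PROT_EXEC) != 0;
}

/* Hypothetical stateful policy that also knows the current protection
   of the mapping.  Simplified: history and aliases are ignored.  */
bool
stateful_denies_mprotect (int old_prot, int new_prot)
{
  if (!(new_prot & PROT_EXEC))
    return false;               /* not asking for executable memory */
  if (new_prot & PROT_WRITE)
    return true;                /* W and X at the same time */
  if (old_prot & PROT_WRITE)
    return true;                /* WRITE -> EXEC transition */
  if ((old_prot & PROT_BTI) && !(new_prot & PROT_BTI))
    return true;                /* silently dropping BTI */
  return false;                 /* e.g. EXEC -> EXEC+BTI is allowed */
}
```

With these definitions the request glibc wants to make, going from PROT_READ|PROT_EXEC to PROT_READ|PROT_EXEC|PROT_BTI, is rejected by the stateless predicate and accepted by the stateful one, while a later attempt to remove PROT_BTI is rejected only by the stateful one.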
The 11/04/2020 17:19, Topi Miettinen wrote:
> On 4.11.2020 16.35, Catalin Marinas wrote:
> > On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> > > On 4.11.2020 11.29, Florian Weimer wrote:
> > > > * Will Deacon:
> > > > > Is there real value in this seccomp filter if it only looks at mprotect(), or was it just implemented because it's easy to do and sounds like a good idea?
> > > >
> > > > It seems bogus to me. Everyone will just create alias mappings instead, just like they did for the similar SELinux feature. See “Example code to avoid execmem violations” in:
> > > >
> > > > <https://www.akkadia.org/drepper/selinux-mem.html>
> > [...]
> > > > As you can see, this reference implementation creates a PROT_WRITE mapping aliased to a PROT_EXEC mapping, so it actually reduces security compared to something that generates the code in an anonymous mapping and calls mprotect to make it executable.
> > [...]
> > > If a service legitimately needs executable and writable mappings (due to JIT, trampolines etc), it's easy to disable the filter whenever really needed with "MemoryDenyWriteExecute=no" (which is the default) in case of systemd or a TE rule like "allow type_t self:process { execmem };" for SELinux. But this shouldn't be the default case, since there are many services which don't need W&X.
> >
> > I think Drepper's point is that separate X and W mappings, with enough randomisation, would be more secure than allowing W&X at the same address (but, of course, less secure than not having W at all, though that's not always possible).
> >
> > > I'd also question what is the value of BTI if it can be easily circumvented by removing PROT_BTI with mprotect()?
> >
> > Well, BTI is a protection against JOP attacks. The assumption here is that an attacker cannot invoke mprotect() to disable PROT_BTI. If it can, it's probably not worth bothering with a subsequent JOP attack, it can already call functions directly.
>
> I suppose that the target for the attacker is to eventually perform system calls rather than looping forever in JOP/ROP gadgets.
>
> > I see MDWX not as a way of detecting attacks but rather plugging inadvertent security holes in certain programs. On arm64, such hardening currently gets in the way of another hardening feature, BTI.
>
> I don't think it has to get in the way at all. Why wouldn't something simple like this work:

PROT_BTI is only valid on binaries that are BTI compatible. to detect that, the load segments must be already mapped.

AT_BTI does not solve this: we want to be able to load legacy elf modules. (a BTI enforcement setting may be useful where incompatible modules are rejected, but that cannot be the default for backward compatibility reasons.)

>
> diff --git a/elf/dl-load.c b/elf/dl-load.c
> index 646c5dca40..12a74d15e8 100644
> --- a/elf/dl-load.c
> +++ b/elf/dl-load.c
> @@ -1170,8 +1170,13 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
>  	    c->prot |= PROT_READ;
>  	  if (ph->p_flags & PF_W)
>  	    c->prot |= PROT_WRITE;
> -	  if (ph->p_flags & PF_X)
> +	  if (ph->p_flags & PF_X) {
>  	    c->prot |= PROT_EXEC;
> +#ifdef PROT_BTI
> +	    if (GLRO(dl_bti) & 1)
> +	      c->prot |= PROT_BTI;
> +#endif
> +	  }
>  #endif
>  	  break;
>
> diff --git a/elf/dl-support.c b/elf/dl-support.c
> index 7704c101c5..22c7cc7b81 100644
> --- a/elf/dl-support.c
> +++ b/elf/dl-support.c
> @@ -222,7 +222,7 @@ __rtld_lock_define_initialized_recursive (, _dl_load_write_lock)
>
>  #ifdef HAVE_AUX_VECTOR
> -int _dl_clktck;
> +int _dl_clktck, _dl_bti;
>
>  void
>  _dl_aux_init (ElfW(auxv_t) *av)
> @@ -294,6 +294,11 @@ _dl_aux_init (ElfW(auxv_t) *av)
>        case AT_RANDOM:
>  	_dl_random = (void *) av->a_un.a_val;
>  	break;
> +#ifdef PROT_BTI
> +      case AT_BTI:
> +	_dl_bti = (void *) av->a_un.a_val;
> +	break;
> +#endif
>        DL_PLATFORM_AUXV
>        }
>    if (seen == 0xf)
>
> Kernel sets the aux vector to indicate that BTI should be enabled for all segments and main exe is already protected.
>
> -Topi
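The compatibility check mentioned in the reply above depends on a marking inside each ELF module, which is why a single system-wide flag cannot replace it. On AArch64 that marking lives in the GNU property note as a `GNU_PROPERTY_AARCH64_FEATURE_1_AND` entry with the BTI bit set. Below is a minimal sketch of scanning the descriptor of such a note, assuming the 64-bit layout where each property is padded to 8 bytes. The function name is hypothetical and this is not the glibc implementation; locating the note via `PT_GNU_PROPERTY` is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values from the AArch64 ELF ABI, defined locally so that the sketch
   does not depend on the installed <elf.h>.  */
#define GNU_PROPERTY_AARCH64_FEATURE_1_AND 0xc0000000u
#define GNU_PROPERTY_AARCH64_FEATURE_1_BTI 1u

/* Hypothetical helper: DESC points at the descriptor of an
   NT_GNU_PROPERTY_TYPE_0 note, DESCSZ is its size in bytes.
   Returns true only if the module is marked BTI compatible.  */
bool
note_desc_has_bti (const unsigned char *desc, size_t descsz)
{
  size_t off = 0;

  /* Each property is: uint32 type, uint32 datasz, data, padding.  */
  while (descsz - off >= 8)
    {
      uint32_t type, datasz;
      memcpy (&type, desc + off, 4);
      memcpy (&datasz, desc + off + 4, 4);
      off += 8;

      if (datasz > descsz - off)
        return false;           /* malformed note: treat as legacy */

      if (type == GNU_PROPERTY_AARCH64_FEATURE_1_AND && datasz == 4)
        {
          uint32_t features;
          memcpy (&features, desc + off, 4);
          return (features & GNU_PROPERTY_AARCH64_FEATURE_1_BTI) != 0;
        }

      /* Skip the data, rounded up to the 8-byte property alignment.  */
      size_t padded = ((size_t) datasz + 7) & ~(size_t) 7;
      if (padded > descsz - off)
        return false;
      off += padded;
    }
  return false;                 /* no marking: legacy module */
}
```

A module without the marking gets a false result here and would be mapped without PROT_BTI, which matches the requirement that legacy modules keep loading by default.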
Hi,

On 11/4/20 4:50 AM, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 11:41:42PM -0600, Jeremy Linton wrote:
>> On 11/3/20 11:34 AM, Mark Brown wrote:
>
>>> Given that there were still some ongoing discussions on a more robust kernel interface here and there seem to be a few concerns with this series should we perhaps just take a step back and disable this seccomp filter in systemd on arm64, at least for the time being? That seems safer than rolling out things that set ABI quickly, a big part of the
>
>> So, that's a bigger hammer than I think is needed and punishes !BTI machines. I'm going to suggest that if we need to carry a temp patch its more like the glibc patch I mentioned in the Fedora defect. That patch simply logs a message, on the mprotect failures rather than aborting. Its fairly non-intrusive.
>
>> That leaves seccomp functional, and BTI generally functional except when seccomp is restricting it. I've also been asked that if a patch like that is needed, its (temporary?) merged to the glibc trunk, rather than just being carried by the distro's.
>
> The effect on pre-BTI hardware is an issue, another option would be for systemd to disable this seccomp usage but only after checking for BTI support in the system rather than just doing so purely based on the architecture.

That works, but you're also losing seccomp in the case where the machine is BTI capable but the service isn't. So it should really be checking the ELF notes, but at that point you might just as well patch glibc.
On Wed, Nov 04, 2020 at 12:47:09PM -0600, Jeremy Linton wrote:
> On 11/4/20 4:50 AM, Mark Brown wrote:

> > The effect on pre-BTI hardware is an issue, another option would be for systemd to disable this seccomp usage but only after checking for BTI support in the system rather than just doing so purely based on the architecture.

> That works, but your also losing seccomp in the case where the machine is BTI capable, but the service isn't. So it should really be checking the elf notes, but at that point you might just as well patch glibc.

True, I guess I was assuming that a BTI rebuild is done at the distro level, but of course even if that's the case a system could have third party binaries, so you can't just assume that the world is BTI.
Hi, On 11/4/20 9:20 AM, Mark Rutland wrote: > On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote: >> On 4.11.2020 11.29, Florian Weimer wrote: >>> * Will Deacon: >>> >>>> Is there real value in this seccomp filter if it only looks at mprotect(), >>>> or was it just implemented because it's easy to do and sounds like a good >>>> idea? >>> >>> It seems bogus to me. Everyone will just create alias mappings instead, >>> just like they did for the similar SELinux feature. See “Example code >>> to avoid execmem violations” in: >>> >>> <https://www.akkadia.org/drepper/selinux-mem.html> >> >> Also note "But this is very dangerous: programs should never use memory >> regions which are writable and executable at the same time. Assuming that it >> is really necessary to generate executable code while the program runs the >> method employed should be reconsidered." > > Sure, and to be clear we're not trying to violate the "at the same time" > property. We do not want to permit simultaneous PROT_WRITE and PROT_EXEC > at any instant in time. What we're asking is to not block changing > permissions to PROT_EXEC in the absence of PROT_WRITE. > > I think that the goal of preventing WRITE -> EXEC transitions for some > memory is sane, but I think the existing kernel primitives available to > systemd don't allow us to do that in a robust way because we don't have > all the relevant state tracked and accessible, and the existing approach > gets in the way of doing the right thing for other mitigations. > > Consequently I think it would be better going forward to add a more > robust (kernel) mechanism for enforcement that can distinguish > WRITE->EXEC from EXEC->EXEC+BTI, and e.g. can be used to forbid aliasing > mappings with differing W/X permissions. Then userspace could eventually > transition over to that and get /stronger/ protection while permitting > the BTI case we'd like to work now. 
> >> If a service legitimately needs executable and writable mappings (due to >> JIT, trampolines, etc.), it's easy to disable the filter whenever really >> needed with "MemoryDenyWriteExecute=no" (which is the default) in case of >> systemd or a TE rule like "allow type_t self:process { execmem };" for >> SELinux. But this shouldn't be the default case, since there are many >> services which don't need W&X. >> >> I'd also question what is the value of BTI if it can be easily circumvented >> by removing PROT_BTI with mprotect()? > > I agree that turning BTI off is a concern, and to that end I'd like to > add an enforcement mechanism whereby we could prevent that (ideally the > same mechanism by which we could prevent WRITE -> EXEC transitions). > > But, as with all things it's a matter of degree. MDWE and BTI are both > hurdles to an adversary, but neither are absolutes and there are > approaches to bypass either. By the time someone's issuing mprotect() > with an arbitrary VA and/or prot, they are liable to have been able to > do the same with mmap() and circumvent MDWE. > > I'd really like to not have BTI silently disabled in order to work with > MDWE, because the risk is that it gets silently disabled elsewhere. The > risk of changing the kernel to enable BTI for a binary is not well > known since we don't control other people's libraries that might end up > not being compatible somehow with that. The risk of disabling a portion > of the MDWE protections seems to be the least out of the options we have > available, as unfortunate as it seems, and I think we can come up with a > better MDWE approach going forward. OTOH, you don't really want to blanket disable either protection, and unfortunately you can't really tell until it's too late if the service is fully BTI enabled. So you either end up disabling MDWE unnecessarily, or you delay until the only choice is not enabling BTI.
I guess there is another option too, which is some kind of delayed MDWE policy that only turns on once the service has started, but that isn't ideal either. > > Thanks, > Mark.
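To make the distinction Mark Rutland draws above more concrete (WRITE->EXEC versus EXEC->EXEC+BTI), here is a small C sketch contrasting two checks. The first models a filter that sees only the requested protection, as with the mprotect(PROT_EXEC) filtering described in this thread. The second is a purely hypothetical state-aware policy; the function names and the ever_writable flag are assumptions made for illustration, and no such kernel interface is implied. PROT_BTI is defined with its arm64 value (0x10) when the headers lack it.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* arm64 uapi value; only defined here if the libc headers lack it. */
#ifndef PROT_BTI
#define PROT_BTI 0x10
#endif

/* Argument-only view: the filter sees the requested prot and nothing
 * about the mapping's current or past state, so every request that
 * contains PROT_EXEC is refused. */
bool arg_only_filter_allows(int new_prot)
{
    return (new_prot & PROT_EXEC) == 0;
}

/* Hypothetical state-aware policy. The caller is assumed to track the
 * mapping's current prot and whether it has ever been writable. */
bool state_aware_check_allows(int old_prot, bool ever_writable, int new_prot)
{
    /* Never writable and executable at the same time. */
    if ((new_prot & PROT_WRITE) && (new_prot & PROT_EXEC))
        return false;
    /* Refuse WRITE -> EXEC: gaining EXEC on memory that was writable. */
    if ((new_prot & PROT_EXEC) && !(old_prot & PROT_EXEC) && ever_writable)
        return false;
    /* Refuse dropping BTI from a mapping that stays executable. */
    if ((old_prot & PROT_BTI) && (new_prot & PROT_EXEC) && !(new_prot & PROT_BTI))
        return false;
    return true;
}
```

Under these assumptions the EXEC->EXEC+BTI request that the dynamic linker issues is refused by the argument-only check but accepted by the state-aware one, while a WRITE->EXEC change and an attempt to strip PROT_BTI are both refused.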
The 11/04/2020 09:20, Will Deacon wrote: > On Tue, Nov 03, 2020 at 05:34:38PM +0000, Mark Brown wrote: > > On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote: > > > > > Re-mmap executable segments instead of mprotecting them in > > > case mprotect is seccomp filtered. > > > > > For the kernel mapped main executable we don't have the fd > > > for re-mmap so linux needs to be updated to add BTI. (In the > > > presence of seccomp filters for mprotect(PROT_EXEC) the libc > > > cannot change BTI protection at runtime based on user space > > > policy so it is better if the kernel maps BTI compatible > > > binaries with PROT_BTI by default.) > > > > Given that there were still some ongoing discussions on a more robust > > kernel interface here and there seem to be a few concerns with this > > series should we perhaps just take a step back and disable this seccomp > > filter in systemd on arm64, at least for the time being? That seems > > safer than rolling out things that set ABI quickly, a big part of the > > reason we went with having the dynamic linker enable PROT_BTI in the > > first place was to give us more flexibility to handle any unforeseen > > consequences of enabling BTI that we run into. We are going to have > > similar issues with other features like MTE so we need to make sure that > > whatever we're doing works with them too. > > > > Also updated to Will's current e-mail address - Will, do you have > > thoughts on what we should do here? > > Changing the kernel to map the main executable with PROT_BTI by default is a > user-visible change in behaviour and not without risk, so if we're going to > do that then it needs to be opt-in because the current behaviour has been > there since 5.8. I suppose we could shoe-horn in a cmdline option for 5.10 > (which will be the first LTS with BTI) but it would be better to put up with > the current ABI if possible. it's not clear to me how adding PROT_BTI in the kernel would be observable in practice.
adding PROT_BTI to marked elf modules should only have effect in cases when the process does invalid operations and then there is no compat requirement. if this is not the case then adding PROT_BTI on static exe is already problematic. if there is some issue with bti that makes users want to turn it off, then they should do it system-wide or maybe we can have a no-bti option in userspace which uses mprotect to turn it off (but since that has to be an explicit opt-out i don't mind if the user also has to disable the seccomp sandbox). > Is there real value in this seccomp filter if it only looks at mprotect(), > or was it just implemented because it's easy to do and sounds like a good > idea? i'm fine with just using mprotect and telling users to remove the seccomp filter. but that makes bti less attractive for deployment.