Message ID | 20240214122845.2033971-87-ardb+git@google.com
---|---
State | New, archived
Series | arm64: Add support for LPA2 and WXN at stage 1
On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
>
> Add a hook to permit architectures to perform validation on the prot
> flags passed to mmap(), like arch_validate_prot() does for mprotect().
> This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
> configurations that run with WXN enabled.
>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  include/linux/mman.h | 15 +++++++++++++++
>  mm/mmap.c            |  3 +++
>  2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index dc7048824be8..ec5e7f606e43 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -124,6 +124,21 @@ static inline bool arch_validate_flags(unsigned long flags)
>  #define arch_validate_flags arch_validate_flags
>  #endif
>
> +#ifndef arch_validate_mmap_prot
> +/*
> + * This is called from mmap(), which ignores unknown prot bits so the default
> + * is to accept anything.
> + *
> + * Returns true if the prot flags are valid
> + */
> +static inline bool arch_validate_mmap_prot(unsigned long prot,
> +					   unsigned long addr)
> +{
> +	return true;
> +}
> +#define arch_validate_mmap_prot arch_validate_mmap_prot
> +#endif
> +
>  /*
>   * Optimisation macro. It is equivalent to:
>   *   (x & bit1) ? bit2 : 0
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d89770eaab6b..977a8c3fd9f5 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		if (!(file && path_noexec(&file->f_path)))
>  			prot |= PROT_EXEC;
>
> +	if (!arch_validate_mmap_prot(prot, addr))
> +		return -EACCES;
> +
>  	/* force arch specific MAP_FIXED handling in get_unmapped_area */
>  	if (flags & MAP_FIXED_NOREPLACE)
>  		flags |= MAP_FIXED;

While writing the pull request for Linus (and looking to justify this
change), I realised that we already have arch_validate_flags() that can
do a similar check but on VM_* flags instead of PROT_* (we use it for
VM_MTE checks). What was the reason for adding a new hook? The only
difference is that here it returns -EACCES while on
arch_validate_flags() failure it would return -EINVAL. It probably
makes more sense to return -EACCES as it matches map_deny_write_exec(),
but with your patches we are inconsistent between mmap() and mprotect()
errors (the latter is still -EINVAL).

It also got me thinking on whether we could use this as a hardened
version of the MDWE feature instead of a CPU feature (though we'd end
up context-switching this SCTLR_EL1 bit). I think your patches have
been around since before the MDWE feature was added to the kernel.

Sorry for not catching this early.
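For illustration, the alternative Catalin raises (doing the check in the existing arch_validate_flags() hook on the arm64 side) could look roughly like the sketch below. This is a sketch only: arm64_wxn_enabled() is a hypothetical stand-in for however the series detects that SCTLR_EL1.WXN is set, and the VM_MTE validation the real arm64 arch_validate_flags() performs is elided.

```c
/* Sketch: arch/arm64/include/asm/mman.h reusing the existing hook.
 * arm64_wxn_enabled() is hypothetical; the VM_MTE checks done by the
 * real arm64 implementation are omitted for brevity. */
static inline bool arch_validate_flags(unsigned long vm_flags)
{
	/* Under WXN, writable implies non-executable, so refuse W+X. */
	if (arm64_wxn_enabled() &&
	    (vm_flags & (VM_WRITE | VM_EXEC)) == (VM_WRITE | VM_EXEC))
		return false;

	return true;
}
#define arch_validate_flags arch_validate_flags
```

As Catalin notes, one observable difference is that a failure in this hook surfaces from mmap() as -EINVAL rather than -EACCES.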
On Tue, 12 Mar 2024 at 20:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Add a hook to permit architectures to perform validation on the prot
> > flags passed to mmap(), like arch_validate_prot() does for mprotect().
> > This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
> > configurations that run with WXN enabled.
> >
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  include/linux/mman.h | 15 +++++++++++++++
> >  mm/mmap.c            |  3 +++
> >  2 files changed, 18 insertions(+)
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index dc7048824be8..ec5e7f606e43 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -124,6 +124,21 @@ static inline bool arch_validate_flags(unsigned long flags)
> >  #define arch_validate_flags arch_validate_flags
> >  #endif
> >
> > +#ifndef arch_validate_mmap_prot
> > +/*
> > + * This is called from mmap(), which ignores unknown prot bits so the default
> > + * is to accept anything.
> > + *
> > + * Returns true if the prot flags are valid
> > + */
> > +static inline bool arch_validate_mmap_prot(unsigned long prot,
> > +					   unsigned long addr)
> > +{
> > +	return true;
> > +}
> > +#define arch_validate_mmap_prot arch_validate_mmap_prot
> > +#endif
> > +
> >  /*
> >   * Optimisation macro. It is equivalent to:
> >   *   (x & bit1) ? bit2 : 0
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d89770eaab6b..977a8c3fd9f5 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> >  		if (!(file && path_noexec(&file->f_path)))
> >  			prot |= PROT_EXEC;
> >
> > +	if (!arch_validate_mmap_prot(prot, addr))
> > +		return -EACCES;
> > +
> >  	/* force arch specific MAP_FIXED handling in get_unmapped_area */
> >  	if (flags & MAP_FIXED_NOREPLACE)
> >  		flags |= MAP_FIXED;
>
> While writing the pull request for Linus (and looking to justify this
> change), I realised that we already have arch_validate_flags() that can
> do a similar check but on VM_* flags instead of PROT_* (we use it for
> VM_MTE checks). What was the reason for adding a new hook?

I did not consider arch_validate_flags() at all because I was looking
at ways to validate PROT_ flags, not VM_ flags. But since
PROT_WRITE+PROT_EXEC implies VM_WRITE+VM_EXEC, I don't see why we
couldn't do the check there instead. That way, we can drop this patch
entirely afaict.

> The only difference is that here it returns -EACCES while on
> arch_validate_flags() failure it would return -EINVAL. It probably
> makes more sense to return -EACCES as it matches map_deny_write_exec(),
> but with your patches we are inconsistent between mmap() and mprotect()
> errors (the latter is still -EINVAL).

Yes. Looking at it now, it would be better to add a single arch hook
to map_deny_write_exec(), and use that instead.

> It also got me thinking on whether we could use this as a hardened
> version of the MDWE feature instead of a CPU feature (though we'd end
> up context-switching this SCTLR_EL1 bit). I think your patches have
> been around since before the MDWE feature was added to the kernel.

Indeed.

MDWE looks like a good match in principle, but combining the two seems
tricky:
- EL1 is going to flip between WXN protection on and off depending on
  which setting it is using for EL0;
- context switching SCTLR_EL1.WXN may become costly in terms of TLB
  maintenance, unless we cheat and ignore the kernel mappings (which
  should work as expected regardless of the value of SCTLR_EL1.WXN).

If we can find a reasonable strategy to manage the TLB maintenance
that does not leave EL1 behind, I'm happy to explore this further. But
perhaps WXN should simply be treated as MDWE always-on for all
processes.

> Sorry for not catching this early.

No worries - it's more likely to be useful if we get this right.
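A rough sketch of the single-hook idea Ard floats here, based on map_deny_write_exec() as it existed at the time; arch_map_deny_write_exec() is a hypothetical arch predicate (defaulting to false for architectures without such a policy) and is not from the actual series:

```c
/* Sketch: include/linux/mman.h with one shared deny point for both
 * the MDWE prctl and an arch policy such as arm64 WXN.
 * arch_map_deny_write_exec() is hypothetical. */
static inline bool map_deny_write_exec(struct vm_area_struct *vma,
				       unsigned long vm_flags)
{
	/* Arch policy (e.g. WXN) refuses simultaneous write+exec. */
	if (arch_map_deny_write_exec(vm_flags))
		return true;

	if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
		return false;

	/* MDWE: no writable+executable mappings... */
	if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
		return true;

	/* ...and no gaining exec on a previously non-exec mapping. */
	if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
		return true;

	return false;
}
```

Both existing callers (the mmap() and mprotect() paths) already turn a true return into -EACCES, which would also resolve the errno inconsistency noted above.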
On Wed, Mar 13, 2024 at 12:23:24AM +0100, Ard Biesheuvel wrote:
> On Tue, 12 Mar 2024 at 20:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index d89770eaab6b..977a8c3fd9f5 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > >  		if (!(file && path_noexec(&file->f_path)))
> > >  			prot |= PROT_EXEC;
> > >
> > > +	if (!arch_validate_mmap_prot(prot, addr))
> > > +		return -EACCES;
> > > +
> > >  	/* force arch specific MAP_FIXED handling in get_unmapped_area */
> > >  	if (flags & MAP_FIXED_NOREPLACE)
> > >  		flags |= MAP_FIXED;
> >
> > While writing the pull request for Linus (and looking to justify this
> > change), I realised that we already have arch_validate_flags() that can
> > do a similar check but on VM_* flags instead of PROT_* (we use it for
> > VM_MTE checks). What was the reason for adding a new hook?
[...]
> > The only difference is that here it returns -EACCES while on
> > arch_validate_flags() failure it would return -EINVAL. It probably
> > makes more sense to return -EACCES as it matches map_deny_write_exec(),
> > but with your patches we are inconsistent between mmap() and mprotect()
> > errors (the latter is still -EINVAL).
>
> Yes. Looking at it now, it would be better to add a single arch hook
> to map_deny_write_exec(), and use that instead.

This would work and matches the API better. Another way to look at the
arm64 WXN feature is to avoid bothering with the checks at all, knowing
that the hardware enforces XN whenever a permission is W. Seen that
way, it is an architecture choice irrespective of the user
PROT_EXEC|PROT_WRITE permission. But it's still an ABI change, so I
guess it's better to be upfront with the user and reject such
mmap()/mprotect() permission combinations.

However, I've been looking through the specs and realised that
SCTLR_ELx.WXN is RES0 when permission indirection is enabled (FEAT_PIE
from the 2022 specs, hopefully you have access to them). And while WXN
apparently gets better in that it allows separate EL0/EL1 controls, it
seems to only apply when the base permission is RWX and the XN is
toggled based on the overlay permission (pkeys, which Joey is working
on). So it looks like what the architects had in mind is optimising
RW/RX switching via overlays (no syscalls) while keeping the base
permission RWX. The traditional WXN hardening via SCTLR_EL1
disappeared.

(adding Joey to the thread, he contributed the PIE support)

> > It also got me thinking on whether we could use this as a hardened
> > version of the MDWE feature instead of a CPU feature (though we'd end
> > up context-switching this SCTLR_EL1 bit). I think your patches have
> > been around since before the MDWE feature was added to the kernel.
>
> Indeed.
>
> MDWE looks like a good match in principle, but combining the two seems
> tricky:
> - EL1 is going to flip between WXN protection on and off depending on
>   which setting it is using for EL0;
> - context switching SCTLR_EL1.WXN may become costly in terms of TLB
>   maintenance, unless we cheat and ignore the kernel mappings (which
>   should work as expected regardless of the value of SCTLR_EL1.WXN).
>
> If we can find a reasonable strategy to manage the TLB maintenance
> that does not leave EL1 behind, I'm happy to explore this further. But
> perhaps WXN should simply be treated as MDWE always-on for all
> processes.

Ah, I did not realise this bit can be cached in the TLB. So this
doesn't really work (and the Arm ARM is also vague about whether it's
cached per ASID or not).

> > Sorry for not catching this early.
>
> No worries - it's more likely to be useful if we get this right.

Given that with PIE this feature no longer works, I'll revert these two
patches for now. We should revisit it in combination with PIE and POE
(overlays), but it does look like the semantics are slightly different
(unless I misread the specs). If we want a global MDWE=on on the
command line, that can be done generically.
On Wed, 13 Mar 2024 at 11:47, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Mar 13, 2024 at 12:23:24AM +0100, Ard Biesheuvel wrote:
> > On Tue, 12 Mar 2024 at 20:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index d89770eaab6b..977a8c3fd9f5 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >  		if (!(file && path_noexec(&file->f_path)))
> > > >  			prot |= PROT_EXEC;
> > > >
> > > > +	if (!arch_validate_mmap_prot(prot, addr))
> > > > +		return -EACCES;
> > > > +
> > > >  	/* force arch specific MAP_FIXED handling in get_unmapped_area */
> > > >  	if (flags & MAP_FIXED_NOREPLACE)
> > > >  		flags |= MAP_FIXED;
> > >
> > > While writing the pull request for Linus (and looking to justify this
> > > change), I realised that we already have arch_validate_flags() that can
> > > do a similar check but on VM_* flags instead of PROT_* (we use it for
> > > VM_MTE checks). What was the reason for adding a new hook?
> [...]
> > > The only difference is that here it returns -EACCES while on
> > > arch_validate_flags() failure it would return -EINVAL. It probably
> > > makes more sense to return -EACCES as it matches map_deny_write_exec(),
> > > but with your patches we are inconsistent between mmap() and mprotect()
> > > errors (the latter is still -EINVAL).
> >
> > Yes. Looking at it now, it would be better to add a single arch hook
> > to map_deny_write_exec(), and use that instead.
>
> This would work and matches the API better. Another way to look at the
> arm64 WXN feature is to avoid bothering with the checks at all, knowing
> that the hardware enforces XN whenever a permission is W. Seen that
> way, it is an architecture choice irrespective of the user
> PROT_EXEC|PROT_WRITE permission. But it's still an ABI change, so I
> guess it's better to be upfront with the user and reject such
> mmap()/mprotect() permission combinations.

Yes, that was the idea in the original patch.

> However, I've been looking through the specs and realised that
> SCTLR_ELx.WXN is RES0 when permission indirection is enabled (FEAT_PIE
> from the 2022 specs, hopefully you have access to them).

The latest public version of the ARM ARM does not cover FEAT_PIE at all.

> And while WXN apparently gets better in that it allows separate EL0/EL1
> controls, it seems to only apply when the base permission is RWX and
> the XN is toggled based on the overlay permission (pkeys, which Joey is
> working on). So it looks like what the architects had in mind is
> optimising RW/RX switching via overlays (no syscalls) while keeping the
> base permission RWX. The traditional WXN hardening via SCTLR_EL1
> disappeared.
>
> (adding Joey to the thread, he contributed the PIE support)

PIE sounds useful for implementing things like JITs in user space,
where you want a certain mapping to transition to RW while all other
CPUs retain RX access concurrently.

WXN is intended to be static, where a single bit sets the system-wide
policy for all kernel and user space code.

It's rather unfortunate that FEAT_PIE relies on RWX mappings and
therefore needs to deprecate WXN entirely. It would have been nice to
keep something like it for the kernel, which never needs RWX mappings
or that kind of RX/RW transition, and so overlays don't seem to be a
great fit there.

> > > It also got me thinking on whether we could use this as a hardened
> > > version of the MDWE feature instead of a CPU feature (though we'd
> > > end up context-switching this SCTLR_EL1 bit). I think your patches
> > > have been around since before the MDWE feature was added to the
> > > kernel.
> >
> > Indeed.
> >
> > MDWE looks like a good match in principle, but combining the two
> > seems tricky:
> > - EL1 is going to flip between WXN protection on and off depending on
> >   which setting it is using for EL0;
> > - context switching SCTLR_EL1.WXN may become costly in terms of TLB
> >   maintenance, unless we cheat and ignore the kernel mappings (which
> >   should work as expected regardless of the value of SCTLR_EL1.WXN).
> >
> > If we can find a reasonable strategy to manage the TLB maintenance
> > that does not leave EL1 behind, I'm happy to explore this further.
> > But perhaps WXN should simply be treated as MDWE always-on for all
> > processes.
>
> Ah, I did not realise this bit can be cached in the TLB. So this
> doesn't really work (and the Arm ARM is also vague about whether it's
> cached per ASID or not).
>
> > > Sorry for not catching this early.
> >
> > No worries - it's more likely to be useful if we get this right.
>
> Given that with PIE this feature no longer works, I'll revert these two
> patches for now. We should revisit it in combination with PIE and POE
> (overlays), but it does look like the semantics are slightly different
> (unless I misread the specs). If we want a global MDWE=on on the
> command line, that can be done generically.

I looked into this a bit more, and MDWE is a bit stricter than WXN, and
therefore less suitable for enabling system-wide. It disallows adding
executable permissions entirely, as well as adding write permissions to
a mapping that is already executable. WXN just disallows setting both
at the same time. So using the same hook makes sense, but combining the
logic beyond that seems problematic too. I'll code it up in any case to
see what it looks like.
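To make the difference Ard describes concrete, the two policies can be written as predicates over the kernel's VM_* flags. This is illustrative pseudocode, not anything from the series:

```c
/* Illustrative only: the two policies as predicates over VM_* flags.
 * 'oldf' is the mapping's current flags, 'newf' the requested ones. */
static bool wxn_denies(unsigned long newf)
{
	/* WXN: never writable and executable at the same time. */
	return (newf & VM_WRITE) && (newf & VM_EXEC);
}

static bool mdwe_denies(unsigned long oldf, unsigned long newf)
{
	/* MDWE: everything WXN denies, plus a previously non-executable
	 * mapping becoming executable. */
	return wxn_denies(newf) ||
	       (!(oldf & VM_EXEC) && (newf & VM_EXEC));
}
```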
On Wed, Mar 13, 2024 at 12:45:22PM +0100, Ard Biesheuvel wrote:
> On Wed, 13 Mar 2024 at 11:47, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > However, I've been looking through the specs and realised that
> > SCTLR_ELx.WXN is RES0 when permission indirection is enabled (FEAT_PIE
> > from the 2022 specs, hopefully you have access to them).
>
> The latest public version of the ARM ARM does not cover FEAT_PIE at all.

According to Mark R, the next version should be out soon. The xml
tarball for the 2022 extensions doesn't have any new text for the
SCTLR_ELx.WXN field either. I could only find it in the engineering
spec, which isn't public:

  When Stage1 Base Permissions uses the Indirect Permission Scheme,
  SCTLR_ELx.WXN has no effect and is RES0.

> > And while WXN apparently gets better in that it allows separate
> > EL0/EL1 controls, it seems to only apply when the base permission is
> > RWX and the XN is toggled based on the overlay permission (pkeys,
> > which Joey is working on). So it looks like what the architects had
> > in mind is optimising RW/RX switching via overlays (no syscalls)
> > while keeping the base permission RWX. The traditional WXN hardening
> > via SCTLR_EL1 disappeared.
> >
> > (adding Joey to the thread, he contributed the PIE support)
>
> PIE sounds useful for implementing things like JITs in user space,
> where you want a certain mapping to transition to RW while all other
> CPUs retain RX access concurrently.
>
> WXN is intended to be static, where a single bit sets the system-wide
> policy for all kernel and user space code.

I agree. I guess no-one used the current WXN and the architects decided
to deprecate it.

> It's rather unfortunate that FEAT_PIE relies on RWX mappings and
> therefore needs to deprecate WXN entirely. It would have been nice to
> keep something like it for the kernel, which never needs RWX mappings
> or that kind of RX/RW transition, and so overlays don't seem to be a
> great fit there.

Indeed. It looks like more of a risk to somehow use WXN in the kernel
in combination with overlays because of the RWX permission.

> I looked into this a bit more, and MDWE is a bit stricter than WXN,
> and therefore less suitable for enabling system-wide. It disallows
> adding executable permissions entirely, as well as adding write
> permissions to a mapping that is already executable. WXN just
> disallows setting both at the same time.

With MDWE, we tried to copy the semantics of the BPF variant. It allows
mmap(PROT_EXEC) but not mprotect(PROT_EXEC). But I agree, it's slightly
different from your proposed WXN.
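For readers unfamiliar with MDWE, the semantics Catalin describes can be observed from user space with the sketch below. It assumes a kernel with PR_SET_MDWE (v6.3 or later); error handling is trimmed for brevity.

```c
/* Demo of MDWE semantics from user space: fresh exec mappings are
 * allowed, W+X and gaining exec after the fact are refused. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_MDWE
#define PR_SET_MDWE			65
#define PR_MDWE_REFUSE_EXEC_GAIN	(1UL << 0)
#endif

int main(void)
{
	prctl(PR_SET_MDWE, PR_MDWE_REFUSE_EXEC_GAIN, 0, 0, 0);

	/* Allowed: executable from the start, never writable. */
	void *rx = mmap(NULL, 4096, PROT_READ | PROT_EXEC,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Denied with EACCES: writable and executable at once. */
	void *wx = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Denied with EACCES: a non-exec mapping may not gain exec. */
	void *rw = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int gain = mprotect(rw, 4096, PROT_READ | PROT_EXEC);

	printf("rx=%p wx=%p (MAP_FAILED expected) mprotect=%d\n",
	       rx, wx, gain);
	return 0;
}
```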
diff --git a/include/linux/mman.h b/include/linux/mman.h
index dc7048824be8..ec5e7f606e43 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -124,6 +124,21 @@ static inline bool arch_validate_flags(unsigned long flags)
 #define arch_validate_flags arch_validate_flags
 #endif
 
+#ifndef arch_validate_mmap_prot
+/*
+ * This is called from mmap(), which ignores unknown prot bits so the default
+ * is to accept anything.
+ *
+ * Returns true if the prot flags are valid
+ */
+static inline bool arch_validate_mmap_prot(unsigned long prot,
+					   unsigned long addr)
+{
+	return true;
+}
+#define arch_validate_mmap_prot arch_validate_mmap_prot
+#endif
+
 /*
  * Optimisation macro. It is equivalent to:
  *   (x & bit1) ? bit2 : 0
diff --git a/mm/mmap.c b/mm/mmap.c
index d89770eaab6b..977a8c3fd9f5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (!(file && path_noexec(&file->f_path)))
 			prot |= PROT_EXEC;
 
+	if (!arch_validate_mmap_prot(prot, addr))
+		return -EACCES;
+
 	/* force arch specific MAP_FIXED handling in get_unmapped_area */
 	if (flags & MAP_FIXED_NOREPLACE)
 		flags |= MAP_FIXED;
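For completeness, an architecture consuming this hook would define something along these lines. This is a minimal sketch rather than the actual arm64 patch from the series, with arm64_wxn_enabled() again standing in hypothetically for the WXN gate:

```c
/* Sketch: arch/arm64/include/asm/mman.h consuming the new hook.
 * arm64_wxn_enabled() is a hypothetical stand-in for however the
 * series determines that SCTLR_EL1.WXN is in force. */
#define arch_validate_mmap_prot arch_validate_mmap_prot
static inline bool arch_validate_mmap_prot(unsigned long prot,
					   unsigned long addr)
{
	/* With WXN set, a writable+executable mapping can never take
	 * effect as requested, so refuse it upfront. */
	if (arm64_wxn_enabled() &&
	    (prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC))
		return false;

	return true;
}
```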