
[v8,42/43] mm: add arch hook to validate mmap() prot flags

Message ID 20240214122845.2033971-87-ardb+git@google.com (mailing list archive)
State New, archived
Series arm64: Add support for LPA2 and WXN at stage 1

Commit Message

Ard Biesheuvel Feb. 14, 2024, 12:29 p.m. UTC
From: Ard Biesheuvel <ardb@kernel.org>

Add a hook to permit architectures to perform validation on the prot
flags passed to mmap(), like arch_validate_prot() does for mprotect().
This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
configurations that run with WXN enabled.

Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 include/linux/mman.h | 15 +++++++++++++++
 mm/mmap.c            |  3 +++
 2 files changed, 18 insertions(+)
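
For context, the arm64 side of the series (in a later patch) wires this hook
up to the WXN control. A minimal sketch of what that implementation looks
like; wxn_enabled() here is a stand-in for whatever predicate the arch patch
actually defines, not a real kernel symbol:

static inline bool arch_validate_mmap_prot(unsigned long prot,
					   unsigned long addr)
{
	/* with WXN active, refuse mappings that are writable and executable */
	if ((prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC))
		return !wxn_enabled();

	return true;
}
#define arch_validate_mmap_prot arch_validate_mmap_prot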

Comments

Catalin Marinas March 12, 2024, 7:53 p.m. UTC | #1
On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> Add a hook to permit architectures to perform validation on the prot
> flags passed to mmap(), like arch_validate_prot() does for mprotect().
> This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
> configurations that run with WXN enabled.
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  include/linux/mman.h | 15 +++++++++++++++
>  mm/mmap.c            |  3 +++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index dc7048824be8..ec5e7f606e43 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -124,6 +124,21 @@ static inline bool arch_validate_flags(unsigned long flags)
>  #define arch_validate_flags arch_validate_flags
>  #endif
>  
> +#ifndef arch_validate_mmap_prot
> +/*
> + * This is called from mmap(), which ignores unknown prot bits so the default
> + * is to accept anything.
> + *
> + * Returns true if the prot flags are valid
> + */
> +static inline bool arch_validate_mmap_prot(unsigned long prot,
> +					   unsigned long addr)
> +{
> +	return true;
> +}
> +#define arch_validate_mmap_prot arch_validate_mmap_prot
> +#endif
> +
>  /*
>   * Optimisation macro.  It is equivalent to:
>   *      (x & bit1) ? bit2 : 0
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d89770eaab6b..977a8c3fd9f5 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  		if (!(file && path_noexec(&file->f_path)))
>  			prot |= PROT_EXEC;
>  
> +	if (!arch_validate_mmap_prot(prot, addr))
> +		return -EACCES;
> +
>  	/* force arch specific MAP_FIXED handling in get_unmapped_area */
>  	if (flags & MAP_FIXED_NOREPLACE)
>  		flags |= MAP_FIXED;

While writing the pull request for Linus (and looking to justify this
change), I realised that we already have arch_validate_flags() that can
do a similar check but on VM_* flags instead of PROT_* (we use it for
VM_MTE checks). What was the reason for adding a new hook? The only
difference is that here it returns -EACCES while on
arch_validate_flags() failure it would return -EINVAL. It probably makes
more sense to return -EACCES as it matches map_deny_write_exec(), but with
your patches we are inconsistent between mmap() and mprotect() errors
(the latter is still -EINVAL).

It also got me thinking about whether we could use this as a hardened
version of the MDWE feature instead of a CPU feature (though we'd end up
context-switching this SCTLR_EL1 bit). I think your patches have been
around since before the MDWE feature was added to the kernel.

Sorry for not catching this early.
Ard Biesheuvel March 12, 2024, 11:23 p.m. UTC | #2
On Tue, 12 Mar 2024 at 20:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > Add a hook to permit architectures to perform validation on the prot
> > flags passed to mmap(), like arch_validate_prot() does for mprotect().
> > This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
> > configurations that run with WXN enabled.
> >
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  include/linux/mman.h | 15 +++++++++++++++
> >  mm/mmap.c            |  3 +++
> >  2 files changed, 18 insertions(+)
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index dc7048824be8..ec5e7f606e43 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -124,6 +124,21 @@ static inline bool arch_validate_flags(unsigned long flags)
> >  #define arch_validate_flags arch_validate_flags
> >  #endif
> >
> > +#ifndef arch_validate_mmap_prot
> > +/*
> > + * This is called from mmap(), which ignores unknown prot bits so the default
> > + * is to accept anything.
> > + *
> > + * Returns true if the prot flags are valid
> > + */
> > +static inline bool arch_validate_mmap_prot(unsigned long prot,
> > +                                        unsigned long addr)
> > +{
> > +     return true;
> > +}
> > +#define arch_validate_mmap_prot arch_validate_mmap_prot
> > +#endif
> > +
> >  /*
> >   * Optimisation macro.  It is equivalent to:
> >   *      (x & bit1) ? bit2 : 0
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index d89770eaab6b..977a8c3fd9f5 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> >               if (!(file && path_noexec(&file->f_path)))
> >                       prot |= PROT_EXEC;
> >
> > +     if (!arch_validate_mmap_prot(prot, addr))
> > +             return -EACCES;
> > +
> >       /* force arch specific MAP_FIXED handling in get_unmapped_area */
> >       if (flags & MAP_FIXED_NOREPLACE)
> >               flags |= MAP_FIXED;
>
> While writing the pull request for Linus (and looking to justify this
> change), I realised that we already have arch_validate_flags() that can
> do a similar check but on VM_* flags instead of PROT_* (we use it for
> VM_MTE checks). What was the reason for adding a new hook?

I did not consider arch_validate_flags() at all because I was looking
at ways to validate PROT_ flags not VM_ flags. But if
PROT_WRITE+PROT_EXEC implies VM_WRITE+VM_EXEC, I don't see why we
wouldn't be able to change this. That way, we can drop this patch
entirely afaict.
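
A rough sketch of folding the WXN policy into the existing hook could look
like this (wxn_enabled() is again a stand-in name, and arm64's existing
VM_MTE checks in arch_validate_flags() are elided):

static inline bool arch_validate_flags(unsigned long vm_flags)
{
	/* PROT_WRITE|PROT_EXEC from the caller shows up here as VM_WRITE|VM_EXEC */
	if (wxn_enabled() &&
	    (vm_flags & (VM_WRITE | VM_EXEC)) == (VM_WRITE | VM_EXEC))
		return false;

	return true;
}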

> The only
> difference is that here it returns -EACCES while on
> arch_validate_flags() failure it would return -EINVAL. It probably makes
> more sense to return -EACCES as it matches map_deny_write_exec(), but with
> your patches we are inconsistent between mmap() and mprotect() errors
> (the latter is still -EINVAL).
>

Yes. Looking at it now, it would be better to add a single arch hook
to map_deny_write_exec(), and use that instead.
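
A rough sketch of that shape, using the current helper as a base (the
arch_mandates_wxn() hook name is made up for illustration):

static inline bool map_deny_write_exec(struct vm_area_struct *vma,
				       unsigned long vm_flags)
{
	/* arch-enforced W^X (e.g. arm64 WXN) applies to every process */
	if (arch_mandates_wxn() &&
	    (vm_flags & VM_WRITE) && (vm_flags & VM_EXEC))
		return true;

	/* existing MDWE logic, gated on the prctl() flag */
	if (!test_bit(MMF_HAS_MDWE, &current->mm->flags))
		return false;

	if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
		return true;

	if (!(vma->vm_flags & VM_EXEC) && (vm_flags & VM_EXEC))
		return true;

	return false;
}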

> It also got me thinking about whether we could use this as a hardened
> version of the MDWE feature instead of a CPU feature (though we'd end up
> context-switching this SCTLR_EL1 bit). I think your patches have been
> around since before the MDWE feature was added to the kernel.
>

Indeed.

MDWE looks like a good match in principle, but combining the two seems tricky:
- EL1 is going to flip between WXN protection on and off depending on
which setting it is using for EL0;
- context switching SCTLR_EL1.WXN may become costly in terms of TLB
maintenance, unless we cheat and ignore the kernel mappings (which
should work as expected regardless of the value of SCTLR_EL1.WXN);

If we can find a reasonable strategy to manage the TLB maintenance
that does not leave EL1 behind, I'm happy to explore this further. But
perhaps WXN should simply be treated as MDWE always-on for all
processes.

> Sorry for not catching this early.
>

No worries - it's more likely to be useful if we get this right.
Catalin Marinas March 13, 2024, 10:47 a.m. UTC | #3
On Wed, Mar 13, 2024 at 12:23:24AM +0100, Ard Biesheuvel wrote:
> On Tue, 12 Mar 2024 at 20:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index d89770eaab6b..977a8c3fd9f5 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > >               if (!(file && path_noexec(&file->f_path)))
> > >                       prot |= PROT_EXEC;
> > >
> > > +     if (!arch_validate_mmap_prot(prot, addr))
> > > +             return -EACCES;
> > > +
> > >       /* force arch specific MAP_FIXED handling in get_unmapped_area */
> > >       if (flags & MAP_FIXED_NOREPLACE)
> > >               flags |= MAP_FIXED;
> >
> > While writing the pull request for Linus (and looking to justify this
> > change), I realised that we already have arch_validate_flags() that can
> > do a similar check but on VM_* flags instead of PROT_* (we use it for
> > VM_MTE checks). What was the reason for adding a new hook?
[...]
> > The only
> > difference is that here it returns -EACCES while on
> > arch_validate_flags() failure it would return -EINVAL. It probably makes
> > more sense to return -EACCES as it matches map_deny_write_exec(), but with
> > your patches we are inconsistent between mmap() and mprotect() errors
> > (the latter is still -EINVAL).
> 
> Yes. Looking at it now, it would be better to add a single arch hook
> to map_deny_write_exec(), and use that instead.

This would work and matches the API better. Another way to look at the
arm64 WXN feature is to avoid bothering with the checks, knowing
that the hardware enforces XN when a mapping is writable. So it can be seen
as a choice irrespective of the user PROT_EXEC|PROT_WRITE permission.
But it's still an ABI change, so I guess better to be upfront with the
user and reject such mmap()/mprotect() permission combinations.

However, I've been looking through specs and realised that SCTLR_ELx.WXN
is RES0 when the permission indirection is enabled (FEAT_PIE from the
2022 specs, hopefully you have access to it). And while apparently WXN
gets better as it allows separate EL0/EL1 controls, it seems to only
apply when the base permission is RWX and the XN is toggled based on the
overlay permission (pkeys which Joey is working on). So it looks like
what the architects had in mind is optimising RW/RX switching via
overlays (no syscalls) but keeping the base permission RWX. The
traditional WXN hardening via SCTLR_EL1 disappeared.

(adding Joey to the thread, he contributed the PIE support)

> > It also got me thinking about whether we could use this as a hardened
> > version of the MDWE feature instead of a CPU feature (though we'd end up
> > context-switching this SCTLR_EL1 bit). I think your patches have been
> > around since before the MDWE feature was added to the kernel.
> 
> Indeed.
> 
> MDWE looks like a good match in principle, but combining the two seems tricky:
> - EL1 is going to flip between WXN protection on and off depending on
> which setting it is using for EL0;
> - context switching SCTLR_EL1.WXN may become costly in terms of TLB
> maintenance, unless we cheat and ignore the kernel mappings (which
> should work as expected regardless of the value of SCTLR_EL1.WXN);
> 
> If we can find a reasonable strategy to manage the TLB maintenance
> that does not leave EL1 behind, I'm happy to explore this further. But
> perhaps WXN should simply be treated as MDWE always-on for all
> processes.

Ah, I did not realise this bit can be cached in the TLB. So this doesn't
really work (and the Arm ARM is also vague about whether it's cached per
ASID or not).

> > Sorry for not catching this early.
> 
> No worries - it's more likely to be useful if we get this right.

Given that with PIE this feature no longer works, I'll revert these two
patches for now. We should revisit it in combination with PIE and POE
(overlays) but it does look like the semantics are slightly different
(unless I misread the specs). If we want a global MDWE=on on the command
line, this can be generic.
Ard Biesheuvel March 13, 2024, 11:45 a.m. UTC | #4
On Wed, 13 Mar 2024 at 11:47, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Mar 13, 2024 at 12:23:24AM +0100, Ard Biesheuvel wrote:
> > On Tue, 12 Mar 2024 at 20:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > > On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index d89770eaab6b..977a8c3fd9f5 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >               if (!(file && path_noexec(&file->f_path)))
> > > >                       prot |= PROT_EXEC;
> > > >
> > > > +     if (!arch_validate_mmap_prot(prot, addr))
> > > > +             return -EACCES;
> > > > +
> > > >       /* force arch specific MAP_FIXED handling in get_unmapped_area */
> > > >       if (flags & MAP_FIXED_NOREPLACE)
> > > >               flags |= MAP_FIXED;
> > >
> > > While writing the pull request for Linus (and looking to justify this
> > > change), I realised that we already have arch_validate_flags() that can
> > > do a similar check but on VM_* flags instead of PROT_* (we use it for
> > > VM_MTE checks). What was the reason for adding a new hook?
> [...]
> > > The only
> > > difference is that here it returns -EACCES while on
> > > arch_validate_flags() failure it would return -EINVAL. It probably makes
> > > more sense to return -EACCES as it matches map_deny_write_exec(), but with
> > > your patches we are inconsistent between mmap() and mprotect() errors
> > > (the latter is still -EINVAL).
> >
> > Yes. Looking at it now, it would be better to add a single arch hook
> > to map_deny_write_exec(), and use that instead.
>
> This would work and matches the API better. Another way to look at the
> arm64 WXN feature is to avoid bothering with the checks, knowing
> that the hardware enforces XN when a mapping is writable. So it can be seen
> as a choice irrespective of the user PROT_EXEC|PROT_WRITE permission.
> But it's still an ABI change, so I guess better to be upfront with the
> user and reject such mmap()/mprotect() permission combinations.
>

Yes, that was the idea in the original patch.

> However, I've been looking through specs and realised that SCTLR_ELx.WXN
> is RES0 when the permission indirection is enabled (FEAT_PIE from the
> 2022 specs, hopefully you have access to it).

The latest public version of the ARM ARM does not cover FEAT_PIE at all.

> And while apparently WXN
> gets better as it allows separate EL0/EL1 controls, it seems to only
> apply when the base permission is RWX and the XN is toggled based on the
> overlay permission (pkeys which Joey is working on). So it looks like
> what the architects had in mind is optimising RW/RX switching via
> overlays (no syscalls) but keeping the base permission RWX. The
> traditional WXN hardening via SCTLR_EL1 disappeared.
>
> (adding Joey to the thread, he contributed the PIE support)
>

PIE sounds useful to implement things like JITs in user space, where
you want a certain mapping to transition to RW while all other CPUs
retain RX access concurrently.

WXN is intended to be static, where a single bit sets the system-wide
policy for all kernel and user space code.

It's rather unfortunate that FEAT_PIE relies on RWX mappings and
therefore needs to deprecate WXN entirely. It would have been nice to
have something like this for the kernel, which never has a need for
RWX mappings or transitioning mappings between RX and RW like that,
and so overlays don't seem to be a great fit.

> > > It also got me thinking about whether we could use this as a hardened
> > > version of the MDWE feature instead of a CPU feature (though we'd end up
> > > context-switching this SCTLR_EL1 bit). I think your patches have been
> > > around since before the MDWE feature was added to the kernel.
> >
> > Indeed.
> >
> > MDWE looks like a good match in principle, but combining the two seems tricky:
> > - EL1 is going to flip between WXN protection on and off depending on
> > which setting it is using for EL0;
> > - context switching SCTLR_EL1.WXN may become costly in terms of TLB
> > maintenance, unless we cheat and ignore the kernel mappings (which
> > should work as expected regardless of the value of SCTLR_EL1.WXN);
> >
> > If we can find a reasonable strategy to manage the TLB maintenance
> > that does not leave EL1 behind, I'm happy to explore this further. But
> > perhaps WXN should simply be treated as MDWE always-on for all
> > processes.
>
> Ah, I did not realise this bit can be cached in the TLB. So this doesn't
> really work (and the Arm ARM is also vague about whether it's cached per
> ASID or not).
>
> > > Sorry for not catching this early.
> >
> > No worries - it's more likely to be useful if we get this right.
>
> Given that with PIE this feature no longer works, I'll revert these two
> patches for now. We should revisit it in combination with PIE and POE
> (overlays) but it does look like the semantics are slightly different
> (unless I misread the specs). If we want a global MDWE=on on the command
> line, this can be generic.
>

I looked into this a bit more, and MDWE is a bit stricter than WXN,
and therefore less suitable for enabling system-wide. It disallows
adding executable permissions entirely, as well as adding write
permissions to a mapping that is already executable. WXN just
disallows setting both at the same time.
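
Spelled out as checks (illustrative only; old and new are the VM_ flags
before and after the requested change):

static bool deny_mdwe(unsigned long old, unsigned long new)
{
	if ((new & VM_WRITE) && (new & VM_EXEC))
		return true;	/* W+X requested in one step */
	if (!(old & VM_EXEC) && (new & VM_EXEC))
		return true;	/* adding exec to a non-exec mapping */
	return false;
}

static bool deny_wxn(unsigned long old, unsigned long new)
{
	/* only the simultaneous W+X combination is refused */
	return (new & VM_WRITE) && (new & VM_EXEC);
}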

So using the same hook makes sense, but combining the logic beyond
that seems problematic too.

I'll code it up in any case to see what it looks like.
Catalin Marinas March 13, 2024, 3:31 p.m. UTC | #5
On Wed, Mar 13, 2024 at 12:45:22PM +0100, Ard Biesheuvel wrote:
> On Wed, 13 Mar 2024 at 11:47, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > However, I've been looking through specs and realised that SCTLR_ELx.WXN
> > is RES0 when the permission indirection is enabled (FEAT_PIE from the
> > 2022 specs, hopefully you have access to it).
> 
> The latest public version of the ARM ARM does not cover FEAT_PIE at all.

According to Mark R, the next version should be out soon. The xml
tarball for the 2022 extensions doesn't have any new text for the
SCTLR_ELx.WXN field either. I could only find it in the engineering spec
which isn't public:

  When Stage1 Base Permissions uses the Indirect Permission Scheme,
  SCTLR_ELx.WXN has no effect and is RES0.

> > And while apparently WXN
> > gets better as it allows separate EL0/EL1 controls, it seems to only
> > apply when the base permission is RWX and the XN is toggled based on the
> > overlay permission (pkeys which Joey is working on). So it looks like
> > what the architects had in mind is optimising RW/RX switching via
> > overlays (no syscalls) but keeping the base permission RWX. The
> > traditional WXN hardening via SCTLR_EL1 disappeared.
> >
> > (adding Joey to the thread, he contributed the PIE support)
> 
> PIE sounds useful to implement things like JITs in user space, where
> you want a certain mapping to transition to RW while all other CPUs
> retain RX access concurrently.
> 
> WXN is intended to be static, where a single bit sets the system-wide
> policy for all kernel and user space code.

I agree. I guess no-one used the current WXN and the architects decided
to deprecate it.

> It's rather unfortunate that FEAT_PIE relies on RWX mappings and
> therefore needs to deprecate WXN entirely. It would have been nice to
> have something like this for the kernel, which never has a need for
> RWX mappings or transitioning mappings between RX and RW like that,
> and so overlays don't seem to be a great fit.

Indeed. It looks like more of a risk to somehow use WXN in the kernel in
combination with overlays because of the RWX permission.

> I looked into this a bit more, and MDWE is a bit stricter than WXN,
> and therefore less suitable for enabling system-wide. It disallows
> adding executable permissions entirely, as well as adding write
> permissions to a mapping that is already executable. WXN just
> disallows setting both at the same time.

With MDWE, we tried to copy the semantics of the BPF variant. It allows
mmap(PROT_EXEC) but not mprotect(PROT_EXEC). But I agree, it's slightly
different from your proposed WXN.

Patch

diff --git a/include/linux/mman.h b/include/linux/mman.h
index dc7048824be8..ec5e7f606e43 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -124,6 +124,21 @@ static inline bool arch_validate_flags(unsigned long flags)
 #define arch_validate_flags arch_validate_flags
 #endif
 
+#ifndef arch_validate_mmap_prot
+/*
+ * This is called from mmap(), which ignores unknown prot bits so the default
+ * is to accept anything.
+ *
+ * Returns true if the prot flags are valid
+ */
+static inline bool arch_validate_mmap_prot(unsigned long prot,
+					   unsigned long addr)
+{
+	return true;
+}
+#define arch_validate_mmap_prot arch_validate_mmap_prot
+#endif
+
 /*
  * Optimisation macro.  It is equivalent to:
  *      (x & bit1) ? bit2 : 0
diff --git a/mm/mmap.c b/mm/mmap.c
index d89770eaab6b..977a8c3fd9f5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		if (!(file && path_noexec(&file->f_path)))
 			prot |= PROT_EXEC;
 
+	if (!arch_validate_mmap_prot(prot, addr))
+		return -EACCES;
+
 	/* force arch specific MAP_FIXED handling in get_unmapped_area */
 	if (flags & MAP_FIXED_NOREPLACE)
 		flags |= MAP_FIXED;
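
For context, a user space snippet of the kind of request this hook ends up
rejecting on a WXN-enabled configuration (illustrative; the errno matches
the -EACCES return above):

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* writable+executable anonymous mapping: denied when WXN is active */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED && errno == EACCES)
		printf("W+X mapping rejected with EACCES (WXN active)\n");
	else if (p != MAP_FAILED)
		printf("W+X mapping allowed (WXN not active)\n");

	return 0;
}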