
[1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

Message ID 20201007073932.865218-1-jannh@google.com (mailing list archive)
State New, archived
Series [1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

Commit Message

Jann Horn Oct. 7, 2020, 7:39 a.m. UTC
arch_validate_prot() is a hook that can validate whether a given set of
protection flags is valid in an mprotect() operation. It is given the set
of protection flags and the address being modified.

However, the address being modified can currently not actually be used in
a meaningful way because:

1. Only the address is given, but not the length, and the operation can
   span multiple VMAs. Therefore, the callee can't actually tell which
   virtual address range, or which VMAs, are being targeted.
2. The mmap_lock is not held, meaning that if the callee were to check
   the VMA at @addr, that VMA would be unrelated to the one the
   operation is performed on.

Currently, custom arch_validate_prot() handlers are defined by
arm64, powerpc and sparc.
arm64 and powerpc don't care about the address range, they just check the
flags against CPU support masks.
sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
the mmap_lock.

Change the function signature to also take a length, and move the
arch_validate_prot() call in mm/mprotect.c down into the locked region.

Cc: stable@vger.kernel.org
Fixes: 9035cf9a97e4 ("mm: Add address parameter to arch_validate_prot()")
Suggested-by: Khalid Aziz <khalid.aziz@oracle.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jann Horn <jannh@google.com>
---
 arch/arm64/include/asm/mman.h   | 4 ++--
 arch/powerpc/include/asm/mman.h | 3 ++-
 arch/powerpc/kernel/syscalls.c  | 2 +-
 arch/sparc/include/asm/mman.h   | 6 ++++--
 include/linux/mman.h            | 3 ++-
 mm/mprotect.c                   | 6 ++++--
 6 files changed, 15 insertions(+), 9 deletions(-)


base-commit: c85fb28b6f999db9928b841f63f1beeb3074eeca

Comments

Christoph Hellwig Oct. 7, 2020, 12:35 p.m. UTC | #1
On Wed, Oct 07, 2020 at 09:39:31AM +0200, Jann Horn wrote:
> diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
> index 078608ec2e92..b1fabb97d138 100644
> --- a/arch/powerpc/kernel/syscalls.c
> +++ b/arch/powerpc/kernel/syscalls.c
> @@ -43,7 +43,7 @@ static inline long do_mmap2(unsigned long addr, size_t len,
>  {
>  	long ret = -EINVAL;
>  
> -	if (!arch_validate_prot(prot, addr))
> +	if (!arch_validate_prot(prot, addr, len))

This call isn't under mmap lock.  I also find it rather weird as the
generic code only calls arch_validate_prot from mprotect, only powerpc
also calls it from mmap.

This seems to go back to commit ef3d3246a0d0
("powerpc/mm: Add Strong Access Ordering support")
Jann Horn Oct. 7, 2020, 2:42 p.m. UTC | #2
On Wed, Oct 7, 2020 at 2:35 PM Christoph Hellwig <hch@infradead.org> wrote:
> On Wed, Oct 07, 2020 at 09:39:31AM +0200, Jann Horn wrote:
> > diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
> > index 078608ec2e92..b1fabb97d138 100644
> > --- a/arch/powerpc/kernel/syscalls.c
> > +++ b/arch/powerpc/kernel/syscalls.c
> > @@ -43,7 +43,7 @@ static inline long do_mmap2(unsigned long addr, size_t len,
> >  {
> >       long ret = -EINVAL;
> >
> > -     if (!arch_validate_prot(prot, addr))
> > +     if (!arch_validate_prot(prot, addr, len))
>
> This call isn't under mmap lock.  I also find it rather weird as the
> generic code only calls arch_validate_prot from mprotect, only powerpc
> also calls it from mmap.
>
> This seems to go back to commit ef3d3246a0d0
> ("powerpc/mm: Add Strong Access Ordering support")

I'm _guessing_ the idea in the generic case might be that mmap()
doesn't check unknown bits in the protection flags, and therefore
maybe people wanted to avoid adding new error cases that could be
caused by random high bits being set? So while the mprotect() case
checks the flags and refuses unknown values, the mmap() code just lets
the architecture figure out which bits are actually valid to set (via
arch_calc_vm_prot_bits()) and silently ignores the rest?

And powerpc apparently decided that they do want to error out on bogus
prot values passed to their version of mmap(), and in exchange, assume
in arch_calc_vm_prot_bits() that the protection bits are valid?

powerpc's arch_validate_prot() doesn't actually need the mmap lock, so
I think this is fine-ish for now (as in, while the code is a bit
unclean, I don't think I'm making it worse, and I don't think it's
actually buggy). In theory, we could move the arch_validate_prot()
call over into the mmap guts, where we're holding the lock, and gate
it on the architecture or on some feature CONFIG that powerpc can
activate in its Kconfig. But I'm not sure whether that'd be helping or
making things worse, so when I sent this patch, I deliberately left
the powerpc stuff as-is.
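
For reference, the asymmetry described above looks roughly like this in
the 5.9-era sources (paraphrased from include/linux/mman.h, not a
verbatim copy): the mmap() path only translates the PROT_* bits it knows
about into VM_* flags and silently drops the rest, while the generic
arch_validate_prot() fallback used by mprotect() rejects anything
outside a fixed mask:

/* mmap() path: unknown prot bits are never translated into VM_* flags,
 * so they are silently ignored */
static inline unsigned long
calc_vm_prot_bits(unsigned long prot, unsigned long pkey)
{
	return _calc_vm_trans(prot, PROT_READ,  VM_READ) |
	       _calc_vm_trans(prot, PROT_WRITE, VM_WRITE) |
	       _calc_vm_trans(prot, PROT_EXEC,  VM_EXEC) |
	       arch_calc_vm_prot_bits(prot, pkey);
}

/* mprotect() path: the generic arch_validate_prot() fallback rejects
 * unknown bits outright (this is the pre-patch signature) */
static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
{
	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
}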
Khalid Aziz Oct. 7, 2020, 8:14 p.m. UTC | #3
On 10/7/20 1:39 AM, Jann Horn wrote:
> arch_validate_prot() is a hook that can validate whether a given set of
> protection flags is valid in an mprotect() operation. It is given the set
> of protection flags and the address being modified.
> 
> However, the address being modified can currently not actually be used in
> a meaningful way because:
> 
> 1. Only the address is given, but not the length, and the operation can
>    span multiple VMAs. Therefore, the callee can't actually tell which
>    virtual address range, or which VMAs, are being targeted.
> 2. The mmap_lock is not held, meaning that if the callee were to check
>    the VMA at @addr, that VMA would be unrelated to the one the
>    operation is performed on.
> 
> Currently, custom arch_validate_prot() handlers are defined by
> arm64, powerpc and sparc.
> arm64 and powerpc don't care about the address range, they just check the
> flags against CPU support masks.
> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> the mmap_lock.
> 
> Change the function signature to also take a length, and move the
> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> 
> Cc: stable@vger.kernel.org
> Fixes: 9035cf9a97e4 ("mm: Add address parameter to arch_validate_prot()")
> Suggested-by: Khalid Aziz <khalid.aziz@oracle.com>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>  arch/arm64/include/asm/mman.h   | 4 ++--
>  arch/powerpc/include/asm/mman.h | 3 ++-
>  arch/powerpc/kernel/syscalls.c  | 2 +-
>  arch/sparc/include/asm/mman.h   | 6 ++++--
>  include/linux/mman.h            | 3 ++-
>  mm/mprotect.c                   | 6 ++++--
>  6 files changed, 15 insertions(+), 9 deletions(-)


This looks good to me.

As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
is made without holding mmap_lock. The lock is not acquired until
vm_mmap_pgoff(). This variance is uncomfortable, but I am more
uncomfortable forcing all implementations of validate_prot to require
mmap_lock to be held when non-sparc implementations do not have such a
need yet. Since do_mmap2() is powerpc-specific code, this patch solves a
current problem for now. That leaves open the question of whether
generic mmap should call arch_validate_prot() and return EINVAL for an
invalid combination of protection bits, but that is better addressed in
a separate patch.

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>

> 
> diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> index 081ec8de9ea6..0876a87986dc 100644
> --- a/arch/arm64/include/asm/mman.h
> +++ b/arch/arm64/include/asm/mman.h
> @@ -23,7 +23,7 @@ static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
>  #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
>  
>  static inline bool arch_validate_prot(unsigned long prot,
> -	unsigned long addr __always_unused)
> +	unsigned long addr __always_unused, unsigned long len __always_unused)
>  {
>  	unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM;
>  
> @@ -32,6 +32,6 @@ static inline bool arch_validate_prot(unsigned long prot,
>  
>  	return (prot & ~supported) == 0;
>  }
> -#define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
> +#define arch_validate_prot(prot, addr, len) arch_validate_prot(prot, addr, len)
>  
>  #endif /* ! __ASM_MMAN_H__ */
> diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
> index 7cb6d18f5cd6..65dd9b594985 100644
> --- a/arch/powerpc/include/asm/mman.h
> +++ b/arch/powerpc/include/asm/mman.h
> @@ -36,7 +36,8 @@ static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
>  }
>  #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
>  
> -static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
> +static inline bool arch_validate_prot(unsigned long prot, unsigned long addr,
> +				      unsigned long len)
>  {
>  	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
>  		return false;
> diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
> index 078608ec2e92..b1fabb97d138 100644
> --- a/arch/powerpc/kernel/syscalls.c
> +++ b/arch/powerpc/kernel/syscalls.c
> @@ -43,7 +43,7 @@ static inline long do_mmap2(unsigned long addr, size_t len,
>  {
>  	long ret = -EINVAL;
>  
> -	if (!arch_validate_prot(prot, addr))
> +	if (!arch_validate_prot(prot, addr, len))
>  		goto out;
>  
>  	if (shift) {
> diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
> index f94532f25db1..e85222c76585 100644
> --- a/arch/sparc/include/asm/mman.h
> +++ b/arch/sparc/include/asm/mman.h
> @@ -52,9 +52,11 @@ static inline pgprot_t sparc_vm_get_page_prot(unsigned long vm_flags)
>  	return (vm_flags & VM_SPARC_ADI) ? __pgprot(_PAGE_MCD_4V) : __pgprot(0);
>  }
>  
> -#define arch_validate_prot(prot, addr) sparc_validate_prot(prot, addr)
> -static inline int sparc_validate_prot(unsigned long prot, unsigned long addr)
> +#define arch_validate_prot(prot, addr, len) sparc_validate_prot(prot, addr, len)
> +static inline int sparc_validate_prot(unsigned long prot, unsigned long addr,
> +				      unsigned long len)
>  {
> +	mmap_assert_write_locked(current->mm);
>  	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_ADI))
>  		return 0;
>  	if (prot & PROT_ADI) {
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 6f34c33075f9..5b4d554d3189 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -96,7 +96,8 @@ static inline void vm_unacct_memory(long pages)
>   *
>   * Returns true if the prot flags are valid
>   */
> -static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
> +static inline bool arch_validate_prot(unsigned long prot, unsigned long addr,
> +				      unsigned long len)
>  {
>  	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
>  }
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index ce8b8a5eacbb..e2d6b51acbf8 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -533,14 +533,16 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
>  	end = start + len;
>  	if (end <= start)
>  		return -ENOMEM;
> -	if (!arch_validate_prot(prot, start))
> -		return -EINVAL;
>  
>  	reqprot = prot;
>  
>  	if (mmap_write_lock_killable(current->mm))
>  		return -EINTR;
>  
> +	error = -EINVAL;
> +	if (!arch_validate_prot(prot, start, len))
> +		goto out;
> +
>  	/*
>  	 * If userspace did not allocate the pkey, do not let
>  	 * them use it here.
> 
> base-commit: c85fb28b6f999db9928b841f63f1beeb3074eeca
>
Christoph Hellwig Oct. 8, 2020, 6:21 a.m. UTC | #4
On Wed, Oct 07, 2020 at 04:42:55PM +0200, Jann Horn wrote:
> > > @@ -43,7 +43,7 @@ static inline long do_mmap2(unsigned long addr, size_t len,
> > >  {
> > >       long ret = -EINVAL;
> > >
> > > -     if (!arch_validate_prot(prot, addr))
> > > +     if (!arch_validate_prot(prot, addr, len))
> >
> > This call isn't under mmap lock.  I also find it rather weird as the
> > generic code only calls arch_validate_prot from mprotect, only powerpc
> > also calls it from mmap.
> >
> > This seems to go back to commit ef3d3246a0d0
> > ("powerpc/mm: Add Strong Access Ordering support")
> 
> I'm _guessing_ the idea in the generic case might be that mmap()
> doesn't check unknown bits in the protection flags, and therefore
> maybe people wanted to avoid adding new error cases that could be
> caused by random high bits being set? So while the mprotect() case
> checks the flags and refuses unknown values, the mmap() code just lets
> the architecture figure out which bits are actually valid to set (via
> arch_calc_vm_prot_bits()) and silently ignores the rest?
> 
> And powerpc apparently decided that they do want to error out on bogus
> prot values passed to their version of mmap(), and in exchange, assume
> in arch_calc_vm_prot_bits() that the protection bits are valid?

The problem really is that now programs behave differently on powerpc
compared to all other architectures.

> powerpc's arch_validate_prot() doesn't actually need the mmap lock, so
> I think this is fine-ish for now (as in, while the code is a bit
> unclean, I don't think I'm making it worse, and I don't think it's
> actually buggy). In theory, we could move the arch_validate_prot()
> call over into the mmap guts, where we're holding the lock, and gate
> it on the architecture or on some feature CONFIG that powerpc can
> activate in its Kconfig. But I'm not sure whether that'd be helping or
> making things worse, so when I sent this patch, I deliberately left
> the powerpc stuff as-is.

For now I'd just duplicate the trivial logic from arch_validate_prot
in the powerpc version of do_mmap2 and add a comment that this check
causes a gratuitous incompatibility with all other architectures.  And then
hope that the powerpc maintainers fix it up :)
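
Something like this untested, hypothetical sketch, which just duplicates
the flag mask that powerpc's arch_validate_prot() checks today (the
PROT_SAO CPU-feature checks would need duplicating or factoring into a
helper as well):

/* arch/powerpc/kernel/syscalls.c, hypothetical sketch only */
static inline long do_mmap2(unsigned long addr, size_t len,
			    unsigned long prot, unsigned long flags,
			    unsigned long fd, unsigned long off, int shift)
{
	long ret = -EINVAL;

	/*
	 * Rejecting unknown prot bits here is a gratuitous incompatibility
	 * with all other architectures, which silently ignore them on
	 * mmap(); it is kept only to preserve the existing powerpc ABI.
	 */
	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
		goto out;

	/* ... rest of do_mmap2() unchanged ... */
out:
	return ret;
}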
Catalin Marinas Oct. 8, 2020, 10:12 a.m. UTC | #5
On Wed, Oct 07, 2020 at 09:39:31AM +0200, Jann Horn wrote:
> arch_validate_prot() is a hook that can validate whether a given set of
> protection flags is valid in an mprotect() operation. It is given the set
> of protection flags and the address being modified.
> 
> However, the address being modified can currently not actually be used in
> a meaningful way because:
> 
> 1. Only the address is given, but not the length, and the operation can
>    span multiple VMAs. Therefore, the callee can't actually tell which
>    virtual address range, or which VMAs, are being targeted.
> 2. The mmap_lock is not held, meaning that if the callee were to check
>    the VMA at @addr, that VMA would be unrelated to the one the
>    operation is performed on.
> 
> Currently, custom arch_validate_prot() handlers are defined by
> arm64, powerpc and sparc.
> arm64 and powerpc don't care about the address range, they just check the
> flags against CPU support masks.
> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> the mmap_lock.
> 
> Change the function signature to also take a length, and move the
> arch_validate_prot() call in mm/mprotect.c down into the locked region.

For arm64 mte, I noticed the arch_validate_prot() issue with multiple
vmas and addressed this in a different way:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=for-next/mte&id=c462ac288f2c97e0c1d9ff6a65955317e799f958
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=for-next/mte&id=0042090548740921951f31fc0c20dcdb96638cb0

Both patches queued for 5.10.

Basically, arch_calc_vm_prot_bits() returns VM_MTE if PROT_MTE has
been requested. The newly introduced arch_validate_flags() will check
the VM_MTE flag against what the system supports and this covers both
mmap() and mprotect(). Note that arch_validate_prot() only handles the
latter and I don't think it's sufficient for SPARC ADI. For arm64 MTE we
definitely wanted mmap() flags to be validated.

In addition, there's a new arch_calc_vm_flag_bits() which allows us to
set VM_MTE_ALLOWED on a vma if the conditions are right (MAP_ANONYMOUS
or shmem_mmap()):

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=for-next/mte&id=b3fbbea4c00220f62e6f7e2514466e6ee81f7f60
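
For reference, the arm64 hook in those commits boils down to roughly the
following (paraphrased; see the links above for the exact code):

/* arch/arm64/include/asm/mman.h (for-next/mte), approximately */
static inline bool arch_validate_flags(unsigned long vm_flags)
{
	if (!system_supports_mte())
		return true;

	/* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
	return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
}
#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)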
Michael Ellerman Oct. 8, 2020, 10:34 a.m. UTC | #6
Jann Horn <jannh@google.com> writes:
> On Wed, Oct 7, 2020 at 2:35 PM Christoph Hellwig <hch@infradead.org> wrote:
>> On Wed, Oct 07, 2020 at 09:39:31AM +0200, Jann Horn wrote:
>> > diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
>> > index 078608ec2e92..b1fabb97d138 100644
>> > --- a/arch/powerpc/kernel/syscalls.c
>> > +++ b/arch/powerpc/kernel/syscalls.c
>> > @@ -43,7 +43,7 @@ static inline long do_mmap2(unsigned long addr, size_t len,
>> >  {
>> >       long ret = -EINVAL;
>> >
>> > -     if (!arch_validate_prot(prot, addr))
>> > +     if (!arch_validate_prot(prot, addr, len))
>>
>> This call isn't under mmap lock.  I also find it rather weird as the
>> generic code only calls arch_validate_prot from mprotect, only powerpc
>> also calls it from mmap.
>>
>> This seems to go back to commit ef3d3246a0d0
>> ("powerpc/mm: Add Strong Access Ordering support")
>
> I'm _guessing_ the idea in the generic case might be that mmap()
> doesn't check unknown bits in the protection flags, and therefore
> maybe people wanted to avoid adding new error cases that could be
> caused by random high bits being set?

I suspect it's just that when we added it we updated our do_mmap2() and
didn't touch the generic version because we didn't need to, i.e. it's
not intentional, it's just a buglet.

I think this is the original submission:

  https://lore.kernel.org/linuxppc-dev/20080610220055.10257.84465.sendpatchset@norville.austin.ibm.com/

Which only calls arch_validate_prot() from mprotect and the powerpc
code, and there was no discussion about adding it elsewhere.

> So while the mprotect() case
> checks the flags and refuses unknown values, the mmap() code just lets
> the architecture figure out which bits are actually valid to set (via
> arch_calc_vm_prot_bits()) and silently ignores the rest?
>
> And powerpc apparently decided that they do want to error out on bogus
> prot values passed to their version of mmap(), and in exchange, assume
> in arch_calc_vm_prot_bits() that the protection bits are valid?

I don't think we really decided that, it just happened by accident and
no one noticed/complained.

Seems userspace is pretty well behaved when it comes to passing prot
values to mmap().

> powerpc's arch_validate_prot() doesn't actually need the mmap lock, so
> I think this is fine-ish for now (as in, while the code is a bit
> unclean, I don't think I'm making it worse, and I don't think it's
> actually buggy). In theory, we could move the arch_validate_prot()
> call over into the mmap guts, where we're holding the lock, and gate
> it on the architecture or on some feature CONFIG that powerpc can
> activate in its Kconfig. But I'm not sure whether that'd be helping or
> making things worse, so when I sent this patch, I deliberately left
> the powerpc stuff as-is.

I think what you've done is fine, and anything more elaborate is not
worth the effort.

cheers
Catalin Marinas Oct. 8, 2020, 11:03 a.m. UTC | #7
On Thu, Oct 08, 2020 at 09:34:26PM +1100, Michael Ellerman wrote:
> Jann Horn <jannh@google.com> writes:
> > So while the mprotect() case
> > checks the flags and refuses unknown values, the mmap() code just lets
> > the architecture figure out which bits are actually valid to set (via
> > arch_calc_vm_prot_bits()) and silently ignores the rest?
> >
> > And powerpc apparently decided that they do want to error out on bogus
> > prot values passed to their version of mmap(), and in exchange, assume
> > in arch_calc_vm_prot_bits() that the protection bits are valid?
> 
> I don't think we really decided that, it just happened by accident and
> no one noticed/complained.
> 
> Seems userspace is pretty well behaved when it comes to passing prot
> values to mmap().

It's not necessarily about userspace being well behaved but about
whether it can have security implications. On arm64, if the underlying
memory does not support MTE (say some DAX mmap) but we still allow
user-driven PROT_MTE, it will lead to an SError which brings the whole
machine down.

Not sure whether ADI has similar requirements but at least for arm64 we
addressed the mmap() case as well (see my other email on the details; I
think the approach would work on SPARC as well).
Catalin Marinas Oct. 10, 2020, 11:09 a.m. UTC | #8
Hi Khalid,

On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
> On 10/7/20 1:39 AM, Jann Horn wrote:
> > arch_validate_prot() is a hook that can validate whether a given set of
> > protection flags is valid in an mprotect() operation. It is given the set
> > of protection flags and the address being modified.
> > 
> > However, the address being modified can currently not actually be used in
> > a meaningful way because:
> > 
> > 1. Only the address is given, but not the length, and the operation can
> >    span multiple VMAs. Therefore, the callee can't actually tell which
> >    virtual address range, or which VMAs, are being targeted.
> > 2. The mmap_lock is not held, meaning that if the callee were to check
> >    the VMA at @addr, that VMA would be unrelated to the one the
> >    operation is performed on.
> > 
> > Currently, custom arch_validate_prot() handlers are defined by
> > arm64, powerpc and sparc.
> > arm64 and powerpc don't care about the address range, they just check the
> > flags against CPU support masks.
> > sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> > the mmap_lock.
> > 
> > Change the function signature to also take a length, and move the
> > arch_validate_prot() call in mm/mprotect.c down into the locked region.
[...]
> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
> is made without holding mmap_lock. Lock is not acquired until
> vm_mmap_pgoff(). This variance is uncomfortable but I am more
> uncomfortable forcing all implementations of validate_prot to require
> mmap_lock be held when non-sparc implementations do not have such need
> yet. Since do_mmap2() is in powerpc specific code, for now this patch
> solves a current problem.

I still think sparc should avoid walking the vmas in
arch_validate_prot(). The core code already has the vmas, though not
when calling arch_validate_prot(). That's one of the reasons I added
arch_validate_flags() with the MTE patches. For sparc, this could be
(untested, just copied the arch_validate_prot() code):

static inline bool arch_validate_flags(unsigned long vm_flags)
{
	if (!(vm_flags & VM_SPARC_ADI))
		return true;

	if (!adi_capable())
		return false;

	/* ADI can not be enabled on PFN mapped pages */
	if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
		return false;

	/*
	 * Mergeable pages can become unmergeable if ADI is enabled on
	 * them even if they have identical data on them. This can be
	 * because ADI enabled pages with identical data may still not
	 * have identical ADI tags on them. Disallow ADI on mergeable
	 * pages.
	 */
	if (vm_flags & VM_MERGEABLE)
		return false;

	return true;
}

> That leaves open the question of should
> generic mmap call arch_validate_prot and return EINVAL for invalid
> combination of protection bits, but that is better addressed in a
> separate patch.

The above would cover mmap() as well.

The current sparc_validate_prot() relies on finding the vma for the
corresponding address. However, if you call this early in the mmap()
path, there's no such vma. It is only created later in mmap_region()
which no longer has the original PROT_* flags (all converted to VM_*
flags).
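
That is also why the new hook operates on the already-converted VM_*
flags once the vma exists; roughly (paraphrased from the commits above,
not verbatim), the generic callers look like:

/* mm/mmap.c, mmap_region(), after the vma has been set up */
	/* Allow architectures to sanity-check the vm_flags */
	if (!arch_validate_flags(vma->vm_flags)) {
		error = -EINVAL;
		goto unmap_and_free_vma;
	}

/* mm/mprotect.c, inside the per-vma loop of do_mprotect_pkey() */
	/* Allow architectures to sanity-check the new flags */
	if (!arch_validate_flags(newflags)) {
		error = -EINVAL;
		goto out;
	}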

Calling arch_validate_flags() on mmap() has a small side-effect on the
user ABI: if the CPU is not adi_capable(), PROT_ADI is currently ignored
on mmap() but rejected by sparc_validate_prot(). Powerpc already does
this, and I think it should be fine for arm64 (it needs checking,
though, as we have another flag, PROT_BTI; hopefully dynamic loaders
don't pass this flag unconditionally).

However, as I said above, it doesn't solve the mmap() PROT_ADI checking
for sparc since there's no vma yet. I'd strongly recommend the
arch_validate_flags() approach and reverting the "start" parameter added
to arch_validate_prot() if you go for the flags route.

Thanks.
Khalid Aziz Oct. 12, 2020, 5:03 p.m. UTC | #9
On 10/10/20 5:09 AM, Catalin Marinas wrote:
> Hi Khalid,
> 
> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
>> On 10/7/20 1:39 AM, Jann Horn wrote:
>>> arch_validate_prot() is a hook that can validate whether a given set of
>>> protection flags is valid in an mprotect() operation. It is given the set
>>> of protection flags and the address being modified.
>>>
>>> However, the address being modified can currently not actually be used in
>>> a meaningful way because:
>>>
>>> 1. Only the address is given, but not the length, and the operation can
>>>    span multiple VMAs. Therefore, the callee can't actually tell which
>>>    virtual address range, or which VMAs, are being targeted.
>>> 2. The mmap_lock is not held, meaning that if the callee were to check
>>>    the VMA at @addr, that VMA would be unrelated to the one the
>>>    operation is performed on.
>>>
>>> Currently, custom arch_validate_prot() handlers are defined by
>>> arm64, powerpc and sparc.
>>> arm64 and powerpc don't care about the address range, they just check the
>>> flags against CPU support masks.
>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
>>> the mmap_lock.
>>>
>>> Change the function signature to also take a length, and move the
>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> [...]
>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
>> is made without holding mmap_lock. Lock is not acquired until
>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
>> uncomfortable forcing all implementations of validate_prot to require
>> mmap_lock be held when non-sparc implementations do not have such need
>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
>> solves a current problem.
> 
> I still think sparc should avoid walking the vmas in
> arch_validate_prot(). The core code already has the vmas, though not
> when calling arch_validate_prot(). That's one of the reasons I added
> arch_validate_flags() with the MTE patches. For sparc, this could be
> (untested, just copied the arch_validate_prot() code):

I am a little uncomfortable with the idea of validating protection bits
inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
enabled across multiple VMAs and arch_validate_flags() fails on a VMA
later, do_mprotect_pkey() will bail out with an error, leaving ADI
enabled on the earlier VMAs. This will apply to protection bits other
than ADI as well, of course. This becomes a partial failure of the
mprotect() call. I think it should be all or nothing with mprotect():
when one calls mprotect() from userspace, either the entire address
range passed in gets its protection bits updated or none of it does.
That requires validating protection bits upfront or undoing what
earlier iterations of the VMA walk loop might have done.

--
Khalid

> 
> static inline bool arch_validate_flags(unsigned long vm_flags)
> {
> 	if (!(vm_flags & VM_SPARC_ADI))
> 		return true;
> 
> 	if (!adi_capable())
> 		return false;
> 
> 	/* ADI can not be enabled on PFN mapped pages */
> 	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> 		return false;
> 
> 	/*
> 	 * Mergeable pages can become unmergeable if ADI is enabled on
> 	 * them even if they have identical data on them. This can be
> 	 * because ADI enabled pages with identical data may still not
> 	 * have identical ADI tags on them. Disallow ADI on mergeable
> 	 * pages.
> 	 */
> 	if (vma->vm_flags & VM_MERGEABLE)
> 		return false;
> 
> 	return true;
> }
> 
>> That leaves open the question of should
>> generic mmap call arch_validate_prot and return EINVAL for invalid
>> combination of protection bits, but that is better addressed in a
>> separate patch.
> 
> The above would cover mmap() as well.
> 
> The current sparc_validate_prot() relies on finding the vma for the
> corresponding address. However, if you call this early in the mmap()
> path, there's no such vma. It is only created later in mmap_region()
> which no longer has the original PROT_* flags (all converted to VM_*
> flags).
> 
> Calling arch_validate_flags() on mmap() has a small side-effect on the
> user ABI: if the CPU is not adi_capable(), PROT_ADI is currently ignored
> on mmap() but rejected by sparc_validate_prot(). Powerpc already does
> this already and I think it should be fine for arm64 (it needs checking
> though as we have another flag, PROT_BTI, hopefully dynamic loaders
> don't pass this flag unconditionally).
> 
> However, as I said above, it doesn't solve the mmap() PROT_ADI checking
> for sparc since there's no vma yet. I'd strongly recommend the
> arch_validate_flags() approach and reverting the "start" parameter added
> to arch_validate_prot() if you go for the flags route.
> 
> Thanks.
>
Catalin Marinas Oct. 12, 2020, 5:22 p.m. UTC | #10
On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
> On 10/10/20 5:09 AM, Catalin Marinas wrote:
> > On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
> >> On 10/7/20 1:39 AM, Jann Horn wrote:
> >>> arch_validate_prot() is a hook that can validate whether a given set of
> >>> protection flags is valid in an mprotect() operation. It is given the set
> >>> of protection flags and the address being modified.
> >>>
> >>> However, the address being modified can currently not actually be used in
> >>> a meaningful way because:
> >>>
> >>> 1. Only the address is given, but not the length, and the operation can
> >>>    span multiple VMAs. Therefore, the callee can't actually tell which
> >>>    virtual address range, or which VMAs, are being targeted.
> >>> 2. The mmap_lock is not held, meaning that if the callee were to check
> >>>    the VMA at @addr, that VMA would be unrelated to the one the
> >>>    operation is performed on.
> >>>
> >>> Currently, custom arch_validate_prot() handlers are defined by
> >>> arm64, powerpc and sparc.
> >>> arm64 and powerpc don't care about the address range, they just check the
> >>> flags against CPU support masks.
> >>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> >>> the mmap_lock.
> >>>
> >>> Change the function signature to also take a length, and move the
> >>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> > [...]
> >> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
> >> is made without holding mmap_lock. Lock is not acquired until
> >> vm_mmap_pgoff(). This variance is uncomfortable but I am more
> >> uncomfortable forcing all implementations of validate_prot to require
> >> mmap_lock be held when non-sparc implementations do not have such need
> >> yet. Since do_mmap2() is in powerpc specific code, for now this patch
> >> solves a current problem.
> > 
> > I still think sparc should avoid walking the vmas in
> > arch_validate_prot(). The core code already has the vmas, though not
> > when calling arch_validate_prot(). That's one of the reasons I added
> > arch_validate_flags() with the MTE patches. For sparc, this could be
> > (untested, just copied the arch_validate_prot() code):
> 
> I am little uncomfortable with the idea of validating protection bits
> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
> on earlier VMAs. This will apply to protection bits other than ADI as
> well of course. This becomes a partial failure of mprotect() call. I
> think it should be all or nothing with mprotect() - when one calls
> mprotect() from userspace, either the entire address range passed in
> gets its protection bits updated or none of it does. That requires
> validating protection bits upfront or undoing what earlier iterations of
> VMA walk loop might have done.

I thought the same initially but mprotect() already does this with the
VM_MAY* flag checking. If you ask it for an mprotect() that crosses
multiple vmas and one of them fails, it doesn't roll back the changes to
the prior ones. I considered that a similar approach is fine for MTE
(it's most likely a user error).
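
For reference, that existing check sits inside the per-vma loop of
do_mprotect_pkey(), roughly like this (paraphrased from the 5.9-era
mm/mprotect.c; VM_ACCESS_FLAGS is VM_READ | VM_WRITE | VM_EXEC):

	/*
	 * newflags >> 4 shifts VM_MAY% into the VM_% position, so this
	 * rejects protections the vma was never allowed to gain, but only
	 * for the vma currently being processed; vmas already updated in
	 * earlier iterations keep their new protections.
	 */
	if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) {
		error = -EACCES;
		goto out;
	}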
Khalid Aziz Oct. 12, 2020, 7:14 p.m. UTC | #11
On 10/12/20 11:22 AM, Catalin Marinas wrote:
> On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
>> On 10/10/20 5:09 AM, Catalin Marinas wrote:
>>> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
>>>> On 10/7/20 1:39 AM, Jann Horn wrote:
>>>>> arch_validate_prot() is a hook that can validate whether a given set of
>>>>> protection flags is valid in an mprotect() operation. It is given the set
>>>>> of protection flags and the address being modified.
>>>>>
>>>>> However, the address being modified can currently not actually be used in
>>>>> a meaningful way because:
>>>>>
>>>>> 1. Only the address is given, but not the length, and the operation can
>>>>>    span multiple VMAs. Therefore, the callee can't actually tell which
>>>>>    virtual address range, or which VMAs, are being targeted.
>>>>> 2. The mmap_lock is not held, meaning that if the callee were to check
>>>>>    the VMA at @addr, that VMA would be unrelated to the one the
>>>>>    operation is performed on.
>>>>>
>>>>> Currently, custom arch_validate_prot() handlers are defined by
>>>>> arm64, powerpc and sparc.
>>>>> arm64 and powerpc don't care about the address range, they just check the
>>>>> flags against CPU support masks.
>>>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
>>>>> the mmap_lock.
>>>>>
>>>>> Change the function signature to also take a length, and move the
>>>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
>>> [...]
>>>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
>>>> is made without holding mmap_lock. Lock is not acquired until
>>>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
>>>> uncomfortable forcing all implementations of validate_prot to require
>>>> mmap_lock be held when non-sparc implementations do not have such need
>>>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
>>>> solves a current problem.
>>>
>>> I still think sparc should avoid walking the vmas in
>>> arch_validate_prot(). The core code already has the vmas, though not
>>> when calling arch_validate_prot(). That's one of the reasons I added
>>> arch_validate_flags() with the MTE patches. For sparc, this could be
>>> (untested, just copied the arch_validate_prot() code):
>>
>> I am little uncomfortable with the idea of validating protection bits
>> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
>> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
>> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
>> on earlier VMAs. This will apply to protection bits other than ADI as
>> well of course. This becomes a partial failure of mprotect() call. I
>> think it should be all or nothing with mprotect() - when one calls
>> mprotect() from userspace, either the entire address range passed in
>> gets its protection bits updated or none of it does. That requires
>> validating protection bits upfront or undoing what earlier iterations of
>> VMA walk loop might have done.
> 
> I thought the same initially but mprotect() already does this with the
> VM_MAY* flag checking. If you ask it for an mprotect() that crosses
> multiple vmas and one of them fails, it doesn't roll back the changes to
> the prior ones. I considered that a similar approach is fine for MTE
> (it's most likely a user error).
> 

You are right about the current behavior with the VM_MAY* flags, but
that is not the right behavior. Adding more cases to this just
perpetuates incorrect behavior. It is not easy to roll back changes
after VMAs have potentially been split/merged, which is probably why
the current code simply throws in the towel and returns with a
partially modified address space. It is a lot easier to do all the
checks upfront and then proceed or not proceed with modifying VMAs. One
approach might be to call arch_validate_flags() in a loop before
modifying VMAs and walk all VMAs with a read lock held. The current
code also bails out with ENOMEM if it finds a hole in the address range
and leaves any modifications already made in place. This is another
case where a hole could have been detected earlier.
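
Something along these untested, hypothetical lines, i.e. a separate
validation pass over the range before any vma is modified.
validate_range_upfront() is a made-up name, and a real implementation
would also have to handle pkeys and the PROT_GROWS*/stack cases that
do_mprotect_pkey() deals with:

static int validate_range_upfront(struct mm_struct *mm, unsigned long start,
				  unsigned long end, unsigned long prot)
{
	struct vm_area_struct *vma = find_vma(mm, start);
	unsigned long newflags;

	for (; vma && vma->vm_start < end; vma = vma->vm_next) {
		if (vma->vm_start > start)
			return -ENOMEM;	/* hole in the requested range */

		/* same flag combination the modifying loop would compute */
		newflags = calc_vm_prot_bits(prot, 0) |
			   (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));
		if (!arch_validate_flags(newflags))
			return -EINVAL;

		start = vma->vm_end;
	}

	return start < end ? -ENOMEM : 0;	/* trailing hole */
}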

--
Khalid
Catalin Marinas Oct. 13, 2020, 9:16 a.m. UTC | #12
On Mon, Oct 12, 2020 at 01:14:50PM -0600, Khalid Aziz wrote:
> On 10/12/20 11:22 AM, Catalin Marinas wrote:
> > On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
> >> On 10/10/20 5:09 AM, Catalin Marinas wrote:
> >>> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
> >>>> On 10/7/20 1:39 AM, Jann Horn wrote:
> >>>>> arch_validate_prot() is a hook that can validate whether a given set of
> >>>>> protection flags is valid in an mprotect() operation. It is given the set
> >>>>> of protection flags and the address being modified.
> >>>>>
> >>>>> However, the address being modified can currently not actually be used in
> >>>>> a meaningful way because:
> >>>>>
> >>>>> 1. Only the address is given, but not the length, and the operation can
> >>>>>    span multiple VMAs. Therefore, the callee can't actually tell which
> >>>>>    virtual address range, or which VMAs, are being targeted.
> >>>>> 2. The mmap_lock is not held, meaning that if the callee were to check
> >>>>>    the VMA at @addr, that VMA would be unrelated to the one the
> >>>>>    operation is performed on.
> >>>>>
> >>>>> Currently, custom arch_validate_prot() handlers are defined by
> >>>>> arm64, powerpc and sparc.
> >>>>> arm64 and powerpc don't care about the address range, they just check the
> >>>>> flags against CPU support masks.
> >>>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> >>>>> the mmap_lock.
> >>>>>
> >>>>> Change the function signature to also take a length, and move the
> >>>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> >>> [...]
> >>>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
> >>>> is made without holding mmap_lock. Lock is not acquired until
> >>>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
> >>>> uncomfortable forcing all implementations of validate_prot to require
> >>>> mmap_lock be held when non-sparc implementations do not have such need
> >>>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
> >>>> solves a current problem.
> >>>
> >>> I still think sparc should avoid walking the vmas in
> >>> arch_validate_prot(). The core code already has the vmas, though not
> >>> when calling arch_validate_prot(). That's one of the reasons I added
> >>> arch_validate_flags() with the MTE patches. For sparc, this could be
> >>> (untested, just copied the arch_validate_prot() code):
> >>
> >> I am little uncomfortable with the idea of validating protection bits
> >> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
> >> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
> >> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
> >> on earlier VMAs. This will apply to protection bits other than ADI as
> >> well of course. This becomes a partial failure of mprotect() call. I
> >> think it should be all or nothing with mprotect() - when one calls
> >> mprotect() from userspace, either the entire address range passed in
> >> gets its protection bits updated or none of it does. That requires
> >> validating protection bits upfront or undoing what earlier iterations of
> >> VMA walk loop might have done.
> > 
> > I thought the same initially but mprotect() already does this with the
> > VM_MAY* flag checking. If you ask it for an mprotect() that crosses
> > multiple vmas and one of them fails, it doesn't roll back the changes to
> > the prior ones. I considered that a similar approach is fine for MTE
> > (it's most likely a user error).
> 
> You are right about the current behavior with VM_MAY* flags, but that is
> not the right behavior. Adding more cases to this just perpetuates
> incorrect behavior. It is not easy to roll back changes after VMAs have
> potentially been split/merged which is probably why the current code
> simply throws in the towel and returns with partially modified address
> space. It is lot easier to do all the checks upfront and then proceed or
> not proceed with modifying VMAs. One approach might be to call
> arch_validate_flags() in a loop before modifying VMAs and walk all VMAs
> with a read lock held. Current code also bails out with ENOMEM if it
> finds a hole in the address range and leaves any modifications already
> made in place. This is another case where a hole could have been
> detected earlier.

This would indeed be ideal, though with the risk of breaking the current
ABI (FWIW, FreeBSD seems to do a first pass to check for violations:
https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_map.c#L2630).

However, I'm not sure it's worth the hassle. Do we expect the user to
call mprotect() across multiple mixed-type mappings while relying on no
change if an error is returned? We should probably at least document the
current behaviour in the mprotect man page.
Khalid Aziz Oct. 14, 2020, 9:21 p.m. UTC | #13
On 10/13/20 3:16 AM, Catalin Marinas wrote:
> On Mon, Oct 12, 2020 at 01:14:50PM -0600, Khalid Aziz wrote:
>> On 10/12/20 11:22 AM, Catalin Marinas wrote:
>>> On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
>>>> On 10/10/20 5:09 AM, Catalin Marinas wrote:
>>>>> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
>>>>>> On 10/7/20 1:39 AM, Jann Horn wrote:
>>>>>>> arch_validate_prot() is a hook that can validate whether a given set of
>>>>>>> protection flags is valid in an mprotect() operation. It is given the set
>>>>>>> of protection flags and the address being modified.
>>>>>>>
>>>>>>> However, the address being modified can currently not actually be used in
>>>>>>> a meaningful way because:
>>>>>>>
>>>>>>> 1. Only the address is given, but not the length, and the operation can
>>>>>>>    span multiple VMAs. Therefore, the callee can't actually tell which
>>>>>>>    virtual address range, or which VMAs, are being targeted.
>>>>>>> 2. The mmap_lock is not held, meaning that if the callee were to check
>>>>>>>    the VMA at @addr, that VMA would be unrelated to the one the
>>>>>>>    operation is performed on.
>>>>>>>
>>>>>>> Currently, custom arch_validate_prot() handlers are defined by
>>>>>>> arm64, powerpc and sparc.
>>>>>>> arm64 and powerpc don't care about the address range, they just check the
>>>>>>> flags against CPU support masks.
>>>>>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
>>>>>>> the mmap_lock.
>>>>>>>
>>>>>>> Change the function signature to also take a length, and move the
>>>>>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
>>>>> [...]
>>>>>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
>>>>>> is made without holding mmap_lock. Lock is not acquired until
>>>>>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
>>>>>> uncomfortable forcing all implementations of validate_prot to require
>>>>>> mmap_lock be held when non-sparc implementations do not have such need
>>>>>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
>>>>>> solves a current problem.
>>>>>
>>>>> I still think sparc should avoid walking the vmas in
>>>>> arch_validate_prot(). The core code already has the vmas, though not
>>>>> when calling arch_validate_prot(). That's one of the reasons I added
>>>>> arch_validate_flags() with the MTE patches. For sparc, this could be
>>>>> (untested, just copied the arch_validate_prot() code):
>>>>
>>>> I am little uncomfortable with the idea of validating protection bits
>>>> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
>>>> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
>>>> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
>>>> on earlier VMAs. This will apply to protection bits other than ADI as
>>>> well of course. This becomes a partial failure of mprotect() call. I
>>>> think it should be all or nothing with mprotect() - when one calls
>>>> mprotect() from userspace, either the entire address range passed in
>>>> gets its protection bits updated or none of it does. That requires
>>>> validating protection bits upfront or undoing what earlier iterations of
>>>> VMA walk loop might have done.
>>>
>>> I thought the same initially but mprotect() already does this with the
>>> VM_MAY* flag checking. If you ask it for an mprotect() that crosses
>>> multiple vmas and one of them fails, it doesn't roll back the changes to
>>> the prior ones. I considered that a similar approach is fine for MTE
>>> (it's most likely a user error).
>>
>> You are right about the current behavior with VM_MAY* flags, but that is
>> not the right behavior. Adding more cases to this just perpetuates
>> incorrect behavior. It is not easy to roll back changes after VMAs have
>> potentially been split/merged which is probably why the current code
>> simply throws in the towel and returns with partially modified address
>> space. It is lot easier to do all the checks upfront and then proceed or
>> not proceed with modifying VMAs. One approach might be to call
>> arch_validate_flags() in a loop before modifying VMAs and walk all VMAs
>> with a read lock held. Current code also bails out with ENOMEM if it
>> finds a hole in the address range and leaves any modifications already
>> made in place. This is another case where a hole could have been
>> detected earlier.
> 
> This should be ideal indeed though with the risk of breaking the current
> ABI (FWIW, FreeBSD seems to do a first pass to check for violations:
> https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_map.c#L2630).

I am not sure I understand where the ABI breakage would be. Are we aware
of apps that intentionally modify the address space partially using the
current code? What FreeBSD does seems like a reasonable thing to do.
Anyway, the first thing to do is to update sparc to use
arch_validate_flags() and update sparc_validate_prot() to not peek into
the vma without the lock. I can do that unless Jann wants to rework this
2-patch series with these changes.

> 
> However, I'm not sure it's worth the hassle. Do we expect the user to
> call mprotect() across multiple mixed type mappings while relying on no
> change if an error is returned? We should probably at least document the
> current behaviour in the mprotect man page.
> 

Yes, documenting current behavior is definitely a good thing to do.

--
Khalid
Jann Horn Oct. 14, 2020, 10:29 p.m. UTC | #14
On Wed, Oct 14, 2020 at 11:24 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
[...]
> current code? What FreeBSD does seems like a reasonable thing to do. Any
> way first thing to do is to update sparc to use arch_validate_flags()
> and update sparc_validate_prot() to not peek into vma without lock. I
> can do that unless Jann wants to rework this 2 patch series with these
> changes.

Ah, if you're willing to take care of that, that'd be nice, please do. :)
Catalin Marinas Oct. 15, 2020, 9:05 a.m. UTC | #15
On Wed, Oct 14, 2020 at 03:21:16PM -0600, Khalid Aziz wrote:
> On 10/13/20 3:16 AM, Catalin Marinas wrote:
> > On Mon, Oct 12, 2020 at 01:14:50PM -0600, Khalid Aziz wrote:
> >> On 10/12/20 11:22 AM, Catalin Marinas wrote:
> >>> On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
> >>>> On 10/10/20 5:09 AM, Catalin Marinas wrote:
> >>>>> I still think sparc should avoid walking the vmas in
> >>>>> arch_validate_prot(). The core code already has the vmas, though not
> >>>>> when calling arch_validate_prot(). That's one of the reasons I added
> >>>>> arch_validate_flags() with the MTE patches. For sparc, this could be
> >>>>> (untested, just copied the arch_validate_prot() code):
> >>>>
> >>>> I am little uncomfortable with the idea of validating protection bits
> >>>> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
> >>>> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
> >>>> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
> >>>> on earlier VMAs. This will apply to protection bits other than ADI as
> >>>> well of course. This becomes a partial failure of mprotect() call. I
> >>>> think it should be all or nothing with mprotect() - when one calls
> >>>> mprotect() from userspace, either the entire address range passed in
> >>>> gets its protection bits updated or none of it does. That requires
> >>>> validating protection bits upfront or undoing what earlier iterations of
> >>>> VMA walk loop might have done.
> >>>
> >>> I thought the same initially but mprotect() already does this with the
> >>> VM_MAY* flag checking. If you ask it for an mprotect() that crosses
> >>> multiple vmas and one of them fails, it doesn't roll back the changes to
> >>> the prior ones. I considered that a similar approach is fine for MTE
> >>> (it's most likely a user error).
> >>
> >> You are right about the current behavior with VM_MAY* flags, but that is
> >> not the right behavior. Adding more cases to this just perpetuates
> >> incorrect behavior. It is not easy to roll back changes after VMAs have
> >> potentially been split/merged which is probably why the current code
> >> simply throws in the towel and returns with partially modified address
> >> space. It is lot easier to do all the checks upfront and then proceed or
> >> not proceed with modifying VMAs. One approach might be to call
> >> arch_validate_flags() in a loop before modifying VMAs and walk all VMAs
> >> with a read lock held. Current code also bails out with ENOMEM if it
> >> finds a hole in the address range and leaves any modifications already
> >> made in place. This is another case where a hole could have been
> >> detected earlier.
> > 
> > This should be ideal indeed though with the risk of breaking the current
> > ABI (FWIW, FreeBSD seems to do a first pass to check for violations:
> > https://github.com/freebsd/freebsd/blob/master/sys/vm/vm_map.c#L2630).
> 
> I am not sure I understand where the ABI breakage would be. Are we aware
> of apps that intentionally modify address space partially using the
> current code?

I hope there aren't any but you never know until you make the change and
someone complains. Arguably, such user code is already broken since
mprotect() doesn't even tell where the failure occurred.

> What FreeBSD does seems like a reasonable thing to do. Any way first
> thing to do is to update sparc to use arch_validate_flags() and update
> sparc_validate_prot() to not peek into vma without lock.

If you go for arch_validate_flags(), I think sparc_validate_prot()
doesn't need the vma at all.

BTW, on the ADI topic, I think you have a race in do_swap_page() since
set_pte_at() is called before arch_do_swap_page(). So a thread in the
same process would see the new mapping but the tags have not been
updated yet. Unless sparc relies on the new user pte to be set, I think
you can just swap the two calls.
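
I.e. roughly this in do_swap_page() (paraphrased, untested):

	/* current order: the new pte becomes visible to other threads
	 * before arch_do_swap_page() has restored the tags */
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);

	/* proposed order: restore the tags first, then publish the pte */
	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);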
Khalid Aziz Oct. 15, 2020, 2:53 p.m. UTC | #16
On 10/15/20 3:05 AM, Catalin Marinas wrote:
> On Wed, Oct 14, 2020 at 03:21:16PM -0600, Khalid Aziz wrote:
>> What FreeBSD does seems like a reasonable thing to do. Any way first
>> thing to do is to update sparc to use arch_validate_flags() and update
>> sparc_validate_prot() to not peek into vma without lock.
> 
> If you go for arch_validate_flags(), I think sparc_validate_prot()
> doesn't need the vma at all.

Yes, the plan is to move the vma flag check from sparc_validate_prot()
to arch_validate_flags().

> 
> BTW, on the ADI topic, I think you have a race in do_swap_page() since
> set_pte_at() is called before arch_do_swap_page(). So a thread in the
> same process would see the new mapping but the tags have not been
> updated yet. Unless sparc relies on the new user pte to be set, I think
> you can just swap the two calls.
> 

Thanks for pointing that out. I will take a look at it.

--
Khalid

Patch

diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index 081ec8de9ea6..0876a87986dc 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -23,7 +23,7 @@  static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
 
 static inline bool arch_validate_prot(unsigned long prot,
-	unsigned long addr __always_unused)
+	unsigned long addr __always_unused, unsigned long len __always_unused)
 {
 	unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM;
 
@@ -32,6 +32,6 @@  static inline bool arch_validate_prot(unsigned long prot,
 
 	return (prot & ~supported) == 0;
 }
-#define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
+#define arch_validate_prot(prot, addr, len) arch_validate_prot(prot, addr, len)
 
 #endif /* ! __ASM_MMAN_H__ */
diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 7cb6d18f5cd6..65dd9b594985 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -36,7 +36,8 @@  static inline pgprot_t arch_vm_get_page_prot(unsigned long vm_flags)
 }
 #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
 
-static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr,
+				      unsigned long len)
 {
 	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
 		return false;
diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index 078608ec2e92..b1fabb97d138 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -43,7 +43,7 @@  static inline long do_mmap2(unsigned long addr, size_t len,
 {
 	long ret = -EINVAL;
 
-	if (!arch_validate_prot(prot, addr))
+	if (!arch_validate_prot(prot, addr, len))
 		goto out;
 
 	if (shift) {
diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
index f94532f25db1..e85222c76585 100644
--- a/arch/sparc/include/asm/mman.h
+++ b/arch/sparc/include/asm/mman.h
@@ -52,9 +52,11 @@  static inline pgprot_t sparc_vm_get_page_prot(unsigned long vm_flags)
 	return (vm_flags & VM_SPARC_ADI) ? __pgprot(_PAGE_MCD_4V) : __pgprot(0);
 }
 
-#define arch_validate_prot(prot, addr) sparc_validate_prot(prot, addr)
-static inline int sparc_validate_prot(unsigned long prot, unsigned long addr)
+#define arch_validate_prot(prot, addr, len) sparc_validate_prot(prot, addr, len)
+static inline int sparc_validate_prot(unsigned long prot, unsigned long addr,
+				      unsigned long len)
 {
+	mmap_assert_write_locked(current->mm);
 	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_ADI))
 		return 0;
 	if (prot & PROT_ADI) {
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 6f34c33075f9..5b4d554d3189 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -96,7 +96,8 @@  static inline void vm_unacct_memory(long pages)
  *
  * Returns true if the prot flags are valid
  */
-static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
+static inline bool arch_validate_prot(unsigned long prot, unsigned long addr,
+				      unsigned long len)
 {
 	return (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM)) == 0;
 }
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..e2d6b51acbf8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -533,14 +533,16 @@  static int do_mprotect_pkey(unsigned long start, size_t len,
 	end = start + len;
 	if (end <= start)
 		return -ENOMEM;
-	if (!arch_validate_prot(prot, start))
-		return -EINVAL;
 
 	reqprot = prot;
 
 	if (mmap_write_lock_killable(current->mm))
 		return -EINTR;
 
+	error = -EINVAL;
+	if (!arch_validate_prot(prot, start, len))
+		goto out;
+
 	/*
 	 * If userspace did not allocate the pkey, do not let
 	 * them use it here.