diff mbox series

[v2,3/5] x86/mm: check exec permissions on fault

Message ID 20211021122112.592634-4-namit@vmware.com (mailing list archive)
State New
Headers show
Series mm/mprotect: avoid unnecessary TLB flushes | expand

Commit Message

Nadav Amit Oct. 21, 2021, 12:21 p.m. UTC
From: Nadav Amit <namit@vmware.com>

access_error() currently does not check for execution permission
violation. As a result, spurious page-faults due to execution permission
violation cause SIGSEGV.

It appears not to be an issue so far, but the next patches avoid TLB
flushes on permission promotion, which can lead to this scenario. nodejs
for instance crashes when TLB flush is avoided on permission promotion.

Add a check to prevent access_error() from returning mistakenly that
page-faults due to instruction fetch are not allowed. Intel SDM does not
indicate whether "instruction fetch" and "write" in the hardware error
code are mutual exclusive, so check both before returning whether the
access is allowed.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: x86@kernel.org
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 arch/x86/mm/fault.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra Oct. 25, 2021, 10:59 a.m. UTC | #1
On Thu, Oct 21, 2021 at 05:21:10AM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>

> Add a check to prevent access_error() from returning mistakenly that
> page-faults due to instruction fetch are not allowed. Intel SDM does not
> indicate whether "instruction fetch" and "write" in the hardware error
> code are mutual exclusive, so check both before returning whether the
> access is allowed.

Dave, can we get that clarified? It seems a bit naf and leads to
confusing code IMO.

Other than that, the change looks ok to me.

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index b2eefdefc108..e776130473ce 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1100,10 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>  				       (error_code & X86_PF_INSTR), foreign))
>  		return 1;
>  
> -	if (error_code & X86_PF_WRITE) {
> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>  		/* write, present and write, not present: */
> -		if (unlikely(!(vma->vm_flags & VM_WRITE)))
> +		if ((error_code & X86_PF_WRITE) &&
> +		    unlikely(!(vma->vm_flags & VM_WRITE)))
>  			return 1;
> +
> +		/* exec, present and exec, not present: */
> +		if ((error_code & X86_PF_INSTR) &&
> +		    unlikely(!(vma->vm_flags & VM_EXEC)))
> +			return 1;
> +
>  		return 0;
>  	}
Andrew Cooper Oct. 25, 2021, 11:13 a.m. UTC | #2
On 25/10/2021 11:59, Peter Zijlstra wrote:
> On Thu, Oct 21, 2021 at 05:21:10AM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> Add a check to prevent access_error() from returning mistakenly that
>> page-faults due to instruction fetch are not allowed. Intel SDM does not
>> indicate whether "instruction fetch" and "write" in the hardware error
>> code are mutual exclusive, so check both before returning whether the
>> access is allowed.
> Dave, can we get that clarified? It seems a bit naf and leads to
> confusing code IMO.

There is no such thing as an instruction fetch (a read) causing a
modification to the mapping.  From this point of view, you'd never
expect to see them both set.

However, be aware that INSTR is only reported for NX || SMEP.  Without
either, instruction vs data accesses are distinguished internally (can
be demonstrated with SMAP) but not visible in the pagefault error code.

~Andrew
Dave Hansen Oct. 25, 2021, 2:20 p.m. UTC | #3
On 10/21/21 5:21 AM, Nadav Amit wrote:
> access_error() currently does not check for execution permission
> violation. 
Ye

> As a result, spurious page-faults due to execution permission
> violation cause SIGSEGV.

While I could totally believe that something is goofy when VMAs are
being changed underneath a page fault, I'm having trouble figuring out
why the "if (error_code & X86_PF_WRITE)" code is being modified.

> It appears not to be an issue so far, but the next patches avoid TLB
> flushes on permission promotion, which can lead to this scenario. nodejs
> for instance crashes when TLB flush is avoided on permission promotion.

Just to be clear, "promotion" is going from something like:

	W=0->W=1
or
	NX=1->NX=0

right?  I tend to call that "relaxing" permissions.

Currently, X86_PF_WRITE faults are considered an access error unless the
VMA to which the write occurred allows writes.  Returning "no access
error" permits continuing and handling the copy-on-write.

It sounds like you want to expand that.  You want to add a whole class
of new faults that can be ignored: not just that some COW handling might
be necessary, but that the PTE itself might be out of date.    Just like
a "COW fault" may just result in setting the PTE.W=1 and moving on with
our day, an instruction fault might now just end up with setting
PTE.NX=0 and also moving on with our day.

I'm really confused why the "error_code & X86_PF_WRITE" case is getting
modified.  I would have expected it to be something like just adding:

	/* read, instruction fetch */
	if (error_code & X86_PF_INSN) {
                /* Avoid enforcing access error if spurious: */
                if (unlikely(!(vma->vm_flags & VM_EXEC)))
                        return 1;
                return 0;
        }

I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common
other than both being able to (now) be generated spuriously.
Dave Hansen Oct. 25, 2021, 2:23 p.m. UTC | #4
On 10/25/21 3:59 AM, Peter Zijlstra wrote:
>> Add a check to prevent access_error() from returning mistakenly that
>> page-faults due to instruction fetch are not allowed. Intel SDM does not
>> indicate whether "instruction fetch" and "write" in the hardware error
>> code are mutual exclusive, so check both before returning whether the
>> access is allowed.
> Dave, can we get that clarified? It seems a bit naf and leads to
> confusing code IMO.

We can, but there are quite a few implicit relationships in those bits.
 PF_INSN and PF_PK can't ever be set together, for instance.  It's
pretty clear as long as you have fetch==read in your head.
Nadav Amit Oct. 25, 2021, 4:19 p.m. UTC | #5
> On Oct 25, 2021, at 7:20 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> access_error() currently does not check for execution permission
>> violation. 
> Ye
> 
>> As a result, spurious page-faults due to execution permission
>> violation cause SIGSEGV.
> 
> While I could totally believe that something is goofy when VMAs are
> being changed underneath a page fault, I'm having trouble figuring out
> why the "if (error_code & X86_PF_WRITE)" code is being modified.

In the scenario I mentioned the VMAs are not changed underneath the
page-fault. They change *before* the page-fault, but there are
residues of the old PTE in the TLB. 

> 
>> It appears not to be an issue so far, but the next patches avoid TLB
>> flushes on permission promotion, which can lead to this scenario. nodejs
>> for instance crashes when TLB flush is avoided on permission promotion.
> 
> Just to be clear, "promotion" is going from something like:
> 
> 	W=0->W=1
> or
> 	NX=1->NX=0
> 
> right?  I tend to call that "relaxing" permissions.

I specifically talk about NX=1>NX=0.

I can change the language to “relaxing”.

> 
> Currently, X86_PF_WRITE faults are considered an access error unless the
> VMA to which the write occurred allows writes.  Returning "no access
> error" permits continuing and handling the copy-on-write.
> 
> It sounds like you want to expand that.  You want to add a whole class
> of new faults that can be ignored: not just that some COW handling might
> be necessary, but that the PTE itself might be out of date.    Just like
> a "COW fault" may just result in setting the PTE.W=1 and moving on with
> our day, an instruction fault might now just end up with setting
> PTE.NX=0 and also moving on with our day.

You raise an interesting idea (which can easily be implemented with uffd),
but no - I had none of that in my mind.

My only purpose is to deal with actual spurious page-faults that I
encountered when I removed the TLB flush the happens after NX=1->NX=0.

I am actually surprised that the kernel makes such a strong assumption
that every change of NX=1->NX=0 would be followed by a TLB flush, and
that during these changes the mm is locked for write. But that is the
case. If you do not have this change and a PTE is changed from
NX=1->NX=0 and *later* you access the page, you can have a page-fault
due to stale PTE, and get a SIGSEGV since access_error() is wrong to
assume that this is an invalid access.

I did not change and there are no changes to the VMA during the
page-fault. The page-fault handler would do pretty much nothing and
return to user-space which would retry the instruction. [ page-fault
triggers an implicit TLB flush of the offending PTE ]

> 
> I'm really confused why the "error_code & X86_PF_WRITE" case is getting
> modified.  I would have expected it to be something like just adding:
> 
> 	/* read, instruction fetch */
> 	if (error_code & X86_PF_INSN) {
>                /* Avoid enforcing access error if spurious: */
>                if (unlikely(!(vma->vm_flags & VM_EXEC)))
>                        return 1;
>                return 0;
>        }
> 
> I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common
> other than both being able to (now) be generated spuriously.

That was my first version, but I was concerned that perhaps there is
some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
be set. That is the reason that Peter asked you whether this is
something that might happen.

If you confirm they cannot be both set, I would the version you just
mentioned.
Dave Hansen Oct. 25, 2021, 5:45 p.m. UTC | #6
On 10/25/21 9:19 AM, Nadav Amit wrote:
> That was my first version, but I was concerned that perhaps there is
> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
> be set. That is the reason that Peter asked you whether this is
> something that might happen.
> 
> If you confirm they cannot be both set, I would the version you just
> mentioned.

I'm pretty sure they can't be set together on any sane hardware.  A
bonkers hypervisor or CPU could do it of course, but they'd be crazy.

BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
That would be a nice place to talk about the assumption.
Nadav Amit Oct. 25, 2021, 5:51 p.m. UTC | #7
> On Oct 25, 2021, at 10:45 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 10/25/21 9:19 AM, Nadav Amit wrote:
>> That was my first version, but I was concerned that perhaps there is
>> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
>> be set. That is the reason that Peter asked you whether this is
>> something that might happen.
>> 
>> If you confirm they cannot be both set, I would the version you just
>> mentioned.
> 
> I'm pretty sure they can't be set together on any sane hardware.  A
> bonkers hypervisor or CPU could do it of course, but they'd be crazy.
> 
> BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
> That would be a nice place to talk about the assumption.
> 

I can do that. But be aware that if the assumption is broken, it might
lead to the application getting stuck in an infinite loop of
page-faults instead of receiving SIGSEGV.
Dave Hansen Oct. 25, 2021, 6 p.m. UTC | #8
On 10/25/21 10:51 AM, Nadav Amit wrote:
>> On Oct 25, 2021, at 10:45 AM, Dave Hansen <dave.hansen@intel.com> wrote:
>> On 10/25/21 9:19 AM, Nadav Amit wrote:
>>> That was my first version, but I was concerned that perhaps there is
>>> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
>>> be set. That is the reason that Peter asked you whether this is
>>> something that might happen.
>>>
>>> If you confirm they cannot be both set, I would the version you just
>>> mentioned.
>> I'm pretty sure they can't be set together on any sane hardware.  A
>> bonkers hypervisor or CPU could do it of course, but they'd be crazy.
>>
>> BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
>> That would be a nice place to talk about the assumption.
>>
> I can do that. But be aware that if the assumption is broken, it might
> lead to the application getting stuck in an infinite loop of
> page-faults instead of receiving SIGSEGV.

If we have a bonkers hypervisor/CPU, I'm OK with a process that hangs
like that, especially if we can ^C it and see its stream of page faults
with tracing or whatever.

Couldn't we just also do:

	if ((code & (X86_PF_WRITE|X86_PF_INSN) ==
                    (X86_PF_WRITE|X86_PF_INSN)) {
		WARN_ON_ONCE(1);
		return 1;
	}

That should give you the WARN_ON_ONCE() and also return an affirmative
access_error(), resulting in a SIGSEGV.

(I'm not sure I like the indentation as I wrote it here... just do what
looks best in the code)
diff mbox series

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..e776130473ce 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1100,10 +1100,17 @@  access_error(unsigned long error_code, struct vm_area_struct *vma)
 				       (error_code & X86_PF_INSTR), foreign))
 		return 1;
 
-	if (error_code & X86_PF_WRITE) {
+	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
 		/* write, present and write, not present: */
-		if (unlikely(!(vma->vm_flags & VM_WRITE)))
+		if ((error_code & X86_PF_WRITE) &&
+		    unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
+
+		/* exec, present and exec, not present: */
+		if ((error_code & X86_PF_INSTR) &&
+		    unlikely(!(vma->vm_flags & VM_EXEC)))
+			return 1;
+
 		return 0;
 	}