[RESEND,v3,0/5] mm/mprotect: avoid unnecessary TLB flushes

Message ID 20220311190749.338281-1-namit@vmware.com (mailing list archive)

Message

Nadav Amit March 11, 2022, 7:07 p.m. UTC
From: Nadav Amit <namit@vmware.com>

This patch-set is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set makes it through, similar
and further optimizations for MADV_COLD and userfaultfd would be
possible.

Sorry for the time it took me to get to v3.

Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and
   do better/fewer flushes. This would also be handy for later
   userfaultfd enhancements.
2. Avoid TLB flushes on permission promotion (a toy sketch of the idea
   follows this list). This optimization is the one that provides most
   of the performance benefits. Note that the previous batching
   infrastructure changes are needed for that to happen.
3. Avoid TLB flushes on change_huge_pmd() that are only needed to
   prevent the A/D bits from changing.
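
To make the idea behind optimization 2 concrete, here is a toy
user-space sketch (illustrative only; it is not the helper the patches
add, and it ignores the dirty/accessed-bit handling the real code needs):

#include <stdbool.h>

/*
 * Toy model only: treat old_prot and new_prot as PROT_READ/PROT_WRITE/
 * PROT_EXEC bitmasks.  A TLB flush is strictly required only when a
 * permission is being removed; if permissions are only added
 * ("promotion"), any stale TLB entry is more restrictive than the new
 * PTE, so the worst case is a spurious, recoverable page fault.
 */
static bool flush_required(unsigned int old_prot, unsigned int new_prot)
{
	/* A bit set in old_prot but cleared in new_prot forces a flush. */
	return (old_prot & ~new_prot) != 0;
}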

Andrew asked for some benchmark numbers. I do not have a deterministic
macrobenchmark in which it is easy to show the benefit. I therefore ran
a microbenchmark: a loop that does the following on anonymous memory,
just as a sanity check to see that time is saved by avoiding TLB
flushes. The loop goes:

	mprotect(p, PAGE_SIZE, PROT_READ)
	mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
	*p = 0; // write to the page

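A self-contained sketch of this microbenchmark could look roughly like
the following (a reconstruction for illustration, not the exact harness
I used; the 2-thread runs also had a second busy-looping thread, which
is omitted here):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define ITERS	100000
#define PAGE_SZ	4096

int main(void)
{
	volatile char *p = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint64_t t_ro = 0, t_rw = 0, start;

	*p = 0;				/* populate the anonymous page */

	for (int i = 0; i < ITERS; i++) {
		start = __rdtsc();
		mprotect((void *)p, PAGE_SZ, PROT_READ);
		t_ro += __rdtsc() - start;

		start = __rdtsc();
		mprotect((void *)p, PAGE_SZ, PROT_READ | PROT_WRITE);
		t_rw += __rdtsc() - start;

		*p = 0;			/* write to the page */
	}

	printf("PROT_READ:            %" PRIu64 " cycles\n", t_ro / ITERS);
	printf("PROT_READ|PROT_WRITE: %" PRIu64 " cycles\n", t_rw / ITERS);
	return 0;
}
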
The test was run in a KVM guest with 1 or 2 threads (the second thread
was busy-looping). I measured the time (cycles) of each operation:

		1 thread		2 threads
		mmots	+patch		mmots	+patch
PROT_READ	3494	2725 (-22%)	8630	7788 (-10%)
PROT_READ|WRITE	3952	2724 (-31%)	9075	2865 (-68%)

[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

The exact numbers are really meaningless, but the benefit is clear.
There are two interesting results, though.

(1) mprotect(PROT_READ) becomes cheaper, although one might expect it
not to be affected. This is presumably due to the TLB miss that is saved.

(2) Without the memory access (*p = 0), the speedup from the patches is
even greater. In that scenario, mprotect(PROT_READ) also avoids the TLB
flush. As a result, both operations on the patched kernel take roughly
1500 cycles (with either 1 or 2 threads), whereas on mmots their cost is
as high as presented in the table.

--

v2 -> v3:
* Fix the order of the patches (the previous order could lead to breakage)
* Better comments
* Clearer KNL detection [Dave]
* Assertion on PF error-code [Dave]
* Comments, code, function names improvements [PeterZ]
* Flush on access-bit clearing on PMD changes, to follow the way
  flushing is done on x86 in the kernel today.

v1 -> v2:
* Wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of the mmu_gather()


Nadav Amit (5):
  x86: Detection of Knights Landing A/D leak
  x86/mm: check exec permissions on fault
  mm/mprotect: use mmu_gather
  mm/mprotect: do not flush on permission promotion
  mm: avoid unnecessary flush on change_huge_pmd()

 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/pgtable.h       |  5 ++
 arch/x86/include/asm/pgtable_types.h |  2 +
 arch/x86/include/asm/tlbflush.h      | 82 ++++++++++++++++++++++++
 arch/x86/kernel/cpu/intel.c          |  5 ++
 arch/x86/mm/fault.c                  | 22 ++++++-
 arch/x86/mm/pgtable.c                | 10 +++
 fs/exec.c                            |  6 +-
 include/asm-generic/tlb.h            | 14 +++++
 include/linux/huge_mm.h              |  5 +-
 include/linux/mm.h                   |  5 +-
 include/linux/pgtable.h              | 20 ++++++
 mm/huge_memory.c                     | 19 ++++--
 mm/mprotect.c                        | 94 +++++++++++++++-------------
 mm/pgtable-generic.c                 |  8 +++
 mm/userfaultfd.c                     |  6 +-
 16 files changed, 248 insertions(+), 56 deletions(-)

Comments

Dave Hansen March 11, 2022, 7:41 p.m. UTC | #1
On 3/11/22 11:07, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> access_error() currently does not check for execution permission
> violation. As a result, spurious page-faults due to execution permission
> violation cause SIGSEGV.

This is a bit muddy on the problem statement.  I get that spurious
faults can theoretically cause this, but *do* they in practice on
current kernels?

> It appears not to be an issue so far, but the next patches avoid TLB
> flushes on permission promotion, which can lead to this scenario. nodejs
> for instance crashes when TLB flush is avoided on permission promotion.

By "it appears not to be an issue", do you mean that this suboptimal
behavior can not be triggered, period?  Or, it can be triggered but
folks seem not to care that it can be triggered?

I *think* these can be triggered today.  I think it takes two threads
that do something like:

	Thread 1			Thread 2
	========			========
	ptr = malloc();
	memcpy(ptr, &code, len);
	exec_now = 1;
					while (!exec_now);
					call(ptr);		
					// fault	
	mprotect(ptr, len, PROT_EXEC);
					// fault sees VM_EXEC


But that has a bug: exec_now is set before the mprotect().  It's not
sane code.
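
Spelled out as a (deliberately broken) pthreads program, that pattern
would be roughly the sketch below; it is illustrative only, with made-up
details such as the single-'ret' code blob, and it keeps the same
ordering bug on purpose:

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

static void *code_page;			/* anon page the code is copied into */
static volatile int exec_now;

/* x86-64 'ret'; stands in for the real code blob (made-up detail) */
static const unsigned char code[] = { 0xc3 };

static void *thread2(void *arg)
{
	while (!exec_now)
		;				/* spin until thread 1 says go */
	((void (*)(void))code_page)();		/* may fault: page not PROT_EXEC yet */
	return NULL;
}

int main(void)
{
	pthread_t t2;

	code_page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pthread_create(&t2, NULL, thread2, NULL);

	memcpy(code_page, code, sizeof(code));
	exec_now = 1;				/* the bug: set before mprotect() */
	mprotect(code_page, 4096, PROT_READ | PROT_EXEC);

	pthread_join(t2, NULL);
	return 0;
}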

Can any sane code trigger this?

> Add a check to prevent access_error() from returning mistakenly that
> spurious page-faults due to instruction fetch are a reason for an access
> error.
> 
> It is assumed that error code bits of "instruction fetch" and "write" in
> the hardware error code are mutual exclusive, and the change assumes so.
> However, to be on the safe side, especially if hypervisors misbehave,
> assert this is the case and warn otherwise.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Will Deacon <will@kernel.org>
> Cc: Yu Zhao <yuzhao@google.com>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: x86@kernel.org
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  arch/x86/mm/fault.c | 22 ++++++++++++++++++++--
>  1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index d0074c6ed31a..ad0ef0a6087a 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>  				       (error_code & X86_PF_INSTR), foreign))
>  		return 1;
>  
> -	if (error_code & X86_PF_WRITE) {
> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
> +		/*
> +		 * CPUs are not expected to set the two error code bits
> +		 * together, but to ensure that hypervisors do not misbehave,
> +		 * run an additional sanity check.
> +		 */
> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
> +					(X86_PF_WRITE|X86_PF_INSTR)) {
> +			WARN_ON_ONCE(1);
> +			return 1;
> +		}

access_error() is only used on the do_user_addr_fault() side of things.
 Can we stick this check somewhere that also works for kernel address
faults?  This is a generic sanity check.  It can also be in a separate
patch.

Also, we should *probably* stop talking about CPUs here.  If there's
ever something wonky with error code bits, I'd put my money on a weird
hypervisor before any kind of CPU issue.

>  		/* write, present and write, not present: */
> -		if (unlikely(!(vma->vm_flags & VM_WRITE)))
> +		if ((error_code & X86_PF_WRITE) &&
> +		    unlikely(!(vma->vm_flags & VM_WRITE)))
> +			return 1;
> +
> +		/* exec, present and exec, not present: */
> +		if ((error_code & X86_PF_INSTR) &&
> +		    unlikely(!(vma->vm_flags & VM_EXEC)))
>  			return 1;
> +
>  		return 0;
>  	}

This is getting really ugly.  I think we've gone over this before, but
it escapes me.  Why do we need a common (X86_PF_WRITE | X86_PF_INSTR)
block of code?  Why can't we just add a simple X86_PF_INSTR if() that
mirrors the current X86_PF_WRITE one?


        if (error_code & X86_PF_INSTR) {
                /* present and not exec: */
                if (unlikely(!(vma->vm_flags & VM_EXEC)))
                        return 1;
                return 0;
        }
Nadav Amit March 11, 2022, 8:38 p.m. UTC | #2
> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/11/22 11:07, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> access_error() currently does not check for execution permission
>> violation. As a result, spurious page-faults due to execution permission
>> violation cause SIGSEGV.
> 
> This is a bit muddy on the problem statement.  I get that spurious
> faults can theoretically cause this, but *do* they in practice on
> current kernels?
> 
>> It appears not to be an issue so far, but the next patches avoid TLB
>> flushes on permission promotion, which can lead to this scenario. nodejs
>> for instance crashes when TLB flush is avoided on permission promotion.
> 
> By "it appears not to be an issue", do you mean that this suboptimal
> behavior can not be triggered, period?  Or, it can be triggered but
> folks seem not to care that it can be triggered?
> 
> I *think* these can be triggered today.  I think it takes two threads
> that do something like:
> 
> 	Thread 1			Thread 2
> 	========			========
> 	ptr = malloc();
> 	memcpy(ptr, &code, len);
> 	exec_now = 1;
> 					while (!exec_now);
> 					call(ptr);		
> 					// fault	
> 	mprotect(ptr, len, PROT_EXEC);
> 					// fault sees VM_EXEC
> 
> 
> But that has a bug: exec_now is set before the mprotect().  It's not
> sane code.
> 
> Can any sane code trigger this?

Well, regarding this question and the previous one: I do not think that
this scenario is possible today since mprotect() holds the mmap_lock
for write. There is no other code that I am aware of that toggles
the NX bit on a present entry.

But I will not bet my life on it. That’s the reason for the somewhat
vague phrasing that I used.

>> 
>> index d0074c6ed31a..ad0ef0a6087a 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>> 				       (error_code & X86_PF_INSTR), foreign))
>> 		return 1;
>> 
>> -	if (error_code & X86_PF_WRITE) {
>> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>> +		/*
>> +		 * CPUs are not expected to set the two error code bits
>> +		 * together, but to ensure that hypervisors do not misbehave,
>> +		 * run an additional sanity check.
>> +		 */
>> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>> +					(X86_PF_WRITE|X86_PF_INSTR)) {
>> +			WARN_ON_ONCE(1);
>> +			return 1;
>> +		}
> 
> access_error() is only used on the do_user_addr_fault() side of things.
> Can we stick this check somewhere that also works for kernel address
> faults?  This is a generic sanity check.  It can also be in a separate
> patch.

I can wrap it in a different function and also call it from
do_kern_addr_fault() or spurious_kernel_fault().

Anyhow, spurious_kernel_fault() should handle spurious faults on
executable code correctly. 

> 
> Also, we should *probably* stop talking about CPUs here.  If there's
> ever something wonky with error code bits, I'd put my money on a weird
> hypervisor before any kind of CPU issue.

I thought I managed to convey exactly that in the comment. Can you provide
a better phrasing?

> 
>> 		/* write, present and write, not present: */
>> -		if (unlikely(!(vma->vm_flags & VM_WRITE)))
>> +		if ((error_code & X86_PF_WRITE) &&
>> +		    unlikely(!(vma->vm_flags & VM_WRITE)))
>> +			return 1;
>> +
>> +		/* exec, present and exec, not present: */
>> +		if ((error_code & X86_PF_INSTR) &&
>> +		    unlikely(!(vma->vm_flags & VM_EXEC)))
>> 			return 1;
>> +
>> 		return 0;
>> 	}
> 
> This is getting really ugly.  I think we've gone over this before, but
> it escapes me.  Why do we need a common (X86_PF_WRITE | X86_PF_INSTR)
> block of code?  Why can't we just add a simple X86_PF_INSTR if() that
> mirrors the current X86_PF_WRITE one?
> 
> 
>        if (error_code & X86_PF_INSTR) {
>                /* present and not exec: */
>                if (unlikely(!(vma->vm_flags & VM_EXEC)))
>                        return 1;
>                return 0;
>        }

You are correct. My bad. I will fix it.
Dave Hansen March 11, 2022, 8:59 p.m. UTC | #3
On 3/11/22 12:38, Nadav Amit wrote:
>> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote:
...
>> Can any sane code trigger this?
> 
> Well, regarding this question and the previous one: I do not think that
> this scenario is possible today since mprotect() holds the mmap_lock
> for write. There is no other code that I am aware of that toggles
> the NX bit on a present entry.
> 
> But I will not bet my life on it. That’s the reason for the somewhat
> vague phrasing that I used.

From the userspace perspective, mmap(MAP_FIXED) can do this too.  But,
sane userspace can't rely on the syscall to have done any work and the
TLB flushing is currently done before the syscall returns.

I'd put it this way:

	Today, it is possible for a thread to end up in access_error()
	for a PF_INSN fault and observe a VM_EXEC VMA.  If you are
	generous, this could be considered a spurious fault.

	However, the faulting thread would have had to race with the
	thread which was changing the PTE and the VMA and is currently
	*in* mprotect() (or some other syscall).  In other words, the
	faulting thread can encounter this situation, but it never had
	any assurance from the kernel that it wouldn't fault.  This is
	because the faulting thread never had a chance to observe the
	syscall return.

	There is no evidence that the existing behavior can cause any
	issues with sane userspace.

>>> index d0074c6ed31a..ad0ef0a6087a 100644
>>> --- a/arch/x86/mm/fault.c
>>> +++ b/arch/x86/mm/fault.c
>>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>>> 				       (error_code & X86_PF_INSTR), foreign))
>>> 		return 1;
>>>
>>> -	if (error_code & X86_PF_WRITE) {
>>> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>>> +		/*
>>> +		 * CPUs are not expected to set the two error code bits
>>> +		 * together, but to ensure that hypervisors do not misbehave,
>>> +		 * run an additional sanity check.
>>> +		 */
>>> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>>> +					(X86_PF_WRITE|X86_PF_INSTR)) {
>>> +			WARN_ON_ONCE(1);
>>> +			return 1;
>>> +		}
>>
>> access_error() is only used on the do_user_addr_fault() side of things.
>> Can we stick this check somewhere that also works for kernel address
>> faults?  This is a generic sanity check.  It can also be in a separate
>> patch.
> 
> I can wrap it in a different function and also call it from
> do_kern_addr_fault() or spurious_kernel_fault().
> 
> Anyhow, spurious_kernel_fault() should handle spurious faults on
> executable code correctly. 

This is really about checking the sanity of the "hardware"-provided
error code.  Let's just do it in  handle_page_fault(), maybe hidden in a
function like:

void check_error_code_sanity(unsigned long error_code)
{
	WARN_ON_ONCE(...);
}

You can leave the X86_PF_PK check in place for now.  It's probably going
away soon anyway.

>> Also, we should *probably* stop talking about CPUs here.  If there's
>> ever something wonky with error code bits, I'd put my money on a weird
>> hypervisor before any kind of CPU issue.
> 
> I thought I managed to convey exactly that in the comment. Can you provide
> a better phrasing?

Maybe:

	/*
	 * X86_PF_INSTR for instruction _fetches_.  Fetches never write.
	 * X86_PF_WRITE should never be set with X86_PF_INSTR.
	 *
	 * This is most likely due to a buggy hypervisor.
	 */
Nadav Amit March 11, 2022, 9:16 p.m. UTC | #4
> On Mar 11, 2022, at 12:59 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 3/11/22 12:38, Nadav Amit wrote:
>>> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> ...
>>> Can any sane code trigger this?
>> 
>> Well, regarding this question and the previous one: I do not think that
>> this scenario is possible today since mprotect() holds the mmap_lock
>> for write. There is no other code that I am aware of that toggles
>> the NX bit on a present entry.
>> 
>> But I will not bet my life on it. That’s the reason for the somewhat
>> vague phrasing that I used.
> 
> From the userspace perspective, mmap(MAP_FIXED) can do this too.  But,
> sane userspace can't rely on the syscall to have done any work and the
> TLB flushing is currently done before the syscall returns.
> 
> I'd put it this way:
> 
> 	Today, it is possible for a thread to end up in access_error()
> 	for a PF_INSN fault and observe a VM_EXEC VMA.  If you are
> 	generous, this could be considered a spurious fault.
> 
> 	However, the faulting thread would have had to race with the
> 	thread which was changing the PTE and the VMA and is currently
> 	*in* mprotect() (or some other syscall).  In other words, the
> 	faulting thread can encounter this situation, but it never had
> 	any assurance from the kernel that it wouldn't fault.  This is
> 	because the faulting thread never had a chance to observe the
> 	syscall return.
> 
> 	There is no evidence that the existing behavior can cause any
> 	issues with sane userspace.

Done. Thanks.

> 
>>>> index d0074c6ed31a..ad0ef0a6087a 100644
>>>> --- a/arch/x86/mm/fault.c
>>>> +++ b/arch/x86/mm/fault.c
>>>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
>>>> 				       (error_code & X86_PF_INSTR), foreign))
>>>> 		return 1;
>>>> 
>>>> -	if (error_code & X86_PF_WRITE) {
>>>> +	if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
>>>> +		/*
>>>> +		 * CPUs are not expected to set the two error code bits
>>>> +		 * together, but to ensure that hypervisors do not misbehave,
>>>> +		 * run an additional sanity check.
>>>> +		 */
>>>> +		if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) ==
>>>> +					(X86_PF_WRITE|X86_PF_INSTR)) {
>>>> +			WARN_ON_ONCE(1);
>>>> +			return 1;
>>>> +		}
>>> 
>>> access_error() is only used on the do_user_addr_fault() side of things.
>>> Can we stick this check somewhere that also works for kernel address
>>> faults?  This is a generic sanity check.  It can also be in a separate
>>> patch.
>> 
>> I can wrap it in a different function and also call it from
>> do_kern_addr_fault() or spurious_kernel_fault().
>> 
>> Anyhow, spurious_kernel_fault() should handle spurious faults on
>> executable code correctly. 
> 
> This is really about checking the sanity of the "hardware"-provided
> error code.  Let's just do it in  handle_page_fault(), maybe hidden in a
> function like:
> 
> void check_error_code_sanity(unsigned long error_code)
> {
> 	WARN_ON_ONCE(...);
> }
> 
> You can leave the X86_PF_PK check in place for now.  It's probably going
> away soon anyway.

Done. Thanks. But note that removing the check from access_error() means
that if the assertion is broken, userspace might crash inadvertently
(in contrast to the version I sent, which would have potentially led to an
infinite stream of page-faults). I don’t know which behavior is better,
so let’s go with your version and just hope it doesn’t happen.

> 
>>> Also, we should *probably* stop talking about CPUs here.  If there's
>>> ever something wonky with error code bits, I'd put my money on a weird
>>> hypervisor before any kind of CPU issue.
>> 
>> I thought I managed to convey exactly that in the comment. Can you provide
>> a better phrasing?
> 
> Maybe:
> 
> 	/*
> 	 * X86_PF_INSTR for instruction _fetches_.  Fetches never write.
> 	 * X86_PF_WRITE should never be set with X86_PF_INSTR.
> 	 *
> 	 * This is most likely due to a buggy hypervisor.
> 	 */

Done, thank you.
Dave Hansen March 11, 2022, 9:23 p.m. UTC | #5
On 3/11/22 13:16, Nadav Amit wrote:
>> This is really about checking the sanity of the "hardware"-provided
>> error code.  Let's just do it in  handle_page_fault(), maybe hidden in a
>> function like:
>>
>> void check_error_code_sanity(unsigned long error_code)
>> {
>> 	WARN_ON_ONCE(...);
>> }
>>
>> You can leave the X86_PF_PK check in place for now.  It's probably going
>> away soon anyway.
> Done. Thanks. But note that removing the check from access_error() means
> that if the assertion is broken, userspace might crash inadvertently
> (in contrast to the version I sent, which would have potentially led to
> infinite stream of page-faults). I don’t know which behavior is better,
> so let’s go with your version and just hope it doesn’t happen.

Actually, crashing sounds much nicer to me than infinite page faults.
It's a lot easier to debug, *especially* with a warning on dmesg.