Message ID | 20220311190749.338281-1-namit@vmware.com (mailing list archive) |
---|---|
Headers | show |
Series | mm/mprotect: avoid unnecessary TLB flushes | expand |
On 3/11/22 11:07, Nadav Amit wrote: > From: Nadav Amit <namit@vmware.com> > > access_error() currently does not check for execution permission > violation. As a result, spurious page-faults due to execution permission > violation cause SIGSEGV. This is a bit muddy on the problem statement. I get that spurious faults can theoretically cause this, but *do* they in practice on current kernels? > It appears not to be an issue so far, but the next patches avoid TLB > flushes on permission promotion, which can lead to this scenario. nodejs > for instance crashes when TLB flush is avoided on permission promotion. By "it appears not to be an issue", do you mean that this suboptimal behavior can not be triggered, period? Or, it can be triggered but folks seem not to care that it can be triggered? I *think* these can be triggered today. I think it takes two threads that do something like: Thread 1 Thread 2 ======== ======== ptr = malloc(); memcpy(ptr, &code, len); exec_now = 1; while (!exec_now); call(ptr); // fault mprotect(ptr, PROT_EXEC, len); // fault sees VM_EXEC But that has a bug: exec_now is set before the mprotect(). It's not sane code. Can any sane code trigger this? > Add a check to prevent access_error() from returning mistakenly that > spurious page-faults due to instruction fetch are a reason for an access > error. > > It is assumed that error code bits of "instruction fetch" and "write" in > the hardware error code are mutual exclusive, and the change assumes so. > However, to be on the safe side, especially if hypervisors misbehave, > assert this is the case and warn otherwise. > > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Andrew Cooper <andrew.cooper3@citrix.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Andy Lutomirski <luto@kernel.org> > Cc: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Will Deacon <will@kernel.org> > Cc: Yu Zhao <yuzhao@google.com> > Cc: Nick Piggin <npiggin@gmail.com> > Cc: x86@kernel.org > Signed-off-by: Nadav Amit <namit@vmware.com> > --- > arch/x86/mm/fault.c | 22 ++++++++++++++++++++-- > 1 file changed, 20 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index d0074c6ed31a..ad0ef0a6087a 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) > (error_code & X86_PF_INSTR), foreign)) > return 1; > > - if (error_code & X86_PF_WRITE) { > + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { > + /* > + * CPUs are not expected to set the two error code bits > + * together, but to ensure that hypervisors do not misbehave, > + * run an additional sanity check. > + */ > + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) == > + (X86_PF_WRITE|X86_PF_INSTR)) { > + WARN_ON_ONCE(1); > + return 1; > + } access_error() is only used on the do_user_addr_fault() side of things. Can we stick this check somewhere that also works for kernel address faults? This is a generic sanity check. It can also be in a separate patch. Also, we should *probably* stop talking about CPUs here. If there's ever something wonky with error code bits, I'd put my money on a weird hypervisor before any kind of CPU issue. > /* write, present and write, not present: */ > - if (unlikely(!(vma->vm_flags & VM_WRITE))) > + if ((error_code & X86_PF_WRITE) && > + unlikely(!(vma->vm_flags & VM_WRITE))) > + return 1; > + > + /* exec, present and exec, not present: */ > + if ((error_code & X86_PF_INSTR) && > + unlikely(!(vma->vm_flags & VM_EXEC))) > return 1; > + > return 0; > } This is getting really ugly. I think we've gone over this before, but it escapes me. Why do we need a common (X86_PF_WRITE | X86_PF_INSTR) block of code? Why can't we just add a simple X86_PF_INSN if() that mirrors the current X86_PF_WRITE one? if (error_code & X86_PF_INSN) { /* present and not exec: */ if (unlikely(!(vma->vm_flags & VM_EXEC))) return 1; return 0; }
> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 3/11/22 11:07, Nadav Amit wrote: >> From: Nadav Amit <namit@vmware.com> >> >> access_error() currently does not check for execution permission >> violation. As a result, spurious page-faults due to execution permission >> violation cause SIGSEGV. > > This is a bit muddy on the problem statement. I get that spurious > faults can theoretically cause this, but *do* they in practice on > current kernels? > >> It appears not to be an issue so far, but the next patches avoid TLB >> flushes on permission promotion, which can lead to this scenario. nodejs >> for instance crashes when TLB flush is avoided on permission promotion. > > By "it appears not to be an issue", do you mean that this suboptimal > behavior can not be triggered, period? Or, it can be triggered but > folks seem not to care that it can be triggered? > > I *think* these can be triggered today. I think it takes two threads > that do something like: > > Thread 1 Thread 2 > ======== ======== > ptr = malloc(); > memcpy(ptr, &code, len); > exec_now = 1; > while (!exec_now); > call(ptr); > // fault > mprotect(ptr, PROT_EXEC, len); > // fault sees VM_EXEC > > > But that has a bug: exec_now is set before the mprotect(). It's not > sane code. > > Can any sane code trigger this? Well, regarding this question and the previous one: I do not think that this scenario is possible today since mprotect() holds the mmap_lock for write. There is no other code that I am aware of that toggles the NX bit on a present entry. But I will not bet my life on it. That’s the reason for the somewhat vague phrasing that I used. >> >> index d0074c6ed31a..ad0ef0a6087a 100644 >> --- a/arch/x86/mm/fault.c >> +++ b/arch/x86/mm/fault.c >> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) >> (error_code & X86_PF_INSTR), foreign)) >> return 1; >> >> - if (error_code & X86_PF_WRITE) { >> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { >> + /* >> + * CPUs are not expected to set the two error code bits >> + * together, but to ensure that hypervisors do not misbehave, >> + * run an additional sanity check. >> + */ >> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) == >> + (X86_PF_WRITE|X86_PF_INSTR)) { >> + WARN_ON_ONCE(1); >> + return 1; >> + } > > access_error() is only used on the do_user_addr_fault() side of things. > Can we stick this check somewhere that also works for kernel address > faults? This is a generic sanity check. It can also be in a separate > patch. I can wrap it in a different function and also call it from do_kern_addr_fault() or spurious_kernel_fault(). Anyhow, spurious_kernel_fault() should handle spurious faults on executable code correctly. > > Also, we should *probably* stop talking about CPUs here. If there's > ever something wonky with error code bits, I'd put my money on a weird > hypervisor before any kind of CPU issue. I thought I manage to convey exactly that in the comment. Can you provide a better phrasing? > >> /* write, present and write, not present: */ >> - if (unlikely(!(vma->vm_flags & VM_WRITE))) >> + if ((error_code & X86_PF_WRITE) && >> + unlikely(!(vma->vm_flags & VM_WRITE))) >> + return 1; >> + >> + /* exec, present and exec, not present: */ >> + if ((error_code & X86_PF_INSTR) && >> + unlikely(!(vma->vm_flags & VM_EXEC))) >> return 1; >> + >> return 0; >> } > > This is getting really ugly. I think we've gone over this before, but > it escapes me. Why do we need a common (X86_PF_WRITE | X86_PF_INSTR) > block of code? Why can't we just add a simple X86_PF_INSN if() that > mirrors the current X86_PF_WRITE one? > > > if (error_code & X86_PF_INSN) { > /* present and not exec: */ > if (unlikely(!(vma->vm_flags & VM_EXEC))) > return 1; > return 0; > } You are correct. My bad. I will fix it.
On 3/11/22 12:38, Nadav Amit wrote: >> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote: ... >> Can any sane code trigger this? > > Well, regarding this question and the previous one: I do not think that > this scenario is possible today since mprotect() holds the mmap_lock > for write. There is no other code that I am aware of that toggles > the NX bit on a present entry. > > But I will not bet my life on it. That’s the reason for the somewhat > vague phrasing that I used. From the userspace perspective, mmap(MAP_FIXED) can do this too. But, sane userspace can't rely on the syscall to have done any work and the TLB flushing is currently done before the syscall returns. I'd put it this way: Today, it is possible for a thread to end up in access_error() for a PF_INSN fault and observe a VM_EXEC VMA. If you are generous, this could be considered a spurious fault. However, the faulting thread would have had to race with the thread which was changing the PTE and the VMA and is currently *in* mprotect() (or some other syscall). In other words, the faulting thread can encounter this situation, but it never had any assurance from the kernel that it wouldn't fault. This is because the faulting thread never had a chance to observe the syscall return. There is no evidence that the existing behavior can cause any issues with sane userspace. >>> index d0074c6ed31a..ad0ef0a6087a 100644 >>> --- a/arch/x86/mm/fault.c >>> +++ b/arch/x86/mm/fault.c >>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) >>> (error_code & X86_PF_INSTR), foreign)) >>> return 1; >>> >>> - if (error_code & X86_PF_WRITE) { >>> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { >>> + /* >>> + * CPUs are not expected to set the two error code bits >>> + * together, but to ensure that hypervisors do not misbehave, >>> + * run an additional sanity check. >>> + */ >>> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) == >>> + (X86_PF_WRITE|X86_PF_INSTR)) { >>> + WARN_ON_ONCE(1); >>> + return 1; >>> + } >> >> access_error() is only used on the do_user_addr_fault() side of things. >> Can we stick this check somewhere that also works for kernel address >> faults? This is a generic sanity check. It can also be in a separate >> patch. > > I can wrap it in a different function and also call it from > do_kern_addr_fault() or spurious_kernel_fault(). > > Anyhow, spurious_kernel_fault() should handle spurious faults on > executable code correctly. This is really about checking the sanity of the "hardware"-provided error code. Let's just do it in handle_page_fault(), maybe hidden in a function like: void check_error_code_sanity(unsigned long error_code) { WARN_ON_ONCE(...); } You can leave the X86_PF_PK check in place for now. It's probably going away soon anyway. >> Also, we should *probably* stop talking about CPUs here. If there's >> ever something wonky with error code bits, I'd put my money on a weird >> hypervisor before any kind of CPU issue. > > I thought I manage to convey exactly that in the comment. Can you provide > a better phrasing? Maybe: /* * X86_PF_INSTR for instruction _fetches_. Fetches never write. * X86_PF_WRITE should never be set with X86_PF_INSTR. * * This is most likely due to a buggy hypervisor. */
> On Mar 11, 2022, at 12:59 PM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 3/11/22 12:38, Nadav Amit wrote: >>> On Mar 11, 2022, at 11:41 AM, Dave Hansen <dave.hansen@intel.com> wrote: > ... >>> Can any sane code trigger this? >> >> Well, regarding this question and the previous one: I do not think that >> this scenario is possible today since mprotect() holds the mmap_lock >> for write. There is no other code that I am aware of that toggles >> the NX bit on a present entry. >> >> But I will not bet my life on it. That’s the reason for the somewhat >> vague phrasing that I used. > > From the userspace perspective, mmap(MAP_FIXED) can do this too. But, > sane userspace can't rely on the syscall to have done any work and the > TLB flushing is currently done before the syscall returns. > > I'd put it this way: > > Today, it is possible for a thread to end up in access_error() > for a PF_INSN fault and observe a VM_EXEC VMA. If you are > generous, this could be considered a spurious fault. > > However, the faulting thread would have had to race with the > thread which was changing the PTE and the VMA and is currently > *in* mprotect() (or some other syscall). In other words, the > faulting thread can encounter this situation, but it never had > any assurance from the kernel that it wouldn't fault. This is > because the faulting thread never had a chance to observe the > syscall return. > > There is no evidence that the existing behavior can cause any > issues with sane userspace. Done. Thanks. > >>>> index d0074c6ed31a..ad0ef0a6087a 100644 >>>> --- a/arch/x86/mm/fault.c >>>> +++ b/arch/x86/mm/fault.c >>>> @@ -1107,10 +1107,28 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) >>>> (error_code & X86_PF_INSTR), foreign)) >>>> return 1; >>>> >>>> - if (error_code & X86_PF_WRITE) { >>>> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { >>>> + /* >>>> + * CPUs are not expected to set the two error code bits >>>> + * together, but to ensure that hypervisors do not misbehave, >>>> + * run an additional sanity check. >>>> + */ >>>> + if ((error_code & (X86_PF_WRITE|X86_PF_INSTR)) == >>>> + (X86_PF_WRITE|X86_PF_INSTR)) { >>>> + WARN_ON_ONCE(1); >>>> + return 1; >>>> + } >>> >>> access_error() is only used on the do_user_addr_fault() side of things. >>> Can we stick this check somewhere that also works for kernel address >>> faults? This is a generic sanity check. It can also be in a separate >>> patch. >> >> I can wrap it in a different function and also call it from >> do_kern_addr_fault() or spurious_kernel_fault(). >> >> Anyhow, spurious_kernel_fault() should handle spurious faults on >> executable code correctly. > > This is really about checking the sanity of the "hardware"-provided > error code. Let's just do it in handle_page_fault(), maybe hidden in a > function like: > > void check_error_code_sanity(unsigned long error_code) > { > WARN_ON_ONCE(...); > } > > You can leave the X86_PF_PK check in place for now. It's probably going > away soon anyway. Done. Thanks. But note that removing the check from access_error() means that if the assertion is broken, userspace might crash inadvertently (in contrast to the version I sent, which would have potentially led to infinite stream of page-faults). I don’t know which behavior is better, so let’s go with your version and just hope it doesn’t happen. > >>> Also, we should *probably* stop talking about CPUs here. If there's >>> ever something wonky with error code bits, I'd put my money on a weird >>> hypervisor before any kind of CPU issue. >> >> I thought I manage to convey exactly that in the comment. Can you provide >> a better phrasing? > > Maybe: > > /* > * X86_PF_INSTR for instruction _fetches_. Fetches never write. > * X86_PF_WRITE should never be set with X86_PF_INSTR. > * > * This is most likely due to a buggy hypervisor. > */ Done, thank you.
On 3/11/22 13:16, Nadav Amit wrote: >> This is really about checking the sanity of the "hardware"-provided >> error code. Let's just do it in handle_page_fault(), maybe hidden in a >> function like: >> >> void check_error_code_sanity(unsigned long error_code) >> { >> WARN_ON_ONCE(...); >> } >> >> You can leave the X86_PF_PK check in place for now. It's probably going >> away soon anyway. > Done. Thanks. But note that removing the check from access_error() means > that if the assertion is broken, userspace might crash inadvertently > (in contrast to the version I sent, which would have potentially led to > infinite stream of page-faults). I don’t know which behavior is better, > so let’s go with your version and just hope it doesn’t happen. Actually, crashing sounds much nicer to me than infinite page faults. It's a lot easier to debug, *especially* with a warning on dmesg.
From: Nadav Amit <namit@vmware.com> This patch-set is intended to remove unnecessary TLB flushes during mprotect() syscalls. Once this patch-set make it through, similar and further optimizations for MADV_COLD and userfaultfd would be possible. Sorry for the time between it took me to get to v3. Basically, there are 3 optimizations in this patch-set: 1. Use TLB batching infrastructure to batch flushes across VMAs and do better/fewer flushes. This would also be handy for later userfaultfd enhancements. 2. Avoid TLB flushes on permission demotion. This optimization is the one that provides most of the performance benefits. Note that the previous batching infrastructure changes are needed for that to happen. 3. Avoiding TLB flushes on change_huge_pmd() that are only needed to prevent the A/D bits from changing. Andrew asked for some benchmark numbers. I do not have an easy determinate macrobenchmark in which it is easy to show benefit. I therre ran a microbenchmark: a loop that does the following on anonymous memory, just as a sanity check to see that time is saved by avoiding TLB flushes. The loop goes: mprotect(p, PAGE_SIZE, PROT_READ) mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE) *p = 0; // make the page writable The test was run in KVM guest with 1 or 2 threads (the second thread was busy-looping). I measured the time (cycles) of each operation: 1 thread 2 threads mmots +patch mmots +patch PROT_READ 3494 2725 (-22%) 8630 7788 (-10%) PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%) [ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ] The exact numbers are really meaningless, but the benefit is clear. There are 2 interesting results though. (1) PROT_READ is cheaper, while one can expect it not to be affected. This is presumably due to TLB miss that is saved (2) Without memory access (*p = 0), the speedup of the patch is even greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush. As a result both operations on the patched kernel take roughly ~1500 cycles (with either 1 or 2 threads), whereas on mmotm their cost is as high as presented in the table. -- v2 -> v3: * Fix orders of patches (order could lead to breakage) * Better comments * Clearer KNL detection [Dave] * Assertion on PF error-code [Dave] * Comments, code, function names improvements [PeterZ] * Flush on access-bit clearing on PMD changes to follow the way flushing on x86 is done today in the kernel. v1 -> v2: * Wrong detection of permission demotion [Andrea] * Better comments [Andrea] * Handle THP [Andrea] * Batching across VMAs [Peter Xu] * Avoid open-coding PTE analysis * Fix wrong use of the mmu_gather() *** BLURB HERE *** Nadav Amit (5): x86: Detection of Knights Landing A/D leak x86/mm: check exec permissions on fault mm/mprotect: use mmu_gather mm/mprotect: do not flush on permission promotion mm: avoid unnecessary flush on change_huge_pmd() arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/pgtable.h | 5 ++ arch/x86/include/asm/pgtable_types.h | 2 + arch/x86/include/asm/tlbflush.h | 82 ++++++++++++++++++++++++ arch/x86/kernel/cpu/intel.c | 5 ++ arch/x86/mm/fault.c | 22 ++++++- arch/x86/mm/pgtable.c | 10 +++ fs/exec.c | 6 +- include/asm-generic/tlb.h | 14 +++++ include/linux/huge_mm.h | 5 +- include/linux/mm.h | 5 +- include/linux/pgtable.h | 20 ++++++ mm/huge_memory.c | 19 ++++-- mm/mprotect.c | 94 +++++++++++++++------------- mm/pgtable-generic.c | 8 +++ mm/userfaultfd.c | 6 +- 16 files changed, 248 insertions(+), 56 deletions(-)