Message ID | 20211021122112.592634-4-namit@vmware.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/mprotect: avoid unnecessary TLB flushes | expand |
On Thu, Oct 21, 2021 at 05:21:10AM -0700, Nadav Amit wrote: > From: Nadav Amit <namit@vmware.com> > Add a check to prevent access_error() from returning mistakenly that > page-faults due to instruction fetch are not allowed. Intel SDM does not > indicate whether "instruction fetch" and "write" in the hardware error > code are mutual exclusive, so check both before returning whether the > access is allowed. Dave, can we get that clarified? It seems a bit naf and leads to confusing code IMO. Other than that, the change looks ok to me. > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index b2eefdefc108..e776130473ce 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -1100,10 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) > (error_code & X86_PF_INSTR), foreign)) > return 1; > > - if (error_code & X86_PF_WRITE) { > + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { > /* write, present and write, not present: */ > - if (unlikely(!(vma->vm_flags & VM_WRITE))) > + if ((error_code & X86_PF_WRITE) && > + unlikely(!(vma->vm_flags & VM_WRITE))) > return 1; > + > + /* exec, present and exec, not present: */ > + if ((error_code & X86_PF_INSTR) && > + unlikely(!(vma->vm_flags & VM_EXEC))) > + return 1; > + > return 0; > }
On 25/10/2021 11:59, Peter Zijlstra wrote: > On Thu, Oct 21, 2021 at 05:21:10AM -0700, Nadav Amit wrote: >> From: Nadav Amit <namit@vmware.com> >> Add a check to prevent access_error() from returning mistakenly that >> page-faults due to instruction fetch are not allowed. Intel SDM does not >> indicate whether "instruction fetch" and "write" in the hardware error >> code are mutual exclusive, so check both before returning whether the >> access is allowed. > Dave, can we get that clarified? It seems a bit naf and leads to > confusing code IMO. There is no such thing as an instruction fetch (a read) causing a modification to the mapping. From this point of view, you'd never expect to see them both set. However, be aware that INSTR is only reported for NX || SMEP. Without either, instruction vs data accesses are distinguished internally (can be demonstrated with SMAP) but not visible in the pagefault error code. ~Andrew
On 10/21/21 5:21 AM, Nadav Amit wrote: > access_error() currently does not check for execution permission > violation. Ye > As a result, spurious page-faults due to execution permission > violation cause SIGSEGV. While I could totally believe that something is goofy when VMAs are being changed underneath a page fault, I'm having trouble figuring out why the "if (error_code & X86_PF_WRITE)" code is being modified. > It appears not to be an issue so far, but the next patches avoid TLB > flushes on permission promotion, which can lead to this scenario. nodejs > for instance crashes when TLB flush is avoided on permission promotion. Just to be clear, "promotion" is going from something like: W=0->W=1 or NX=1->NX=0 right? I tend to call that "relaxing" permissions. Currently, X86_PF_WRITE faults are considered an access error unless the VMA to which the write occurred allows writes. Returning "no access error" permits continuing and handling the copy-on-write. It sounds like you want to expand that. You want to add a whole class of new faults that can be ignored: not just that some COW handling might be necessary, but that the PTE itself might be out of date. Just like a "COW fault" may just result in setting the PTE.W=1 and moving on with our day, an instruction fault might now just end up with setting PTE.NX=0 and also moving on with our day. I'm really confused why the "error_code & X86_PF_WRITE" case is getting modified. I would have expected it to be something like just adding: /* read, instruction fetch */ if (error_code & X86_PF_INSN) { /* Avoid enforcing access error if spurious: */ if (unlikely(!(vma->vm_flags & VM_EXEC))) return 1; return 0; } I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common other than both being able to (now) be generated spuriously.
On 10/25/21 3:59 AM, Peter Zijlstra wrote: >> Add a check to prevent access_error() from returning mistakenly that >> page-faults due to instruction fetch are not allowed. Intel SDM does not >> indicate whether "instruction fetch" and "write" in the hardware error >> code are mutual exclusive, so check both before returning whether the >> access is allowed. > Dave, can we get that clarified? It seems a bit naf and leads to > confusing code IMO. We can, but there are quite a few implicit relationships in those bits. PF_INSN and PF_PK can't ever be set together, for instance. It's pretty clear as long as you have fetch==read in your head.
> On Oct 25, 2021, at 7:20 AM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 10/21/21 5:21 AM, Nadav Amit wrote: >> access_error() currently does not check for execution permission >> violation. > Ye > >> As a result, spurious page-faults due to execution permission >> violation cause SIGSEGV. > > While I could totally believe that something is goofy when VMAs are > being changed underneath a page fault, I'm having trouble figuring out > why the "if (error_code & X86_PF_WRITE)" code is being modified. In the scenario I mentioned the VMAs are not changed underneath the page-fault. They change *before* the page-fault, but there are residues of the old PTE in the TLB. > >> It appears not to be an issue so far, but the next patches avoid TLB >> flushes on permission promotion, which can lead to this scenario. nodejs >> for instance crashes when TLB flush is avoided on permission promotion. > > Just to be clear, "promotion" is going from something like: > > W=0->W=1 > or > NX=1->NX=0 > > right? I tend to call that "relaxing" permissions. I specifically talk about NX=1>NX=0. I can change the language to “relaxing”. > > Currently, X86_PF_WRITE faults are considered an access error unless the > VMA to which the write occurred allows writes. Returning "no access > error" permits continuing and handling the copy-on-write. > > It sounds like you want to expand that. You want to add a whole class > of new faults that can be ignored: not just that some COW handling might > be necessary, but that the PTE itself might be out of date. Just like > a "COW fault" may just result in setting the PTE.W=1 and moving on with > our day, an instruction fault might now just end up with setting > PTE.NX=0 and also moving on with our day. You raise an interesting idea (which can easily be implemented with uffd), but no - I had none of that in my mind. My only purpose is to deal with actual spurious page-faults that I encountered when I removed the TLB flush the happens after NX=1->NX=0. I am actually surprised that the kernel makes such a strong assumption that every change of NX=1->NX=0 would be followed by a TLB flush, and that during these changes the mm is locked for write. But that is the case. If you do not have this change and a PTE is changed from NX=1->NX=0 and *later* you access the page, you can have a page-fault due to stale PTE, and get a SIGSEGV since access_error() is wrong to assume that this is an invalid access. I did not change and there are no changes to the VMA during the page-fault. The page-fault handler would do pretty much nothing and return to user-space which would retry the instruction. [ page-fault triggers an implicit TLB flush of the offending PTE ] > > I'm really confused why the "error_code & X86_PF_WRITE" case is getting > modified. I would have expected it to be something like just adding: > > /* read, instruction fetch */ > if (error_code & X86_PF_INSN) { > /* Avoid enforcing access error if spurious: */ > if (unlikely(!(vma->vm_flags & VM_EXEC))) > return 1; > return 0; > } > > I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common > other than both being able to (now) be generated spuriously. That was my first version, but I was concerned that perhaps there is some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can be set. That is the reason that Peter asked you whether this is something that might happen. If you confirm they cannot be both set, I would the version you just mentioned.
On 10/25/21 9:19 AM, Nadav Amit wrote: > That was my first version, but I was concerned that perhaps there is > some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can > be set. That is the reason that Peter asked you whether this is > something that might happen. > > If you confirm they cannot be both set, I would the version you just > mentioned. I'm pretty sure they can't be set together on any sane hardware. A bonkers hypervisor or CPU could do it of course, but they'd be crazy. BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set. That would be a nice place to talk about the assumption.
> On Oct 25, 2021, at 10:45 AM, Dave Hansen <dave.hansen@intel.com> wrote: > > On 10/25/21 9:19 AM, Nadav Amit wrote: >> That was my first version, but I was concerned that perhaps there is >> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can >> be set. That is the reason that Peter asked you whether this is >> something that might happen. >> >> If you confirm they cannot be both set, I would the version you just >> mentioned. > > I'm pretty sure they can't be set together on any sane hardware. A > bonkers hypervisor or CPU could do it of course, but they'd be crazy. > > BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set. > That would be a nice place to talk about the assumption. > I can do that. But be aware that if the assumption is broken, it might lead to the application getting stuck in an infinite loop of page-faults instead of receiving SIGSEGV.
On 10/25/21 10:51 AM, Nadav Amit wrote: >> On Oct 25, 2021, at 10:45 AM, Dave Hansen <dave.hansen@intel.com> wrote: >> On 10/25/21 9:19 AM, Nadav Amit wrote: >>> That was my first version, but I was concerned that perhaps there is >>> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can >>> be set. That is the reason that Peter asked you whether this is >>> something that might happen. >>> >>> If you confirm they cannot be both set, I would the version you just >>> mentioned. >> I'm pretty sure they can't be set together on any sane hardware. A >> bonkers hypervisor or CPU could do it of course, but they'd be crazy. >> >> BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set. >> That would be a nice place to talk about the assumption. >> > I can do that. But be aware that if the assumption is broken, it might > lead to the application getting stuck in an infinite loop of > page-faults instead of receiving SIGSEGV. If we have a bonkers hypervisor/CPU, I'm OK with a process that hangs like that, especially if we can ^C it and see its stream of page faults with tracing or whatever. Couldn't we just also do: if ((code & (X86_PF_WRITE|X86_PF_INSN) == (X86_PF_WRITE|X86_PF_INSN)) { WARN_ON_ONCE(1); return 1; } That should give you the WARN_ON_ONCE() and also return an affirmative access_error(), resulting in a SIGSEGV. (I'm not sure I like the indentation as I wrote it here... just do what looks best in the code)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index b2eefdefc108..e776130473ce 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1100,10 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma) (error_code & X86_PF_INSTR), foreign)) return 1; - if (error_code & X86_PF_WRITE) { + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) { /* write, present and write, not present: */ - if (unlikely(!(vma->vm_flags & VM_WRITE))) + if ((error_code & X86_PF_WRITE) && + unlikely(!(vma->vm_flags & VM_WRITE))) return 1; + + /* exec, present and exec, not present: */ + if ((error_code & X86_PF_INSTR) && + unlikely(!(vma->vm_flags & VM_EXEC))) + return 1; + return 0; }