Message ID | 20180607143705.3531-7-yu-cheng.yu@intel.com (mailing list archive)
---|---
State | New, archived
On Thu, Jun 7, 2018 at 7:40 AM Yu-cheng Yu <yu-cheng.yu@intel.com> wrote:
>
> The function ptep_set_wrprotect()/huge_ptep_set_wrprotect() is
> used by copy_page_range()/copy_hugetlb_page_range() to copy
> PTEs.
>
> On x86, when the shadow stack is enabled, only a shadow stack
> PTE has the read-only and _PAGE_DIRTY_HW combination. Upon
> making a dirty PTE read-only, we move its _PAGE_DIRTY_HW to
> _PAGE_DIRTY_SW.
>
> When ptep_set_wrprotect() moves _PAGE_DIRTY_HW to _PAGE_DIRTY_SW,
> if the PTE is writable and the mm is shared, another task could
> race to set _PAGE_DIRTY_HW again.
>
> Introduce ptep_set_wrprotect_flush(), pmdp_set_wrprotect_flush(),
> and huge_ptep_set_wrprotect_flush() to make sure this does not
> happen.

This patch adds flushes where they didn't previously exist.

> +static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	bool rw;
> +
> +	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> +	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
> +		struct mm_struct *mm = vma->vm_mm;
> +		pte_t pte;
> +
> +		if (rw && (atomic_read(&mm->mm_users) > 1))
> +			pte = ptep_clear_flush(vma, addr, ptep);

Why are you clearing the pte?

> -#define __HAVE_ARCH_PMDP_SET_WRPROTECT
> -static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> -				      unsigned long addr, pmd_t *pmdp)
> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT_FLUSH
> +static inline void huge_ptep_set_wrprotect_flush(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
>  {
> -	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
> +	ptep_set_wrprotect_flush(vma, addr, ptep);

Maybe I'm just missing something, but you've changed the semantics of
this function significantly.
On 06/07/2018 09:24 AM, Andy Lutomirski wrote:
>> +static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	bool rw;
>> +
>> +	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>> +	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
>> +		struct mm_struct *mm = vma->vm_mm;
>> +		pte_t pte;
>> +
>> +		if (rw && (atomic_read(&mm->mm_users) > 1))
>> +			pte = ptep_clear_flush(vma, addr, ptep);
> Why are you clearing the pte?

I think I insisted on this being in there.

First of all, we need to flush the TLB eventually because we need the
shadowstack PTE permissions to be in effect.

But, generally, we can't clear a dirty bit in a "live" PTE without
flushing.  The processor can keep writing until we flush, and even keep
setting it whenever _it_ allows a write, which it can do based on stale
TLB contents.  Practically, I think a walk to set the dirty bit is
mostly the same as a TLB miss, but that's certainly not guaranteed
forever.

That's even ignoring all the fun errata we have.
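Written out as code, the ordering Dave describes would look roughly like
the sketch below. This is an illustration of the clear-then-flush
sequence, not code from the patch: the function name is made up, while
pte_move_flags() and the _PAGE_DIRTY_HW/_PAGE_DIRTY_SW flags come from
this series.

/*
 * A minimal sketch of the sequence described above, assuming the
 * helpers from this series; not the patch code itself.
 */
static inline void wrprotect_and_clean(struct vm_area_struct *vma,
				       unsigned long addr, pte_t *ptep)
{
	/*
	 * Make the PTE non-present and flush the TLB.  After this,
	 * no CPU can keep writing through a stale W=1 entry, and no
	 * page walker can set _PAGE_DIRTY_HW behind our back.
	 */
	pte_t pte = ptep_clear_flush(vma, addr, ptep);

	/* Now the bits can be rewritten with nothing racing against us. */
	pte = pte_wrprotect(pte);
	pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);

	/* Install the final read-only, software-dirty PTE. */
	set_pte_at(vma->vm_mm, addr, ptep, pte);
}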
On Thu, Jun 7, 2018 at 11:23 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> On 06/07/2018 09:24 AM, Andy Lutomirski wrote:
> >> +static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
> >> +					unsigned long addr, pte_t *ptep)
> >> +{
> >> +	bool rw;
> >> +
> >> +	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> >> +	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
> >> +		struct mm_struct *mm = vma->vm_mm;
> >> +		pte_t pte;
> >> +
> >> +		if (rw && (atomic_read(&mm->mm_users) > 1))
> >> +			pte = ptep_clear_flush(vma, addr, ptep);
> > Why are you clearing the pte?
>
> I think I insisted on this being in there.
>
> First of all, we need to flush the TLB eventually because we need the
> shadowstack PTE permissions to be in effect.
>
> But, generally, we can't clear a dirty bit in a "live" PTE without
> flushing.  The processor can keep writing until we flush, and even keep
> setting it whenever _it_ allows a write, which it can do based on stale
> TLB contents.  Practically, I think a walk to set the dirty bit is
> mostly the same as a TLB miss, but that's certainly not guaranteed
> forever.
>
> That's even ignoring all the fun errata we have.

Maybe make it a separate patch, then?  It seems like this patch is doing
two almost entirely separate things: adding some flushes and adding the
magic hwdirty -> swdirty transition.  As it stands, it's quite difficult
to read the patch and figure out what it does.
On 06/07/2018 09:24 AM, Andy Lutomirski wrote:
>> +static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	bool rw;
>> +
>> +	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>> +	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
>> +		struct mm_struct *mm = vma->vm_mm;
>> +		pte_t pte;
>> +
>> +		if (rw && (atomic_read(&mm->mm_users) > 1))
>> +			pte = ptep_clear_flush(vma, addr, ptep);
> Why are you clearing the pte?

I found my notes on the subject. :)

Here's the sequence that causes the problem.  This could happen any time
we try to take a PTE from read-write to read-only.  P==Present, W=Write,
D=Dirty:

CPU0 does a write, sees PTE with P=1,W=1,D=0
CPU0 decides to set D=1
CPU1 comes in and sets W=0
CPU0 does locked operation to set D=1
CPU0 sees P=1,W=0,D=0
CPU0 sets back P=1,W=0,D=1
CPU0 loads P=1,W=0,D=1 into the TLB
CPU0 attempts to continue the write, but sees W=0 in the TLB and a #PF
is generated because of the write fault.

The problem with this is that we end up with a shadowstack-PTE
(Write=0,Dirty=1) where we didn't want one.  This, unfortunately,
imposes extra TLB flushing overhead on the R/W->R/O transitions that
does not exist before shadowstack enabling.

Yu-cheng, could you please add this to the patch description?
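The invariant at stake in these notes can be spelled out as a predicate.
The following hypothetical helper is for illustration only: pte_flags(),
_PAGE_PRESENT and _PAGE_RW are existing x86 definitions, _PAGE_DIRTY_HW
comes from this series, and the helper itself is not in the patch.

/*
 * Hypothetical predicate: with shadow stacks enabled, a present PTE
 * with Write=0 and hardware Dirty=1 *is* a shadow stack PTE.  The
 * race above manufactures exactly this combination, by accident, on
 * an ordinary data page.
 */
static inline bool pte_is_shadow_stack(pte_t pte)
{
	pteval_t f = pte_flags(pte);

	return (f & _PAGE_PRESENT) && !(f & _PAGE_RW) &&
	       (f & _PAGE_DIRTY_HW);
}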
On Thu, 2018-06-07 at 13:29 -0700, Dave Hansen wrote:
> On 06/07/2018 09:24 AM, Andy Lutomirski wrote:
>
> >> +static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
> >> +					unsigned long addr, pte_t *ptep)
> >> +{
> >> +	bool rw;
> >> +
> >> +	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> >> +	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
> >> +		struct mm_struct *mm = vma->vm_mm;
> >> +		pte_t pte;
> >> +
> >> +		if (rw && (atomic_read(&mm->mm_users) > 1))
> >> +			pte = ptep_clear_flush(vma, addr, ptep);
> > Why are you clearing the pte?
>
> I found my notes on the subject. :)
>
> Here's the sequence that causes the problem.  This could happen any time
> we try to take a PTE from read-write to read-only.  P==Present, W=Write,
> D=Dirty:
>
> CPU0 does a write, sees PTE with P=1,W=1,D=0
> CPU0 decides to set D=1
> CPU1 comes in and sets W=0
> CPU0 does locked operation to set D=1
> CPU0 sees P=1,W=0,D=0
> CPU0 sets back P=1,W=0,D=1
> CPU0 loads P=1,W=0,D=1 into the TLB
> CPU0 attempts to continue the write, but sees W=0 in the TLB and a #PF
> is generated because of the write fault.
>
> The problem with this is that we end up with a shadowstack-PTE
> (Write=0,Dirty=1) where we didn't want one.  This, unfortunately,
> imposes extra TLB flushing overhead on the R/W->R/O transitions that
> does not exist before shadowstack enabling.
>
> Yu-cheng, could you please add this to the patch description?

I will add that.
On Thu, Jun 7, 2018 at 1:30 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> On 06/07/2018 09:24 AM, Andy Lutomirski wrote:
> >> +static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
> >> +					unsigned long addr, pte_t *ptep)
> >> +{
> >> +	bool rw;
> >> +
> >> +	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
> >> +	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
> >> +		struct mm_struct *mm = vma->vm_mm;
> >> +		pte_t pte;
> >> +
> >> +		if (rw && (atomic_read(&mm->mm_users) > 1))
> >> +			pte = ptep_clear_flush(vma, addr, ptep);
> > Why are you clearing the pte?
>
> I found my notes on the subject. :)
>
> Here's the sequence that causes the problem.  This could happen any time
> we try to take a PTE from read-write to read-only.  P==Present, W=Write,
> D=Dirty:
>
> CPU0 does a write, sees PTE with P=1,W=1,D=0
> CPU0 decides to set D=1
> CPU1 comes in and sets W=0
> CPU0 does locked operation to set D=1
> CPU0 sees P=1,W=0,D=0
> CPU0 sets back P=1,W=0,D=1
> CPU0 loads P=1,W=0,D=1 into the TLB
> CPU0 attempts to continue the write, but sees W=0 in the TLB and a #PF
> is generated because of the write fault.
>
> The problem with this is that we end up with a shadowstack-PTE
> (Write=0,Dirty=1) where we didn't want one.  This, unfortunately,
> imposes extra TLB flushing overhead on the R/W->R/O transitions that
> does not exist before shadowstack enabling.
>

So what exactly do the architects want the OS to do?  AFAICS the only
valid ways to clear the dirty bit are:

--- Choice 1 ---

a) Set P=0.
b) Flush using an IPI
c) Read D (so we know if the page was actually dirty)
d) Set P=1,W=0,D=0

and we need to handle spurious faults that happen between steps (a)
and (c).  This isn't so easy because the straightforward "is the fault
spurious" check is going to think it's *not* spurious.

--- Choice 2 ---

a) Set W=0
b) flush
c) Test and clear D

and we need to handle the spurious fault between b and c.  At least this
particular spurious fault is easier to handle since we can check the
error code.

But surely the right solution is to get the architecture team to see if
they can fix the dirty-bit-setting algorithm or, even better, to look
and see if the dirty-bit-setting algorithm is *already* better and just
document it.  If the CPU does a locked set-bit on D in your example, the
CPU is just being silly.  The CPU should make the whole operation fully
atomic: when trying to write to a page that's D=0 in the TLB, it should
re-walk the page tables and, atomically, load the PTE and, if it's
W=1,D=0, set D=1.  I'd honestly be a bit surprised if modern CPUs don't
already do this.

(Hmm.  If the CPUs were that smart, then we wouldn't need a flush at all
in some cases.  If we lock cmpxchg to change W=1,D=0 to W=0,D=0, then we
know that no other CPU can subsequently write the page without
re-walking, and we don't need to flush.)

Can you ask the architecture folks to clarify the situation?  And, if
your notes are indeed correct, don't we need code to handle spurious
faults?

--Andy
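Andy's parenthetical sketches a flush-free downgrade that only works if
the CPU's dirty-bit walk is fully atomic, as he hypothesizes. A
thought-experiment version of it follows; the helper and the premise are
assumptions, not anything in the patch, while cmpxchg() and READ_ONCE()
are the usual kernel primitives.

/*
 * Thought experiment only: if the page walker re-walks and re-checks
 * W atomically when setting D, then winning this cmpxchg on a
 * W=1,D=0 PTE guarantees no other CPU can dirty the page afterwards
 * without faulting, so no TLB flush would be needed.
 */
static inline bool try_wrprotect_clean(pte_t *ptep)
{
	pteval_t old = READ_ONCE(ptep->pte);

	/* Only the W=1,D(hw)=0 state can be downgraded flush-free. */
	if ((old & (_PAGE_RW | _PAGE_DIRTY_HW)) != _PAGE_RW)
		return false;

	/* Atomically flip W=1,D=0 to W=0,D=0, or report a lost race. */
	return cmpxchg(&ptep->pte, old, old & ~_PAGE_RW) == old;
}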
On 06/07/2018 05:59 PM, Andy Lutomirski wrote:
> Can you ask the architecture folks to clarify the situation?  And, if
> your notes are indeed correct, don't we need code to handle spurious
> faults?

I'll double check that I didn't misunderstand the situation and that it
has not changed on processors with shadow stacks.

But, as far as spurious faults, wouldn't it just be a fault because
we've transiently gone to Present=0?  We already do that when clearing
the Dirty bit, so I'm not sure that's new.  We surely already handle
that one.
Hi Yu-cheng,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on asm-generic/master]
[also build test ERROR on v4.17 next-20180607]
[cannot apply to tip/x86/core mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Yu-cheng-Yu/Control-Flow-Enforcement-Part-2/20180608-111152
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git master
config: sparc64-allmodconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sparc64

All error/warnings (new ones prefixed by >>):

   In file included from arch/sparc/include/asm/pgtable_64.h:1064:0,
                    from arch/sparc/include/asm/pgtable.h:5,
                    from include/linux/memremap.h:8,
                    from include/linux/mm.h:27,
                    from mm//swap.c:16:
   include/asm-generic/pgtable.h: In function 'huge_ptep_set_wrorptect_flush':
>> include/asm-generic/pgtable.h:129:2: error: implicit declaration of function 'huge_ptep_set_wrprotect'; did you mean 'huge_ptep_set_wrorptect_flush'? [-Werror=implicit-function-declaration]
     huge_ptep_set_wrprotect(vma->vm_mm, addr, ptep);
     ^~~~~~~~~~~~~~~~~~~~~~~
     huge_ptep_set_wrorptect_flush
   In file included from include/linux/hugetlb.h:445:0,
                    from mm//swap.c:35:
   arch/sparc/include/asm/hugetlb.h: At top level:
>> arch/sparc/include/asm/hugetlb.h:59:20: warning: conflicting types for 'huge_ptep_set_wrprotect'
    static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
                       ^~~~~~~~~~~~~~~~~~~~~~~
>> arch/sparc/include/asm/hugetlb.h:59:20: error: static declaration of 'huge_ptep_set_wrprotect' follows non-static declaration
   In file included from arch/sparc/include/asm/pgtable_64.h:1064:0,
                    from arch/sparc/include/asm/pgtable.h:5,
                    from include/linux/memremap.h:8,
                    from include/linux/mm.h:27,
                    from mm//swap.c:16:
   include/asm-generic/pgtable.h:129:2: note: previous implicit declaration of 'huge_ptep_set_wrprotect' was here
     huge_ptep_set_wrprotect(vma->vm_mm, addr, ptep);
     ^~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +129 include/asm-generic/pgtable.h

   123
   124	#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT_FLUSH
   125	static inline void huge_ptep_set_wrorptect_flush(struct vm_area_struct *vma,
   126							 unsigned long addr,
   127							 pte_t *ptep)
   128	{
 > 129		huge_ptep_set_wrprotect(vma->vm_mm, addr, ptep);
   130	}
   131	#endif
   132

---
0-DAY kernel test infrastructure              Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
Hi Yu-cheng,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on asm-generic/master]
[also build test ERROR on v4.17 next-20180607]
[cannot apply to tip/x86/core mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Yu-cheng-Yu/Control-Flow-Enforcement-Part-2/20180608-111152
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git master
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 8.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa

All errors (new ones prefixed by >>):

   In file included from arch/xtensa/include/asm/pgtable.h:444,
                    from include/linux/memremap.h:8,
                    from include/linux/mm.h:27,
                    from include/linux/pid_namespace.h:7,
                    from include/linux/ptrace.h:10,
                    from arch/xtensa/kernel/asm-offsets.c:21:
   include/asm-generic/pgtable.h: In function 'huge_ptep_set_wrorptect_flush':
>> include/asm-generic/pgtable.h:129:2: error: implicit declaration of function 'huge_ptep_set_wrprotect'; did you mean 'ptep_set_wrprotect'? [-Werror=implicit-function-declaration]
     huge_ptep_set_wrprotect(vma->vm_mm, addr, ptep);
     ^~~~~~~~~~~~~~~~~~~~~~~
     ptep_set_wrprotect
   cc1: some warnings being treated as errors
   make[2]: *** [arch/xtensa/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +129 include/asm-generic/pgtable.h

   123
   124	#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT_FLUSH
   125	static inline void huge_ptep_set_wrorptect_flush(struct vm_area_struct *vma,
   126							 unsigned long addr,
   127							 pte_t *ptep)
   128	{
 > 129		huge_ptep_set_wrprotect(vma->vm_mm, addr, ptep);
   130	}
   131	#endif
   132

---
0-DAY kernel test infrastructure              Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                 Intel Corporation
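Both robot reports trace back to the same two defects in the new
asm-generic fallback: the function name is misspelled ("wrorptect"),
and huge_ptep_set_wrprotect() is not declared anywhere visible from
asm-generic/pgtable.h on these architectures. A corrected stub would
plausibly read as follows; the exact placement, somewhere
huge_ptep_set_wrprotect() is already in scope (e.g. via
linux/hugetlb.h), is an assumption.

#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT_FLUSH
/* Generic fallback: no extra flushing needed without shadow stacks. */
static inline void huge_ptep_set_wrprotect_flush(struct vm_area_struct *vma,
						 unsigned long addr,
						 pte_t *ptep)
{
	huge_ptep_set_wrprotect(vma->vm_mm, addr, ptep);
}
#endif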
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0996f8a6979a..1053b940b35c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1148,11 +1148,27 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 	return pte;
 }
 
-#define __HAVE_ARCH_PTEP_SET_WRPROTECT
-static inline void ptep_set_wrprotect(struct mm_struct *mm,
-				      unsigned long addr, pte_t *ptep)
-{
-	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
+#define __HAVE_ARCH_PTEP_SET_WRPROTECT_FLUSH
+extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
+			      unsigned long address,
+			      pte_t *ptep);
+static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
+					    unsigned long addr, pte_t *ptep)
+{
+	bool rw;
+
+	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
+	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
+		struct mm_struct *mm = vma->vm_mm;
+		pte_t pte;
+
+		if (rw && (atomic_read(&mm->mm_users) > 1))
+			pte = ptep_clear_flush(vma, addr, ptep);
+		else
+			pte = *ptep;
+		pte = pte_move_flags(pte, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
+		set_pte_at(mm, addr, ptep, pte);
+	}
 }
 
 #define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
@@ -1198,11 +1214,33 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 	return native_pudp_get_and_clear(pudp);
 }
 
-#define __HAVE_ARCH_PMDP_SET_WRPROTECT
-static inline void pmdp_set_wrprotect(struct mm_struct *mm,
-				      unsigned long addr, pmd_t *pmdp)
+#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT_FLUSH
+static inline void huge_ptep_set_wrprotect_flush(struct vm_area_struct *vma,
+						 unsigned long addr, pte_t *ptep)
 {
-	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+	ptep_set_wrprotect_flush(vma, addr, ptep);
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT_FLUSH
+extern pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
+				   unsigned long address,
+				   pmd_t *pmdp);
+static inline void pmdp_set_wrprotect_flush(struct vm_area_struct *vma,
+					    unsigned long addr, pmd_t *pmdp)
+{
+	bool rw;
+
+	rw = test_and_clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp);
+	if (IS_ENABLED(CONFIG_X86_INTEL_SHADOW_STACK_USER)) {
+		struct mm_struct *mm = vma->vm_mm;
+		pmd_t pmd;
+
+		if (rw && (atomic_read(&mm->mm_users) > 1))
+			pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
+		else
+			pmd = *pmdp;
+		pmd = pmd_move_flags(pmd, _PAGE_DIRTY_HW, _PAGE_DIRTY_SW);
+		set_pmd_at(mm, addr, pmdp, pmd);
+	}
 }
 
 #define pud_write pud_write
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 3f6f998509f0..9bcfdfc045bb 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -121,6 +121,15 @@ static inline int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT_FLUSH
+static inline void huge_ptep_set_wrorptect_flush(struct vm_area_struct *vma,
+						 unsigned long addr,
+						 pte_t *ptep)
+{
+	huge_ptep_set_wrprotect(vma->vm_mm, addr, ptep);
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
@@ -226,6 +235,15 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 }
 #endif
 
+#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT_FLUSH
+static inline void ptep_set_wrprotect_flush(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pte_t *ptep)
+{
+	ptep_set_wrprotect(vma->vm_mm, address, ptep);
+}
+#endif
+
 #ifndef pte_savedwrite
 #define pte_savedwrite pte_write
 #endif
@@ -266,6 +284,14 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
+#ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT_FLUSH
+static inline void pmdp_set_wrprotect_flush(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pmd_t *pmdp)
+{
+	pmdp_set_wrprotect(vma->vm_mm, address, pmdp);
+}
+#endif
 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline void pudp_set_wrprotect(struct mm_struct *mm,
The function ptep_set_wrprotect()/huge_ptep_set_wrprotect() is used by
copy_page_range()/copy_hugetlb_page_range() to copy PTEs.

On x86, when the shadow stack is enabled, only a shadow stack PTE has
the read-only and _PAGE_DIRTY_HW combination.  Upon making a dirty PTE
read-only, we move its _PAGE_DIRTY_HW to _PAGE_DIRTY_SW.

When ptep_set_wrprotect() moves _PAGE_DIRTY_HW to _PAGE_DIRTY_SW, if
the PTE is writable and the mm is shared, another task could race to
set _PAGE_DIRTY_HW again.

Introduce ptep_set_wrprotect_flush(), pmdp_set_wrprotect_flush(), and
huge_ptep_set_wrprotect_flush() to make sure this does not happen.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
---
 arch/x86/include/asm/pgtable.h | 56 +++++++++++++++++++++++++++++++++++-------
 include/asm-generic/pgtable.h  | 26 ++++++++++++++++++++
 2 files changed, 73 insertions(+), 9 deletions(-)
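For context, the caller named in this description is the fork-time COW
path. A simplified sketch of how that call site would use the new
helper, abbreviated from the copy_one_pte() pattern in mm/memory.c; the
actual call-site conversion is not part of the diff above, so treat
this as an assumption about the series' intent.

/*
 * Abbreviated sketch of the fork-time COW path; the surrounding
 * copy_one_pte() code is elided.
 */
if (is_cow_mapping(vm_flags) && pte_write(pte)) {
	/*
	 * Write-protect the source PTE.  With shadow stacks, this
	 * must also move _PAGE_DIRTY_HW to _PAGE_DIRTY_SW and flush,
	 * so a racing hardware Dirty set cannot leave a stray
	 * Write=0,Dirty=1 (shadow stack) PTE behind.
	 */
	ptep_set_wrprotect_flush(vma, addr, src_pte);
	pte = pte_wrprotect(pte);
}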