diff mbox series

[v2,12/39] x86/mm: Update ptep_set_wrprotect() and pmdp_set_wrprotect() for transition from _PAGE_DIRTY to _PAGE_COW

Message ID 20220929222936.14584-13-rick.p.edgecombe@intel.com (mailing list archive)
State New
Headers show
Series Shadowstacks for userspace | expand

Commit Message

Rick Edgecombe Sept. 29, 2022, 10:29 p.m. UTC
From: Yu-cheng Yu <yu-cheng.yu@intel.com>

When Shadow Stack is in use, Write=0,Dirty=1 PTE are reserved for shadow
stack. Copy-on-write PTes then have Write=0,Cow=1.

When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
become a transient shadow stack PTE in two cases:

The first case is that some processors can start a write but end up seeing
a Write=0 PTE by the time they get to the Dirty bit, creating a transient
shadow stack PTE. However, this will not occur on processors supporting
Shadow Stack, and a TLB flush is not necessary.

The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
atomically, a transient shadow stack PTE can be created as a result.
Thus, prevent that with cmpxchg.

Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
insights to the issue.  Jann Horn provided the cmpxchg solution.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Compile out some code due to clang build error
 - Clarify commit log (dhansen)
 - Normalize PTE bit descriptions between patches (dhansen)
 - Update comment with text from (dhansen)

Yu-cheng v30:
 - Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELES > 2 (Borislav Petkov).

 arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Comments

Kirill A . Shutemov Oct. 3, 2022, 5:43 p.m. UTC | #1
On Thu, Sep 29, 2022 at 03:29:09PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2f2963429f48..58c7bf9d7392 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>  static inline void ptep_set_wrprotect(struct mm_struct *mm,
>  				      unsigned long addr, pte_t *ptep)
>  {
> +#ifdef CONFIG_X86_SHADOW_STACK
> +	/*
> +	 * Avoid accidentally creating shadow stack PTEs
> +	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
> +	 * the hardware setting Dirty=1.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pte_t old_pte, new_pte;
> +
> +		old_pte = READ_ONCE(*ptep);
> +		do {
> +			new_pte = pte_wrprotect(old_pte);
> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> +		return;
> +	}
> +#endif
>  	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
>  }

Okay, this addresses my previous question. The need in cmpxchg is
unfortunate, but well.
Nadav Amit Oct. 3, 2022, 6:11 p.m. UTC | #2
On Sep 29, 2022, at 3:29 PM, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> When Shadow Stack is in use, Write=0,Dirty=1 PTE are reserved for shadow
> stack. Copy-on-write PTes then have Write=0,Cow=1.
> 
> When a PTE goes from Write=1,Dirty=1 to Write=0,Cow=1, it could
> become a transient shadow stack PTE in two cases:
> 
> The first case is that some processors can start a write but end up seeing
> a Write=0 PTE by the time they get to the Dirty bit, creating a transient
> shadow stack PTE. However, this will not occur on processors supporting
> Shadow Stack, and a TLB flush is not necessary.
> 
> The second case is that when _PAGE_DIRTY is replaced with _PAGE_COW non-
> atomically, a transient shadow stack PTE can be created as a result.
> Thus, prevent that with cmpxchg.
> 
> Dave Hansen, Jann Horn, Andy Lutomirski, and Peter Zijlstra provided many
> insights to the issue.  Jann Horn provided the cmpxchg solution.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> 
> v2:
> - Compile out some code due to clang build error
> - Clarify commit log (dhansen)
> - Normalize PTE bit descriptions between patches (dhansen)
> - Update comment with text from (dhansen)
> 
> Yu-cheng v30:
> - Replace (pmdval_t) cast with CONFIG_PGTABLE_LEVELES > 2 (Borislav Petkov).
> 
> arch/x86/include/asm/pgtable.h | 36 ++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2f2963429f48..58c7bf9d7392 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1287,6 +1287,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> static inline void ptep_set_wrprotect(struct mm_struct *mm,
> 				      unsigned long addr, pte_t *ptep)
> {
> +#ifdef CONFIG_X86_SHADOW_STACK
> +	/*
> +	 * Avoid accidentally creating shadow stack PTEs
> +	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
> +	 * the hardware setting Dirty=1.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
> +		pte_t old_pte, new_pte;
> +
> +		old_pte = READ_ONCE(*ptep);
> +		do {
> +			new_pte = pte_wrprotect(old_pte);
> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
> +
> +		return;
> +	}
> +#endif

There is no way of using IS_ENABLED() here instead of these ifdefs?

Did you have a look at ptep_set_access_flags() and friends and checked they
do not need to be changed too? Perhaps you should at least add some
assertion just to ensure nothing breaks.
Dave Hansen Oct. 3, 2022, 6:51 p.m. UTC | #3
On 10/3/22 11:11, Nadav Amit wrote:
>> +#ifdef CONFIG_X86_SHADOW_STACK
>> +	/*
>> +	 * Avoid accidentally creating shadow stack PTEs
>> +	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
>> +	 * the hardware setting Dirty=1.
>> +	 */
>> +	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
>> +		pte_t old_pte, new_pte;
>> +
>> +		old_pte = READ_ONCE(*ptep);
>> +		do {
>> +			new_pte = pte_wrprotect(old_pte);
>> +		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
>> +
>> +		return;
>> +	}
>> +#endif
> There is no way of using IS_ENABLED() here instead of these ifdefs?

Actually, both the existing #ifdef and an IS_ENABLED() check would be
is superfluous as-is.

Adding X86_FEATURE_SHSTK disabled-features.h gives cpu_feature_enabled()
compile-time optimizations for free.  No need for *any* additional
CONFIG_* checks.

The only issue would be if the #ifdef'd code won't even compile with
X86_FEATURE_SHSTK disabled.
Rick Edgecombe Oct. 3, 2022, 10:28 p.m. UTC | #4
On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
> Did you have a look at ptep_set_access_flags() and friends and
> checked they
> do not need to be changed too? 

ptep_set_access_flags() doesn't actually set any additional dirty bits
on x86, so I think it's ok.

> Perhaps you should at least add some
> assertion just to ensure nothing breaks.

You mean in ptep_set_access_flags()? I think some assertions would be
really great, I'm just not sure where.
Nadav Amit Oct. 3, 2022, 11:17 p.m. UTC | #5
On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:

> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>> Did you have a look at ptep_set_access_flags() and friends and
>> checked they
>> do not need to be changed too? 
> 
> ptep_set_access_flags() doesn't actually set any additional dirty bits
> on x86, so I think it's ok.

Are you sure about that? (lost my confidence today so I am hesitant).

Looking on insert_pfn(), I see:

                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                        if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...

This appears to set the dirty bit while potentially leaving the write-bit
clear. This is the scenario you want to avoid, no?

>> Perhaps you should at least add some
>> assertion just to ensure nothing breaks.
> 
> You mean in ptep_set_access_flags()? I think some assertions would be
> really great, I'm just not sure where.

Yes, on x86’s version of the function.
Nadav Amit Oct. 3, 2022, 11:20 p.m. UTC | #6
On Oct 3, 2022, at 4:17 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
>> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>>> Did you have a look at ptep_set_access_flags() and friends and
>>> checked they
>>> do not need to be changed too? 
>> 
>> ptep_set_access_flags() doesn't actually set any additional dirty bits
>> on x86, so I think it's ok.
> 
> Are you sure about that? (lost my confidence today so I am hesitant).
> 
> Looking on insert_pfn(), I see:
> 
>                        entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>                        if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...
> 
> This appears to set the dirty bit while potentially leaving the write-bit
> clear. This is the scenario you want to avoid, no?

No. I am not paying attention. Ignore.
Nadav Amit Oct. 3, 2022, 11:25 p.m. UTC | #7
On Oct 3, 2022, at 4:20 PM, Nadav Amit <nadav.amit@gmail.com> wrote:

> On Oct 3, 2022, at 4:17 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
>> On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>> 
>>> On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
>>>> Did you have a look at ptep_set_access_flags() and friends and
>>>> checked they
>>>> do not need to be changed too? 
>>> 
>>> ptep_set_access_flags() doesn't actually set any additional dirty bits
>>> on x86, so I think it's ok.
>> 
>> Are you sure about that? (lost my confidence today so I am hesitant).
>> 
>> Looking on insert_pfn(), I see:
>> 
>>                       entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>>                       if (ptep_set_access_flags(vma, addr, pte, entry, 1)) ...
>> 
>> This appears to set the dirty bit while potentially leaving the write-bit
>> clear. This is the scenario you want to avoid, no?
> 
> No. I am not paying attention. Ignore.

Sorry for the spam. Just this “dirty” argument is confusing. This indeed
seems like a flow that can set the dirty bit. I think.
Rick Edgecombe Oct. 3, 2022, 11:38 p.m. UTC | #8
On Mon, 2022-10-03 at 16:25 -0700, Nadav Amit wrote:
> On Oct 3, 2022, at 4:20 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> > On Oct 3, 2022, at 4:17 PM, Nadav Amit <nadav.amit@gmail.com>
> > wrote:
> > 
> > > On Oct 3, 2022, at 3:28 PM, Edgecombe, Rick P <
> > > rick.p.edgecombe@intel.com> wrote:
> > > 
> > > > On Mon, 2022-10-03 at 11:11 -0700, Nadav Amit wrote:
> > > > > Did you have a look at ptep_set_access_flags() and friends
> > > > > and
> > > > > checked they
> > > > > do not need to be changed too? 
> > > > 
> > > > ptep_set_access_flags() doesn't actually set any additional
> > > > dirty bits
> > > > on x86, so I think it's ok.
> > > 
> > > Are you sure about that? (lost my confidence today so I am
> > > hesitant).
> > > 
> > > Looking on insert_pfn(), I see:
> > > 
> > >                        entry = maybe_mkwrite(pte_mkdirty(entry),
> > > vma);
> > >                        if (ptep_set_access_flags(vma, addr, pte,
> > > entry, 1)) ...
> > > 
> > > This appears to set the dirty bit while potentially leaving the
> > > write-bit
> > > clear. This is the scenario you want to avoid, no?
> > 
> > No. I am not paying attention. Ignore.
> 
> Sorry for the spam. Just this “dirty” argument is confusing. This
> indeed
> seems like a flow that can set the dirty bit. I think.

I think the HW dirty bit will not be set here. How it works is,
pte_mkdirty() will not actually set the HW dirty bit, but instead the
software COW bit. Here is the relevant snippet:

static inline pte_t pte_mkdirty(pte_t pte)
{
	pteval_t dirty = _PAGE_DIRTY;

	/* Avoid creating Dirty=1,Write=0 PTEs */
	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
		dirty = _PAGE_COW;

	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
}

So for a !VM_WRITE vma, you end up with Write=0,Cow=1 PTE passed
into ptep_set_access_flags(). Does it make sense?
Nadav Amit Oct. 4, 2022, 12:40 a.m. UTC | #9
On Oct 3, 2022, at 4:38 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:

> I think the HW dirty bit will not be set here. How it works is,
> pte_mkdirty() will not actually set the HW dirty bit, but instead the
> software COW bit. Here is the relevant snippet:
> 
> static inline pte_t pte_mkdirty(pte_t pte)
> {
> 	pteval_t dirty = _PAGE_DIRTY;
> 
> 	/* Avoid creating Dirty=1,Write=0 PTEs */
> 	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
> 		dirty = _PAGE_COW;
> 
> 	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
> }
> 
> So for a !VM_WRITE vma, you end up with Write=0,Cow=1 PTE passed
> into ptep_set_access_flags(). Does it make sense?

Thanks for your patience with me. I should have read the series in order.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2f2963429f48..58c7bf9d7392 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1287,6 +1287,23 @@  static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pte_t *ptep)
 {
+#ifdef CONFIG_X86_SHADOW_STACK
+	/*
+	 * Avoid accidentally creating shadow stack PTEs
+	 * (Write=0,Dirty=1).  Use cmpxchg() to prevent races with
+	 * the hardware setting Dirty=1.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pte_t old_pte, new_pte;
+
+		old_pte = READ_ONCE(*ptep);
+		do {
+			new_pte = pte_wrprotect(old_pte);
+		} while (!try_cmpxchg(&ptep->pte, &old_pte.pte, new_pte.pte));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
 }
 
@@ -1339,6 +1356,25 @@  static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 				      unsigned long addr, pmd_t *pmdp)
 {
+#ifdef CONFIG_X86_SHADOW_STACK
+	/*
+	 * If Shadow Stack is enabled, pmd_wrprotect() moves _PAGE_DIRTY
+	 * to _PAGE_COW (see comments at pmd_wrprotect()).
+	 * When a thread reads a RW=1, Dirty=0 PMD and before changing it
+	 * to RW=0, Dirty=0, another thread could have written to the page
+	 * and the PMD is RW=1, Dirty=1 now.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pmd_t old_pmd, new_pmd;
+
+		old_pmd = READ_ONCE(*pmdp);
+		do {
+			new_pmd = pmd_wrprotect(old_pmd);
+		} while (!try_cmpxchg(&pmdp->pmd, &old_pmd.pmd, new_pmd.pmd));
+
+		return;
+	}
+#endif
 	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
 }