[v2,10/39] x86/mm: Introduce _PAGE_COW

Message ID 20220929222936.14584-11-rick.p.edgecombe@intel.com
State New
Series Shadowstacks for userspace

Commit Message

Edgecombe, Rick P Sept. 29, 2022, 10:29 p.m. UTC
From: Yu-cheng Yu <yu-cheng.yu@intel.com>

There is essentially no room left in the x86 hardware PTEs on some OSes
(not Linux). That left the hardware architects looking for a way to
represent a new memory type (shadow stack) within the existing bits.
They chose to repurpose a lightly-used state: Write=0,Dirty=1.

The reason it's lightly used is that Dirty=1 is normally set _before_ a
write. A write with a Write=0 PTE would typically only generate a fault,
not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
stacks will no longer exhibit this oddity.

The kernel should avoid inadvertently creating shadow stack memory because
it is security sensitive. So given the above, all it needs to do is avoid
manually creating Write=0,Dirty=1 PTEs in software.

In places where Linux normally creates Write=0,Dirty=1, it can use the
software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1. This
clearly separates shadow stack from other data, and results in the
following:

(a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
    Previously when a typical anonymous writable mapping was made COW via
    fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead
    use the Cow bit.
(b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
    is in a R/O VMA, and get_user_pages() needs a writable copy. The page
    fault handler creates a copy of the page and sets the new copy's PTE
    as Write=0 and Cow=1.
(c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
(d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack
    page is being shared among processes (this happens at fork()), its PTE
    is made Dirty=0, so the next shadow stack access causes a fault, and
    the page is duplicated and Dirty=1 is set again. This is the COW
    equivalent for shadow stack pages, even though it's copy-on-access
    rather than copy-on-write.
(e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without
    shadow stack support set Dirty=1.

Define _PAGE_COW and update pte_*() helpers and apply the same changes to
pmd and pud.

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_COW. No space is consumed in 32-bit kernels
because shadow stacks are not enabled there.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---

v2:
 - Update commit log with comments (Dave Hansen)
 - Add comments in code to explain pte modification code better (Dave)
 - Clarify info on the meaning of various Write,Cow,Dirty combinations

 arch/x86/include/asm/pgtable.h       | 210 ++++++++++++++++++++++++---
 arch/x86/include/asm/pgtable_types.h |  42 +++++-
 2 files changed, 231 insertions(+), 21 deletions(-)
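
For readers following along, the Write/Dirty/Cow combinations listed in the commit message can be sketched as a small classifier. This is only an illustration under stated assumptions: the sk_* names, the enum, and the bit positions (in particular the software Cow bit) are made up for the example and are not copied from the patch. The kernel's own helpers are pte_dirty(), pte_write() and pte_shstk().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define SK_PAGE_PRESENT	(UINT64_C(1) << 0)
#define SK_PAGE_RW	(UINT64_C(1) << 1)	/* hardware Write bit */
#define SK_PAGE_DIRTY	(UINT64_C(1) << 6)	/* hardware Dirty bit */
#define SK_PAGE_COW	(UINT64_C(1) << 58)	/* hypothetical software bit */

enum sk_pte_kind {
	SK_WRITABLE,		/* Write=1 */
	SK_SHADOW_STACK,	/* Write=0,Dirty=1 with shadow stack enabled */
	SK_RO_DIRTY,		/* read-only but logically dirty */
	SK_RO_CLEAN,		/* Write=0,Dirty=0,Cow=0 */
};

/* Classify the flags of a present PTE. */
static enum sk_pte_kind sk_classify(uint64_t flags, bool shstk_enabled)
{
	/* Like the pte_*() helpers, this only makes sense for present PTEs. */
	assert(flags & SK_PAGE_PRESENT);

	if (flags & SK_PAGE_RW)
		return SK_WRITABLE;
	if (flags & SK_PAGE_DIRTY)
		return shstk_enabled ? SK_SHADOW_STACK : SK_RO_DIRTY;
	if (flags & SK_PAGE_COW)
		return SK_RO_DIRTY;
	return SK_RO_CLEAN;
}
```

With shadow stack enabled, only the Write=0,Dirty=1 encoding maps to shadow stack, and the Cow bit carries the "dirty" meaning for ordinary read-only pages. Without it, Write=0,Dirty=1 keeps its conventional meaning.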

Comments

Jann Horn Sept. 30, 2022, 3:16 p.m. UTC | #1
On Fri, Sep 30, 2022 at 12:30 AM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
> stacks will no longer exhibit this oddity.

Stupid question, since I just recently learned that IOMMUv2 is a
thing: I assume this also holds for IOMMUs that implement IOMMUv2/SVA,
where the IOMMU directly walks the userspace page tables, and not just
for the CPU core?

Kirill A . Shutemov Oct. 3, 2022, 4:26 p.m. UTC | #2
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> +/*
> + * Normally the Dirty bit is used to denote COW memory on x86. But
> + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> + * since the Dirty=1,Write=0 will result in the memory being treated
> + * as shadow stack by the HW. So when creating COW memory, a software
> + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
> + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
> + * transition it to the shadow stack compatible version of COW (Cow=1).
> + */
> +
> +static inline pte_t pte_mkcow(pte_t pte)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return pte;
> +
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_set_flags(pte, _PAGE_COW);
> +}
> +
> +static inline pte_t pte_clear_cow(pte_t pte)
> +{
> +	/*
> +	 * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
> +	 * See the _PAGE_COW definition for more details.
> +	 */
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return pte;
> +
> +	/*
> +	 * PTE is getting copied-on-write, so it will be dirtied
> +	 * if writable, or made shadow stack if shadow stack and
> +	 * being copied on access. Set the dirty bit for both
> +	 * cases.
> +	 */
> +	pte = pte_set_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_COW);
> +}

These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the _PAGE_COW
logic for all machines with 64-bit entries. It will get you much more
coverage and more universal rules.

> +
>  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>  static inline int pte_uffd_wp(pte_t pte)
>  {
> @@ -319,7 +381,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
>  
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
>  }
>  
>  static inline pte_t pte_mkold(pte_t pte)
> @@ -329,7 +391,16 @@ static inline pte_t pte_mkold(pte_t pte)
>  
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_RW);
> +	pte = pte_clear_flags(pte, _PAGE_RW);
> +
> +	/*
> +	 * Blindly clearing _PAGE_RW might accidentally create
> +	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
> +	 * dirty value to the software bit.
> +	 */
> +	if (pte_dirty(pte))
> +		pte = pte_mkcow(pte);
> +	return pte;
>  }

Hm. What about ptep/pmdp_set_wrprotect()? They clear _PAGE_RW blindly.
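
The concern above is that an in-place helper which only clears the Write bit can leave a Write=0,Dirty=1 entry behind. One way such a helper could avoid that state is to compute the write-protected value first and publish it with a compare-and-exchange loop. The sketch below uses C11 atomics on a plain 64-bit value. All sk_* names and bit positions are assumptions for illustration, the shadow stack feature is assumed to be always enabled, and this is not presented as what the series does.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define SK_PAGE_RW	(UINT64_C(1) << 1)	/* hardware Write bit */
#define SK_PAGE_DIRTY	(UINT64_C(1) << 6)	/* hardware Dirty bit */
#define SK_PAGE_COW	(UINT64_C(1) << 58)	/* hypothetical software bit */

/* Value transformation: clear Write and move Dirty into the Cow bit. */
static uint64_t sk_wrprotect(uint64_t v)
{
	v &= ~SK_PAGE_RW;
	if (v & SK_PAGE_DIRTY) {
		v &= ~SK_PAGE_DIRTY;
		v |= SK_PAGE_COW;
	}
	return v;
}

/*
 * In-place variant: retry until the entry is swapped from a consistent
 * old value to its write-protected form, so that Write=0,Dirty=1 is
 * never stored by software.
 */
static void sk_set_wrprotect(_Atomic uint64_t *ptep)
{
	uint64_t old_val = atomic_load(ptep);
	uint64_t new_val;

	do {
		new_val = sk_wrprotect(old_val);
	} while (!atomic_compare_exchange_weak(ptep, &old_val, new_val));
}
```

On failure, atomic_compare_exchange_weak() reloads old_val with the current entry, so a Dirty bit set concurrently is picked up on the next iteration.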
Edgecombe, Rick P Oct. 3, 2022, 9:36 p.m. UTC | #3
On Mon, 2022-10-03 at 19:26 +0300, Kirill A . Shutemov wrote:
> On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > +/*
> > + * Normally the Dirty bit is used to denote COW memory on x86. But
> > + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> > + * since the Dirty=1,Write=0 will result in the memory being
> > treated
> > + * as shadow stack by the HW. So when creating COW memory, a
> > software
> > + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow()
> > and
> > + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1)
> > and
> > + * transition it to the shadow stack compatible version of COW
> > (Cow=1).
> > + */
> > +
> > +static inline pte_t pte_mkcow(pte_t pte)
> > +{
> > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +             return pte;
> > +
> > +     pte = pte_clear_flags(pte, _PAGE_DIRTY);
> > +     return pte_set_flags(pte, _PAGE_COW);
> > +}
> > +
> > +static inline pte_t pte_clear_cow(pte_t pte)
> > +{
> > +     /*
> > +      * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
> > +      * See the _PAGE_COW definition for more details.
> > +      */
> > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > +             return pte;
> > +
> > +     /*
> > +      * PTE is getting copied-on-write, so it will be dirtied
> > +      * if writable, or made shadow stack if shadow stack and
> > +      * being copied on access. Set the dirty bit for both
> > +      * cases.
> > +      */
> > +     pte = pte_set_flags(pte, _PAGE_DIRTY);
> > +     return pte_clear_flags(pte, _PAGE_COW);
> > +}
> 
> These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the
> _PAGE_COW
> logic for all machines with 64-bit entries. It will get you much more
> coverage and more universal rules.

Yes, I didn't like them either at first. The reasoning originally was
that _PAGE_COW is a bit more work and it might show up for some
benchmark.

Looking at this again though, it is just a few more operations on
memory that is already getting touched either way. It must be a very
tiny amount of impact if any. I'm fine removing them. Having just one
set of logic around this would make it easier to reason about.

Dave, any thoughts on this?

Jann Horn Oct. 3, 2022, 9:54 p.m. UTC | #4
On Mon, Oct 3, 2022 at 11:36 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Mon, 2022-10-03 at 19:26 +0300, Kirill A . Shutemov wrote:
> > On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > > +/*
> > > + * Normally the Dirty bit is used to denote COW memory on x86. But
> > > + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> > > + * since the Dirty=1,Write=0 will result in the memory being
> > > treated
> > > + * as shadow stack by the HW. So when creating COW memory, a
> > > software
> > > + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow()
> > > and
> > > + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1)
> > > and
> > > + * transition it to the shadow stack compatible version of COW
> > > (Cow=1).
> > > + */
> > > +
> > > +static inline pte_t pte_mkcow(pte_t pte)
> > > +{
> > > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > > +             return pte;
> > > +
> > > +     pte = pte_clear_flags(pte, _PAGE_DIRTY);
> > > +     return pte_set_flags(pte, _PAGE_COW);
> > > +}
> > > +
> > > +static inline pte_t pte_clear_cow(pte_t pte)
> > > +{
> > > +     /*
> > > +      * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
> > > +      * See the _PAGE_COW definition for more details.
> > > +      */
> > > +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> > > +             return pte;
> > > +
> > > +     /*
> > > +      * PTE is getting copied-on-write, so it will be dirtied
> > > +      * if writable, or made shadow stack if shadow stack and
> > > +      * being copied on access. Set the dirty bit for both
> > > +      * cases.
> > > +      */
> > > +     pte = pte_set_flags(pte, _PAGE_DIRTY);
> > > +     return pte_clear_flags(pte, _PAGE_COW);
> > > +}
> >
> > These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the
> > _PAGE_COW
> > logic for all machines with 64-bit entries. It will get you much more
> > coverage and more universal rules.
>
> Yes, I didn't like them either at first. The reasoning originally was
> that _PAGE_COW is a bit more work and it might show up for some
> benchmark.
>
> Looking at this again though, it is just a few more operations on
> memory that is already getting touched either way. It must be a very
> tiny amount of impact if any. I'm fine removing them. Having just one
> set of logic around this would make it easier to reason about.
>
> Dave, any thoughts on this?

But the rules wouldn't actually be universal - you'd still have to
look at X86_FEATURE_SHSTK in code that wants to figure out whether a
PTE is shadow stack (on a newer CPU) or readonly dirty (on an older
CPU that can set dirty bits on non-present PTEs), right?

Dave Hansen Oct. 3, 2022, 10:14 p.m. UTC | #5
On 10/3/22 14:36, Edgecombe, Rick P wrote:
>>> +static inline pte_t pte_clear_cow(pte_t pte)
>>> +{
>>> +     /*
>>> +      * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
>>> +      * See the _PAGE_COW definition for more details.
>>> +      */
>>> +     if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
>>> +             return pte;
>>> +
>>> +     /*
>>> +      * PTE is getting copied-on-write, so it will be dirtied
>>> +      * if writable, or made shadow stack if shadow stack and
>>> +      * being copied on access. Set the dirty bit for both
>>> +      * cases.
>>> +      */
>>> +     pte = pte_set_flags(pte, _PAGE_DIRTY);
>>> +     return pte_clear_flags(pte, _PAGE_COW);
>>> +}
>> These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the
>> _PAGE_COW
>> logic for all machines with 64-bit entries. It will get you much more
>> coverage and more universal rules.
> Yes, I didn't like them either at first. The reasoning originally was
> that _PAGE_COW is a bit more work and it might show up for some
> benchmark.
> 
> Looking at this again though, it is just a few more operations on
> memory that is already getting touched either way. It must be a very
> tiny amount of impact if any. I'm fine removing them. Having just one
> set of logic around this would make it easier to reason about.
> 
> Dave, any thoughts on this?

The cpu_feature_enabled(X86_FEATURE_SHSTK) checks enable both
compile-time and runtime optimization.  What makes this even more fun is:

+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW      (_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW      (_AT(pteval_t, 0))
+#endif

which I think means that the pte_clear_flags() goes away if
CONFIG_X86_SHADOW_STACK is disabled.  So, what Rick posted here ends up
doing the following with:

          | X86_FEATURE_SHSTK=1 | X86_FEATURE_SHSTK=0
==========+=====================+=======================
CONFIG=n  | compiled out        | compiled out
CONFIG=y  | set/clear           | boot-time patched out


If we pull the cpu_feature_enabled() out, I think we end up getting
behavior like this:

          | X86_FEATURE_SHSTK=1 | X86_FEATURE_SHSTK=0
==========+=====================+=======================
CONFIG=n  | set _PAGE_DIRTY     | set _PAGE_DIRTY
CONFIG=y  | set/clear           | set/clear

It ends up adding instruction overhead (set _PAGE_DIRTY) to two cases
where it completely compiled out before.  It also adds runtime overhead
(the "tiny amount of impact" you mentioned) to set/clear where it would
have runtime patched out before.

None of this is a deal breaker in terms of runtime overhead.  But, I do
think the benefits of the cpu_feature_enabled() are worth it, even if
it's just an optimization.  You could move it to the end of the series
and we can debate it on its own merits if you want.
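
The compile-time half of the argument above can be sketched in userspace C. When the Cow mask is defined as 0, clearing it is a no-op the compiler can drop, but an unconditional "set Dirty" still remains unless something guards it. Everything named sk_* or SK_* here is a stand-in chosen for the example. SK_CONFIG_SHADOW_STACK plays the role of CONFIG_X86_SHADOW_STACK and the bool parameter plays the role of cpu_feature_enabled(X86_FEATURE_SHSTK). Boot-time patching is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_DIRTY	(UINT64_C(1) << 6)

/* Stand-in for the config-dependent _PAGE_COW definition quoted above. */
#ifdef SK_CONFIG_SHADOW_STACK
#define SK_PAGE_COW	(UINT64_C(1) << 58)	/* assumed bit position */
#else
#define SK_PAGE_COW	UINT64_C(0)
#endif

static inline uint64_t sk_set_flags(uint64_t v, uint64_t set)
{
	return v | set;
}

static inline uint64_t sk_clear_flags(uint64_t v, uint64_t clear)
{
	return v & ~clear;
}

/* Shape of pte_clear_cow() with the feature check kept. */
static inline uint64_t sk_clear_cow_checked(uint64_t v, bool shstk)
{
	if (!shstk)
		return v;	/* nothing to do, value passes through */
	v = sk_set_flags(v, SK_PAGE_DIRTY);
	return sk_clear_flags(v, SK_PAGE_COW);
}

/* Shape of pte_clear_cow() with the feature check removed. */
static inline uint64_t sk_clear_cow_unconditional(uint64_t v)
{
	v = sk_set_flags(v, SK_PAGE_DIRTY);	/* survives even when the mask is 0 */
	return sk_clear_flags(v, SK_PAGE_COW);	/* no-op when the mask is 0 */
}
```

Built without SK_CONFIG_SHADOW_STACK, the unconditional variant still sets the Dirty bit, which corresponds to the "set _PAGE_DIRTY" cells in the second table.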
Andrew Cooper Oct. 5, 2022, 2:17 a.m. UTC | #6
On 29/09/2022 23:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0,Dirty=1.

How does "Some OSes have a greater dependence on software available bits
in PTEs than Linux" sound?

> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
> stacks will no longer exhibit this oddity.

Again, an interesting anecdote but not salient information here.

> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 6496ec84b953..ad201dae7316 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -134,9 +142,17 @@ static inline int pte_young(pte_t pte)
>  	return pte_flags(pte) & _PAGE_ACCESSED;
>  }
>  
> -static inline int pmd_dirty(pmd_t pmd)
> +static inline bool pmd_dirty(pmd_t pmd)
>  {
> -	return pmd_flags(pmd) & _PAGE_DIRTY;
> +	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
> +}
> +
> +static inline bool pmd_shstk(pmd_t pmd)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return false;
> +
> +	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;

(flags & PSE|RW|D) == PSE|D;

R/O+D can exist higher in the paging structures and does not convey
type=shstk-ness to later stages of the walk.


However, there is a further complication which is bound to rear its head
sooner or later, and warrants discussing.

type=shstk isn't actually only R/O+D on the leaf PTE; it's also R/W on
the accumulated access rights on non-leaf PTEs.

Specifically, if you clear the RW bit on any higher level in the
pagetable, then everything mapped by that PTE ceases to be of type
shstk, even if the leaf has the R/O+D bit combination.

This is allegedly a feature for the database folks, where they can
create R/O and R/W aliases of the same memory, sharing intermediate
pagetables, where the R/W alias will set D bits per usual and the R/O
alias needs not to transmogrify itself into a shadow stack.

~Andrew
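
The two points in the review above can be written out with explicit parentheses. A huge-page entry should only count as shadow stack when it is a leaf (PSE set) with Write=0,Dirty=1, and the mapping is only of shadow stack type when every non-leaf level on the walk keeps Write=1. This is a sketch under stated assumptions: the sk_* helpers are hypothetical, flags are passed as plain integers, and the bit positions are assumed for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define SK_PAGE_RW	(UINT64_C(1) << 1)	/* Write */
#define SK_PAGE_DIRTY	(UINT64_C(1) << 6)	/* Dirty */
#define SK_PAGE_PSE	(UINT64_C(1) << 7)	/* huge-page leaf */

/* Leaf check for a PMD-level entry: (flags & (PSE|RW|D)) == (PSE|D). */
static bool sk_pmd_leaf_shstk(uint64_t pmd_flags)
{
	const uint64_t mask = SK_PAGE_PSE | SK_PAGE_RW | SK_PAGE_DIRTY;

	return (pmd_flags & mask) == (SK_PAGE_PSE | SK_PAGE_DIRTY);
}

/*
 * Whole-walk check: upper[] holds the flags of the non-leaf entries.
 * Clearing Write at any upper level removes the shadow stack type,
 * even if the leaf itself is Write=0,Dirty=1.
 */
static bool sk_walk_is_shstk(const uint64_t *upper, size_t n, uint64_t leaf_flags)
{
	for (size_t i = 0; i < n; i++) {
		if (!(upper[i] & SK_PAGE_RW))
			return false;
	}
	return (leaf_flags & (SK_PAGE_RW | SK_PAGE_DIRTY)) == SK_PAGE_DIRTY;
}
```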
Peter Zijlstra Oct. 5, 2022, 11:33 a.m. UTC | #7
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:

Mucho confusion here:

> (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
> (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
> (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack

are all identical cases;

> (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without

as are these.

Dave Hansen Oct. 5, 2022, 2:08 p.m. UTC | #8
On 10/4/22 19:17, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
>> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>
>> There is essentially no room left in the x86 hardware PTEs on some OSes
>> (not Linux). That left the hardware architects looking for a way to
>> represent a new memory type (shadow stack) within the existing bits.
>> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> How does "Some OSes have a greater dependence on software available bits
> in PTEs than Linux" sound?
> 
>> The reason it's lightly used is that Dirty=1 is normally set _before_ a
>> write. A write with a Write=0 PTE would typically only generate a fault,
>> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
>> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
>> stacks will no longer exhibit this oddity.
> Again, an interesting anecdote but not salient information here.

As much as I like the sound of my own voice (and anecdotes), I agree
that this is a bit oblique for the patch.  Maybe this anecdote should
get banished elsewhere.

The changelog here could definitely get to the point faster.

Edgecombe, Rick P Oct. 5, 2022, 11:01 p.m. UTC | #9
On Wed, 2022-10-05 at 02:17 +0000, Andrew Cooper wrote:
> (flags & PSE|RW|D) == PSE|D;
> 
> R/O+D can exist higher in the paging structures and does not convey
> type=shstk-ness to later stages of the walk.

Hmm, yes. I guess it would be more correct to check if it's a leaf as
well.

> 
> 
> However, there is a further complication which is bound to rear its head
> sooner or later, and warrants discussing.
> 
> type=shstk isn't actually only R/O+D on the leaf PTE; it's also R/W on
> the accumulated access rights on non-leaf PTEs.
> 
> Specifically, if you clear the RW bit on any higher level in the
> pagetable, then everything mapped by that PTE ceases to be of type
> shstk, even if the leaf has the R/O+D bit combination.
> 
> This is allegedly a feature for the database folks, where they can
> create R/O and R/W aliases of the same memory, sharing intermediate
> pagetables, where the R/W alias will set D bits per usual and the R/O
> alias needs not to transmogrify itself into a shadow stack.

Thanks, I somehow missed this corner of the architecture. It looks like
this is not an issue for Linux at the moment because non-leaf PTEs
should have Write=1. I guess we need to keep this in mind if we ever
have Write=0 upper level PTEs though. Maybe a comment around
_PAGE_TABLE would be useful.

Edgecombe, Rick P Oct. 5, 2022, 11:06 p.m. UTC | #10
On Wed, 2022-10-05 at 07:08 -0700, Dave Hansen wrote:
> On 10/4/22 19:17, Andrew Cooper wrote:
> > On 29/09/2022 23:29, Rick Edgecombe wrote:
> > > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > > 
> > > There is essentially no room left in the x86 hardware PTEs on
> > > some OSes
> > > (not Linux). That left the hardware architects looking for a way
> > > to
> > > represent a new memory type (shadow stack) within the existing
> > > bits.
> > > They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> > 
> > How does "Some OSes have a greater dependence on software available
> > bits
> > in PTEs than Linux" sound?
> > 
> > > The reason it's lightly used is that Dirty=1 is normally set
> > > _before_ a
> > > write. A write with a Write=0 PTE would typically only generate a
> > > fault,
> > > not set Dirty=1. Hardware can (rarely) both set Write=1 *and*
> > > generate the
> > > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which
> > > supports shadow
> > > stacks will no longer exhibit this oddity.
> > 
> > Again, an interesting anecdote but not salient information here.
> 
> As much as I like the sound of my own voice (and anecdotes), I agree
> that this is a bit oblique for the patch.  Maybe this anecdote should
> get banished elsewhere.
> 
> The changelog here could definitely get to the point faster.

Although this text was inherited, I thought it was useful to dispel
any "huh, I wonder why" thoughts that may be lingering in the reader's
head as they try to grok the rest of the text. I'll shorten it as
suggested. Thanks all.

Edgecombe, Rick P Oct. 6, 2022, 4:10 p.m. UTC | #11
On Fri, 2022-09-30 at 17:16 +0200, Jann Horn wrote:
> On Fri, Sep 30, 2022 at 12:30 AM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > The reason it's lightly used is that Dirty=1 is normally set
> > _before_ a
> > write. A write with a Write=0 PTE would typically only generate a
> > fault,
> > not set Dirty=1. Hardware can (rarely) both set Write=1 *and*
> > generate the
> > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports
> > shadow
> > stacks will no longer exhibit this oddity.
> 
> Stupid question, since I just recently learned that IOMMUv2 is a
> thing: I assume this also holds for IOMMUs that implement
> IOMMUv2/SVA,
> where the IOMMU directly walks the userspace page tables, and not
> just
> for the CPU core?

Sorry for the delay, I had to go find out. The IOMMU behaves similarly to
the CET CPUs in this regard. Thanks for the question.

Peter Zijlstra Oct. 14, 2022, 9:41 a.m. UTC | #12
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> 
> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the

s/Write/Dirty/

> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow

s/Dirty=0,Write=1/Write=0,Dirty=1/

> stacks will no longer exhibit this oddity.
> 
> The kernel should avoid inadvertently creating shadow stack memory because
> it is security sensitive. So given the above, all it needs to do is avoid
> manually creating Write=0,Dirty=1 PTEs in software.

Whichever way around you choose, please be consistent.

> In places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
> Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1. This
> clearly separates shadow stack from other data, and results in the
> following:
> 
> (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
>     Previously when a typical anonymous writable mapping was made COW via
>     fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead
>     use the Cow bit.
> (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
>     is in a R/O VMA, and get_user_pages() needs a writable copy. The page
>     fault handler creates a copy of the page and sets the new copy's PTE
>     as Write=0 and Cow=1.
> (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack
>     page is being shared among processes (this happens at fork()), its PTE
>     is made Dirty=0, so the next shadow stack access causes a fault, and
>     the page is duplicated and Dirty=1 is set again. This is the COW
>     equivalent for shadow stack pages, even though it's copy-on-access
>     rather than copy-on-write.
> (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without
>     shadow stack support set Dirty=1.

Please restructure this (and the comment) something like:


  (Write=0,Dirty=0,Cow=1):

	- copy_present_pte(): A modified copy-on-write page.
	- ...


  (Write=0,Dirty=1,Cow=0):

	- FEATURE_CET:  Shadow Stack entry
	- !FEATURE_CET: see the above Cow=1 cases

Peter Zijlstra Oct. 14, 2022, 9:42 a.m. UTC | #13
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> @@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  	return native_make_pte(v & ~clear);
>  }
>  
> +/*
> + * Normally the Dirty bit is used to denote COW memory on x86. But

This is misleading; this isn't an x86 specific thing. The core-mm code
does this.

> + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> + * since the Dirty=1,Write=0 will result in the memory being treated
> + * as shadow stack by the HW. So when creating COW memory, a software
> + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
> + * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
> + * transition it to the shadow stack compatible version of COW (Cow=1).
> + */

Edgecombe, Rick P Oct. 14, 2022, 3:52 p.m. UTC | #14
On Fri, 2022-10-14 at 11:41 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > 
> > There is essentially no room left in the x86 hardware PTEs on some
> > OSes
> > (not Linux). That left the hardware architects looking for a way to
> > represent a new memory type (shadow stack) within the existing
> > bits.
> > They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> > 
> > The reason it's lightly used is that Dirty=1 is normally set
> > _before_ a
> > write. A write with a Write=0 PTE would typically only generate a
> > fault,
> > not set Dirty=1. Hardware can (rarely) both set Write=1 *and*
> > generate the
> 
> s/Write/Dirty/

Oops, yes.

> 
> > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports
> > shadow
> 
> s/Dirty=0,Write=1/Write=0,Dirty=1/

Ok, I'll scrub the series for the order.

> 
> > stacks will no longer exhibit this oddity.
> > 
> > The kernel should avoid inadvertently creating shadow stack memory
> > because
> > it is security sensitive. So given the above, all it needs to do is
> > avoid
> > manually creating Write=0,Dirty=1 PTEs in software.
> 
> Whichever way around you choose, please be consistent.
> 
> > In places where Linux normally creates Write=0,Dirty=1, it can use
> > the
> > software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In
> > other
> > words, whenever Linux needs to create Write=0,Dirty=1, it instead
> > creates
> > Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1.
> > This
> > clearly separates shadow stack from other data, and results in the
> > following:
> > 
> > (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
> >      Previously when a typical anonymous writable mapping was made
> > COW via
> >      fork(), the kernel would mark it Write=0,Dirty=1. Now it will
> > instead
> >      use the Cow bit.
> > (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The
> > user page
> >      is in a R/O VMA, and get_user_pages() needs a writable copy.
> > The page
> >      fault handler creates a copy of the page and sets the new
> > copy's PTE
> >      as Write=0 and Cow=1.
> > (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> > (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a
> > shadow stack
> >      page is being shared among processes (this happens at fork()),
> > its PTE
> >      is made Dirty=0, so the next shadow stack access causes a
> > fault, and
> >      the page is duplicated and Dirty=1 is set again. This is the
> > COW
> >      equivalent for shadow stack pages, even though it's copy-on-
> > access
> >      rather than copy-on-write.
> > (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor
> > without
> >      shadow stack support set Dirty=1.
> 
> Please restructure this (and the comment) something like:
> 
> 
>   (Write=0,Dirty=0,Cow=1):
> 
>         - copy_present_pte(): A modified copy-on-write page.
>         - ...
> 
> 
>   (Write=0,Dirty=1,Cow=0):
> 
>         - FEATURE_CET:  Shadow Stack entry
>         - !FEATURE_CET: see the above Cow=1 cases

Yes, I incorporated feedback from your earlier comment. Sorry for bad
communication.

Edgecombe, Rick P Oct. 14, 2022, 6:06 p.m. UTC | #15
On Fri, 2022-10-14 at 11:42 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > @@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte,
> > pteval_t clear)
> >        return native_make_pte(v & ~clear);
> >   }
> >   
> > +/*
> > + * Normally the Dirty bit is used to denote COW memory on x86. But
> 
> This is misleading; this isn't an x86 specific thing. The core-mm
> code
> does this.

Well pte_mkdirty() does map to other HW bits on different
architectures. But yea, it's confusing.

Hmm, is this comment a bit stale either way now though? In the past it
was probably more accurate to say core MM code used it to "detect"
cowed memory. But the GUP pte_dirty() check was changed recently:


https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5535be3099717646781ce1540cf725965d680e7b

I don't think any code is looking specifically for COWed memory using
the PTE dirty bit anymore, it just happens to coincide with it. Double
checking my understanding...

Maybe this would be more accurate?

/*
 * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the
 * case of X86_FEATURE_SHSTK, the software COW bit is used, since the
 * Dirty=1,Write=0 will result in the memory being treated as shadow
 * stack by the HW. So when creating COW memory, a software bit is used
 * _PAGE_BIT_COW. The following functions pte_mkcow() and
 * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
 * transition it to the shadow stack compatible version of COW (Cow=1).
 */
Patch

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6496ec84b953..ad201dae7316 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,9 +124,17 @@  extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -134,9 +142,17 @@  static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -144,9 +160,9 @@  static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -156,13 +172,21 @@  static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -300,6 +324,44 @@  static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+/*
+ * Normally the Dirty bit is used to denote COW memory on x86. But
+ * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
+ * since Dirty=1,Write=0 will result in the memory being treated
+ * as shadow stack by the HW. So when creating COW memory, the software
+ * bit _PAGE_BIT_COW is used. The following functions pte_mkcow() and
+ * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
+ * transition it to the shadow stack compatible version of COW (Cow=1).
+ */
+
+static inline pte_t pte_mkcow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+	/*
+	 * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
+	 * See the _PAGE_COW definition for more details.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	/*
+	 * PTE is getting copied-on-write, so it will be dirtied
+	 * if writable, or made shadow stack if shadow stack and
+	 * being copied on access. Set the dirty bit for both
+	 * cases.
+	 */
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -319,7 +381,7 @@  static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -329,7 +391,16 @@  static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pte_dirty(pte))
+		pte = pte_mkcow(pte);
+	return pte;
 }
 
 static inline pte_t pte_mkexec(pte_t pte)
@@ -339,7 +410,19 @@  static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pteval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating Dirty=1,Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	/* pte_clear_cow() also sets Dirty=1 */
+	return pte_clear_cow(pte);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -349,7 +432,12 @@  static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_RW);
+	pte = pte_set_flags(pte, _PAGE_RW);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_cow(pte);
+
+	return pte;
 }
 
 static inline pte_t pte_mkhuge(pte_t pte)
@@ -396,6 +484,26 @@  static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
@@ -420,17 +528,36 @@  static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pmd_dirty(pmd))
+		pmd = pmd_mkcow(pmd);
+	return pmd;
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pmd_write(pmd))
+		dirty = _PAGE_COW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd_clear_cow(pmd);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -450,7 +577,11 @@  static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_RW);
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_cow(pmd);
+	return pmd;
 }
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -467,6 +598,26 @@  static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pud_t pud_mkcow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_COW);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
@@ -474,17 +625,32 @@  static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pud_dirty(pud))
+		pud = pud_mkcow(pud);
+	return pud;
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pud_write(pud))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -504,7 +670,11 @@  static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	if (pud_dirty(pud))
+		pud = pud_clear_cow(pud);
+	return pud;
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ff82237e7b6b..85d88c0f9618 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@ 
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@ 
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a copy-on-write page.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_BIT_COW		_PAGE_BIT_SOFTW5 /* copy-on-write */
+#else
+#define _PAGE_BIT_COW		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +127,36 @@ 
 #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs
+ * from shadow stack PTEs:
+ *  (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
+ *	Previously when a typical anonymous writable mapping was made COW via
+ *	fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead
+ *	use the Cow bit.
+ *  (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
+ *	is in a R/O VMA, and get_user_pages() needs a writable copy. The page
+ *	fault handler creates a copy of the page and sets the new copy's PTE
+ *	as Write=0 and Cow=1.
+ *  (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
+ *  (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack
+ *	page is being shared among processes (this happens at fork()), its PTE
+ *	is made Dirty=0, so the next shadow stack access causes a fault, and
+ *	the page is duplicated and Dirty=1 is set again. This is the COW
+ *	equivalent for shadow stack pages, even though it's copy-on-access
+ *	rather than copy-on-write.
+ *  (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without
+ *	shadow stack support set Dirty=1.
+ */
+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW	(_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*