[v6,14/41] x86/mm: Introduce _PAGE_SAVED_DIRTY

Message ID 20230218211433.26859-15-rick.p.edgecombe@intel.com (mailing list archive)
State New
Series Shadow stacks for userspace

Commit Message

Rick Edgecombe Feb. 18, 2023, 9:14 p.m. UTC
Some OSes have a greater dependence on software available bits in PTEs than
Linux. That left the hardware architects looking for a way to represent a
new memory type (shadow stack) within the existing bits. They chose to
repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
shadow stack memory, Linux should avoid creating memory with this PTE bit
combination unless it intends for it to be shadow stack.

The reason it's lightly used is that Dirty=1 is normally set by HW
_before_ a write. A write with a Write=0 PTE would typically only generate
a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
supports shadow stacks will no longer exhibit this oddity.

So that leaves Write=0,Dirty=1 PTEs created in software. To avoid creating
them, in places where Linux normally creates Write=0,Dirty=1, it can use the
software-defined _PAGE_SAVED_DIRTY in place of the hardware _PAGE_DIRTY.
In other words, whenever Linux needs to create Write=0,Dirty=1, it instead
creates Write=0,SavedDirty=1 except for shadow stack, which is
Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
combinations would be set as follows for various types of memory:

(Write=0,SavedDirty=1,Dirty=0):
 - A modified, copy-on-write (COW) page. Previously when a typical
   anonymous writable mapping was made COW via fork(), the kernel would
   mark it Write=0,Dirty=1. Now it will instead use the SavedDirty bit.
   This happens in copy_present_pte().
 - A R/O page that has been COW'ed. The user page is in a R/O VMA,
   and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
   handler creates a copy of the page and sets the new copy's PTE as
   Write=0 and SavedDirty=1.
 - A shared shadow stack PTE. When a shadow stack page is being shared
   among processes (this happens at fork()), its PTE is made Dirty=0, so
   the next shadow stack access causes a fault, and the page is
   duplicated and Dirty=1 is set again. This is the COW equivalent for
   shadow stack pages, even though it's copy-on-access rather than
   copy-on-write.

(Write=0,SavedDirty=0,Dirty=1):
 - A shadow stack PTE.
 - A COW PTE created when a processor without shadow stack support set
   Dirty=1.
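
The bit combinations listed above can be modeled in a few lines of
user-space C. This is only an illustrative sketch, not kernel code: the
toy_* names and the toy_shstk_enabled flag are hypothetical stand-ins (the
flag plays the role of cpu_feature_enabled(X86_FEATURE_USER_SHSTK)), bits 1
and 6 are the x86 Write and Dirty bit positions, and bit 58 is taken from
_PAGE_BIT_SOFTW5 in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE: only the three bits the commit message talks about. */
typedef uint64_t toy_pte_t;

#define TOY_WRITE       (1ULL << 1)   /* hardware Write bit */
#define TOY_DIRTY       (1ULL << 6)   /* hardware Dirty bit */
#define TOY_SAVED_DIRTY (1ULL << 58)  /* software bit (assumed position) */

/* Hypothetical stand-in for cpu_feature_enabled(X86_FEATURE_USER_SHSTK). */
static bool toy_shstk_enabled = true;

/* Move the dirty state from the hardware bit to the software bit. */
static toy_pte_t toy_mksaveddirty(toy_pte_t pte)
{
	if (!toy_shstk_enabled)
		return pte;

	pte &= ~TOY_DIRTY;
	return pte | TOY_SAVED_DIRTY;
}

/* Move the dirty state back from the software bit to the hardware bit. */
static toy_pte_t toy_clear_saveddirty(toy_pte_t pte)
{
	if (!toy_shstk_enabled)
		return pte;

	pte |= TOY_DIRTY;
	return pte & ~TOY_SAVED_DIRTY;
}

/* Round trip through the two states, with shadow stack support enabled. */
static void toy_roundtrip_check(void)
{
	toy_pte_t pte = TOY_DIRTY;          /* Write=0,Dirty=1 */

	pte = toy_mksaveddirty(pte);
	assert(pte == TOY_SAVED_DIRTY);     /* Write=0,SavedDirty=1,Dirty=0 */

	pte = toy_clear_saveddirty(pte);
	assert(pte == TOY_DIRTY);           /* Write=0,SavedDirty=0,Dirty=1 */
}
```

With toy_shstk_enabled set to false both helpers return the PTE unchanged,
which corresponds to the case where the hardware Dirty bit can be used
directly.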

There are six bits left available to software in the 64-bit PTE after
consuming a bit for _PAGE_SAVED_DIRTY. No space is consumed in 32-bit
kernels because shadow stacks are not enabled there.

Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to start
creating _PAGE_SAVED_DIRTY PTEs will follow once other pieces are in place.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

---
v6:
 - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
 - Add _PAGE_SAVED_DIRTY to _PAGE_CHG_MASK

v5:
 - Fix log, comments and whitespace (Boris)
 - Remove capitalization on shadow stack (Boris)

v4:
 - Teach pte_flags_need_flush() about _PAGE_COW bit
 - Break apart patch for better bisectability

v3:
 - Add comment around _PAGE_TABLE in response to comment
   from (Andrew Cooper)
 - Check for PSE in pmd_shstk (Andrew Cooper)
 - Get to the point quicker in commit log (Andrew Cooper)
 - Clarify and reorder commit log for why the PTE bit examples have
   multiple entries. Apply same changes for comment. (peterz)
 - Fix comment that implied dirty bit for COW was a specific x86 thing
   (peterz)
 - Fix swapping of Write/Dirty (PeterZ)
---
 arch/x86/include/asm/pgtable.h       | 79 ++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable_types.h | 65 ++++++++++++++++++++---
 arch/x86/include/asm/tlbflush.h      |  3 +-
 3 files changed, 138 insertions(+), 9 deletions(-)

Comments

David Hildenbrand Feb. 20, 2023, 11:32 a.m. UTC | #1
On 18.02.23 22:14, Rick Edgecombe wrote:
> Some OSes have a greater dependence on software available bits in PTEs than
> Linux. That left the hardware architects looking for a way to represent a
> new memory type (shadow stack) within the existing bits. They chose to
> repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
> shadow stack memory, Linux should avoid creating memory with this PTE bit
> combination unless it intends for it to be shadow stack.
> 
> The reason it's lightly used is that Dirty=1 is normally set by HW
> _before_ a write. A write with a Write=0 PTE would typically only generate
> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
> supports shadow stacks will no longer exhibit this oddity.
> 
> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
> in places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_SAVED_DIRTY in place of the hardware _PAGE_DIRTY.
> In other words, whenever Linux needs to create Write=0,Dirty=1, it instead
> creates Write=0,SavedDirty=1 except for shadow stack, which is
> Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
> combinations would be set as follows for various types of memory:
I would simplify (see below) and not repeat, in that much detail, what the
patch already contains as comments.

> 
> (Write=0,SavedDirty=1,Dirty=0):
>   - A modified, copy-on-write (COW) page. Previously when a typical
>     anonymous writable mapping was made COW via fork(), the kernel would
>     mark it Write=0,Dirty=1. Now it will instead use the SavedDirty bit.
>     This happens in copy_present_pte().
>   - A R/O page that has been COW'ed. The user page is in a R/O VMA,
>     and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
>     handler creates a copy of the page and sets the new copy's PTE as
>     Write=0 and SavedDirty=1.
>   - A shared shadow stack PTE. When a shadow stack page is being shared
>     among processes (this happens at fork()), its PTE is made Dirty=0, so
>     the next shadow stack access causes a fault, and the page is
>     duplicated and Dirty=1 is set again. This is the COW equivalent for
>     shadow stack pages, even though it's copy-on-access rather than
>     copy-on-write.
> 
> (Write=0,SavedDirty=0,Dirty=1):
>   - A shadow stack PTE.
>   - A Cow PTE created when a processor without shadow stack support set
>     Dirty=1.
> 
> There are six bits left available to software in the 64-bit PTE after
> consuming a bit for _PAGE_SAVED_DIRTY. No space is consumed in 32-bit
> kernels because shadow stacks are not enabled there.
> 
> Implement only the infrastructure for _PAGE_SAVED_DIRTY. Changes to start
> creating _PAGE_SAVED_DIRTY PTEs will follow once other pieces are in place.
> 
> Tested-by: Pengfei Xu <pengfei.xu@intel.com>
> Tested-by: John Allen <john.allen@amd.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>
> Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> ---
> v6:
>   - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
>   - Add _PAGE_SAVED_DIRTY to _PAGE_CHG_MASK
> 
> v5:
>   - Fix log, comments and whitespace (Boris)
>   - Remove capitalization on shadow stack (Boris)
> 
> v4:
>   - Teach pte_flags_need_flush() about _PAGE_COW bit
>   - Break apart patch for better bisectability
> 
> v3:
>   - Add comment around _PAGE_TABLE in response to comment
>     from (Andrew Cooper)
>   - Check for PSE in pmd_shstk (Andrew Cooper)
>   - Get to the point quicker in commit log (Andrew Cooper)
>   - Clarify and reorder commit log for why the PTE bit examples have
>     multiple entries. Apply same changes for comment. (peterz)
>   - Fix comment that implied dirty bit for COW was a specific x86 thing
>     (peterz)
>   - Fix swapping of Write/Dirty (PeterZ)
> ---
>   arch/x86/include/asm/pgtable.h       | 79 ++++++++++++++++++++++++++++
>   arch/x86/include/asm/pgtable_types.h | 65 ++++++++++++++++++++---
>   arch/x86/include/asm/tlbflush.h      |  3 +-
>   3 files changed, 138 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2b423d697490..110e552eb602 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -301,6 +301,45 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>   	return native_make_pte(v & ~clear);
>   }
>   
> +/*
> + * COW and other write protection operations can result in Dirty=1,Write=0
> + * PTEs. But in the case of X86_FEATURE_USER_SHSTK, the software SavedDirty bit
> + * is used, since the Dirty=1,Write=0 will result in the memory being treated as
> + * shadow stack by the HW. So when creating dirty, write-protected memory, a
> + * software bit is used _PAGE_BIT_SAVED_DIRTY. The following functions
> + * pte_mksaveddirty() and pte_clear_saveddirty() take a conventional dirty,
> + * write-protected PTE (Write=0,Dirty=1) and transition it to the shadow stack
> + * compatible version. (Write=0,SavedDirty=1).
> + */
> +static inline pte_t pte_mksaveddirty(pte_t pte)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pte;
> +
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_set_flags(pte, _PAGE_SAVED_DIRTY);
> +}
> +
> +static inline pte_t pte_clear_saveddirty(pte_t pte)
> +{
> +	/*
> +	 * _PAGE_SAVED_DIRTY is unnecessary on !X86_FEATURE_USER_SHSTK kernels,
> +	 * since the HW dirty bit can be used without creating shadow stack
> +	 * memory. See the _PAGE_SAVED_DIRTY definition for more details.
> +	 */
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pte;
> +
> +	/*
> +	 * PTE is getting copied-on-write, so it will be dirtied
> +	 * if writable, or made shadow stack if shadow stack and
> +	 * being copied on access. Set the dirty bit for both
> +	 * cases.
> +	 */
> +	pte = pte_set_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_SAVED_DIRTY);
> +}
> +
>   #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>   static inline int pte_uffd_wp(pte_t pte)
>   {
> @@ -420,6 +459,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
>   	return native_make_pmd(v & ~clear);
>   }
>   
> +/* See comments above pte_mksaveddirty() */
> +static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pmd;
> +
> +	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
> +	return pmd_set_flags(pmd, _PAGE_SAVED_DIRTY);
> +}
> +
> +/* See comments above pte_mksaveddirty() */
> +static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pmd;
> +
> +	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
> +	return pmd_clear_flags(pmd, _PAGE_SAVED_DIRTY);
> +}
> +
>   #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>   static inline int pmd_uffd_wp(pmd_t pmd)
>   {
> @@ -491,6 +550,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
>   	return native_make_pud(v & ~clear);
>   }
>   
> +/* See comments above pte_mksaveddirty() */
> +static inline pud_t pud_mksaveddirty(pud_t pud)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pud;
> +
> +	pud = pud_clear_flags(pud, _PAGE_DIRTY);
> +	return pud_set_flags(pud, _PAGE_SAVED_DIRTY);
> +}
> +
> +/* See comments above pte_mksaveddirty() */
> +static inline pud_t pud_clear_saveddirty(pud_t pud)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
> +		return pud;
> +
> +	pud = pud_set_flags(pud, _PAGE_DIRTY);
> +	return pud_clear_flags(pud, _PAGE_SAVED_DIRTY);
> +}
> +
>   static inline pud_t pud_mkold(pud_t pud)
>   {
>   	return pud_clear_flags(pud, _PAGE_ACCESSED);
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 0646ad00178b..3b420b6c0584 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -21,7 +21,8 @@
>   #define _PAGE_BIT_SOFTW2	10	/* " */
>   #define _PAGE_BIT_SOFTW3	11	/* " */
>   #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
> -#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
> +#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
> +#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
>   #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
>   #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
>   #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
> @@ -34,6 +35,15 @@
>   #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
>   #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
>   
> +/*
> + * Indicates a Saved Dirty bit page.
> + */
> +#ifdef CONFIG_X86_USER_SHADOW_STACK
> +#define _PAGE_BIT_SAVED_DIRTY		_PAGE_BIT_SOFTW5 /* copy-on-write */

Nope, not "copy-on-write" :) It's more like "dirty bit when the hw-dirty 
bit cannot be used". Maybe simply drop the comment.

> +#else
> +#define _PAGE_BIT_SAVED_DIRTY		0
> +#endif
> +
>   /* If _PAGE_BIT_PRESENT is clear, we use these: */
>   /* - if the user mapped it with PROT_NONE; pte_present gives true */
>   #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
> @@ -117,6 +127,40 @@
>   #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
>   #endif
>   
> +/*
> + * The hardware requires shadow stack to be read-only and Dirty.
> + * _PAGE_SAVED_DIRTY is a software-only bit used to separate copy-on-write
> + * PTEs from shadow stack PTEs:

I'd suggest phrasing this differently. COW is just one scenario where 
this can happen. Also, I don't think that the description of 
"separation" is correct.

Something like the following maybe?

"
However, there are valid cases where the kernel might create read-only
PTEs that are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty
tracking). In this case, the _PAGE_SAVED_DIRTY bit is used instead of
the HW-dirty bit, to avoid creating wrong "shadow stack" PTEs. Such
PTEs have (Write=0,SavedDirty=1,Dirty=0) set.

Note that on processors without shadow stack support, the
_PAGE_SAVED_DIRTY remains unused.
"

Then I would simply drop the below (which is also too COW-specific I think).

> + *
> + * (Write=0,SavedDirty=1,Dirty=0):
> + *  - A modified, copy-on-write (COW) page. Previously when a typical
> + *    anonymous writable mapping was made COW via fork(), the kernel would
> + *    mark it Write=0,Dirty=1. Now it will instead use the Cow bit. This
> + *    happens in copy_present_pte().
> + *  - A R/O page that has been COW'ed. The user page is in a R/O VMA,
> + *    and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
> + *    handler creates a copy of the page and sets the new copy's PTE as
> + *    Write=0 and SavedDirty=1.
> + *  - A shared shadow stack PTE. When a shadow stack page is being shared
> + *    among processes (this happens at fork()), its PTE is made Dirty=0, so
> + *    the next shadow stack access causes a fault, and the page is
> + *    duplicated and Dirty=1 is set again. This is the COW equivalent for
> + *    shadow stack pages, even though it's copy-on-access rather than
> + *    copy-on-write.
> + *
> + * (Write=0,SavedDirty=0,Dirty=1):
> + *  - A shadow stack PTE.
> + *  - A Cow PTE created when a processor without shadow stack support set
> + *    Dirty=1.
> + */
Rick Edgecombe Feb. 20, 2023, 9:38 p.m. UTC | #2
On Mon, 2023-02-20 at 12:32 +0100, David Hildenbrand wrote:
> On 18.02.23 22:14, Rick Edgecombe wrote:
> > Some OSes have a greater dependence on software available bits in PTEs than
> > Linux. That left the hardware architects looking for a way to represent a
> > new memory type (shadow stack) within the existing bits. They chose to
> > repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
> > shadow stack memory, Linux should avoid creating memory with this PTE bit
> > combination unless it intends for it to be shadow stack.
> > 
> > The reason it's lightly used is that Dirty=1 is normally set by HW
> > _before_ a write. A write with a Write=0 PTE would typically only generate
> > a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
> > generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
> > supports shadow stacks will no longer exhibit this oddity.
> > 
> > So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
> > in places where Linux normally creates Write=0,Dirty=1, it can use the
> > software-defined _PAGE_SAVED_DIRTY in place of the hardware _PAGE_DIRTY.
> > In other words, whenever Linux needs to create Write=0,Dirty=1, it instead
> > creates Write=0,SavedDirty=1 except for shadow stack, which is
> > Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
> > combinations would be set as follows for various types of memory:
> 
> I would simplify (see below) and not repeat what the patch contains as
> comments already that detailed.

This verbiage has had quite a bit of x86 maintainer attention already.
I hear what you are saying, but I'm a bit hesitant to take style
suggestions at this point for fear of the situation where people ask
for changes back and forth across different versions. Unless any x86
maintainers want to chime in again? More responses below.

> > [...]
> > +/*
> > + * Indicates a Saved Dirty bit page.
> > + */
> > +#ifdef CONFIG_X86_USER_SHADOW_STACK
> > +#define _PAGE_BIT_SAVED_DIRTY		_PAGE_BIT_SOFTW5 /* copy-on-write */
> 
> Nope, not "copy-on-write" :) It's more like "dirty bit when the hw-dirty
> bit cannot be used". Maybe simply drop the comment.

Oops, I missed this when I scrubbed _PAGE_COW. Thanks. Will fix.

> 
> > +#else
> > +#define _PAGE_BIT_SAVED_DIRTY		0
> > +#endif
> > +
> >   /* If _PAGE_BIT_PRESENT is clear, we use these: */
> >   /* - if the user mapped it with PROT_NONE; pte_present gives true */
> >   #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
> > @@ -117,6 +127,40 @@
> >   #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
> >   #endif
> >   
> > +/*
> > + * The hardware requires shadow stack to be read-only and Dirty.
> > + * _PAGE_SAVED_DIRTY is a software-only bit used to separate copy-on-write
> > + * PTEs from shadow stack PTEs:
> 
> I'd suggest phrasing this differently. COW is just one scenario where
> this can happen. Also, I don't think that the description of
> "separation" is correct.
> 
> Something like the following maybe?
> 
> "
> However, there are valid cases where the kernel might create read-only
> PTEs that are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty
> tracking). In this case, the _PAGE_SAVED_DIRTY bit is used instead of
> the HW-dirty bit, to avoid creating wrong "shadow stack" PTEs. Such
> PTEs have (Write=0,SavedDirty=1,Dirty=0) set.
> 
> Note that on processors without shadow stack support, the
> _PAGE_SAVED_DIRTY remains unused.
> "
> 
> Then I would simply drop the below (which is also too COW-specific I think).

COW is the main situation where shadow stacks become read-only. So, as
an example it is nice in that COW covers all the scenarios discussed.
Again, do any x86 maintainers want to weigh in here?

David Hildenbrand Feb. 21, 2023, 8:38 a.m. UTC | #3
On 20.02.23 22:38, Edgecombe, Rick P wrote:
> On Mon, 2023-02-20 at 12:32 +0100, David Hildenbrand wrote:
>> On 18.02.23 22:14, Rick Edgecombe wrote:
>>> Some OSes have a greater dependence on software available bits in PTEs than
>>> Linux. That left the hardware architects looking for a way to represent a
>>> new memory type (shadow stack) within the existing bits. They chose to
>>> repurpose a lightly-used state: Write=0,Dirty=1. So in order to support
>>> shadow stack memory, Linux should avoid creating memory with this PTE bit
>>> combination unless it intends for it to be shadow stack.
>>>
>>> The reason it's lightly used is that Dirty=1 is normally set by HW
>>> _before_ a write. A write with a Write=0 PTE would typically only generate
>>> a fault, not set Dirty=1. Hardware can (rarely) both set Dirty=1 *and*
>>> generate the fault, resulting in a Write=0,Dirty=1 PTE. Hardware which
>>> supports shadow stacks will no longer exhibit this oddity.
>>>
>>> So that leaves Write=0,Dirty=1 PTEs created in software. To achieve this,
>>> in places where Linux normally creates Write=0,Dirty=1, it can use the
>>> software-defined _PAGE_SAVED_DIRTY in place of the hardware _PAGE_DIRTY.
>>> In other words, whenever Linux needs to create Write=0,Dirty=1, it instead
>>> creates Write=0,SavedDirty=1 except for shadow stack, which is
>>> Write=0,Dirty=1. Further differentiated by VMA flags, these PTE bit
>>> combinations would be set as follows for various types of memory:
>>
>> I would simplify (see below) and not repeat what the patch contains as
>> comments already that detailed.
> 
> This verbiage has had quite a bit of x86 maintainer attention already.
> I hear what you are saying, but I'm a bit hesitant to take style
> suggestions at this point for fear of the situation where people ask
> for changes back and forth across different versions. Unless any x86
> maintainers want to chime in again? More responses below.

Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
specialized. But whatever x86 maintainers prefer.

[...]

>> "
>> However, there are valid cases where the kernel might create read-only
>> PTEs that are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty
>> tracking). In this case, the _PAGE_SAVED_DIRTY bit is used instead of
>> the HW-dirty bit, to avoid creating wrong "shadow stack" PTEs. Such
>> PTEs have (Write=0,SavedDirty=1,Dirty=0) set.
>>
>> Note that on processors without shadow stack support, the
>> _PAGE_SAVED_DIRTY remains unused.
>> "
>>
>> Then I would simply drop the below (which is also too COW-specific I think).
> 
> COW is the main situation where shadow stacks become read-only. So, as
> an example it is nice in that COW covers all the scenarios discussed.
> Again, do any x86 maintainers want to weigh in here?

Again, I'd not specialize on COW in all patches too much (IMHO, it
creates more confusion than it actually helps for understanding what's
happening) and just call it a read-only PTE that is dirty. Simple as
that. And it's easy to see why that's problematic, because read-only
PTEs that are dirty would be identified as shadow stack PTEs, which we
want to work around.

Again, just my 2 cents. I'm not an x86 maintainer ;)
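
The point about read-only PTEs that are dirty can be illustrated with a
small user-space model. It is a sketch under stated assumptions, not the
kernel's implementation: all toy_* names are hypothetical, bits 1 and 6 are
the x86 Write and Dirty bit positions, bit 58 follows _PAGE_BIT_SOFTW5 from
this patch, and toy_wrprotect_safe() shows just one possible way a
write-protect operation could move Dirty into SavedDirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_pte_t;

#define TOY_WRITE       (1ULL << 1)   /* hardware Write bit */
#define TOY_DIRTY       (1ULL << 6)   /* hardware Dirty bit */
#define TOY_SAVED_DIRTY (1ULL << 58)  /* software bit (assumed position) */

/* How shadow-stack-capable hardware reads the entry: Write=0,Dirty=1. */
static bool toy_hw_sees_shadow_stack(toy_pte_t pte)
{
	return !(pte & TOY_WRITE) && (pte & TOY_DIRTY);
}

/* Software's view of "dirty": either the hardware or the software bit. */
static bool toy_is_dirty(toy_pte_t pte)
{
	return (pte & (TOY_DIRTY | TOY_SAVED_DIRTY)) != 0;
}

/* Naive write-protect: only drops Write, so a dirty PTE keeps Dirty=1. */
static toy_pte_t toy_wrprotect_naive(toy_pte_t pte)
{
	return pte & ~TOY_WRITE;
}

/* Write-protect that moves an existing Dirty bit into SavedDirty. */
static toy_pte_t toy_wrprotect_safe(toy_pte_t pte)
{
	pte &= ~TOY_WRITE;
	if (pte & TOY_DIRTY) {
		pte &= ~TOY_DIRTY;
		pte |= TOY_SAVED_DIRTY;
	}
	return pte;
}

/* A writable, dirty page made read-only in both ways. */
static void toy_wrprotect_check(void)
{
	toy_pte_t pte = TOY_WRITE | TOY_DIRTY;

	/* The naive result is indistinguishable from shadow stack memory. */
	assert(toy_hw_sees_shadow_stack(toy_wrprotect_naive(pte)));

	/* The safe result is not, yet the dirty information is preserved. */
	assert(!toy_hw_sees_shadow_stack(toy_wrprotect_safe(pte)));
	assert(toy_is_dirty(toy_wrprotect_safe(pte)));
}
```

Nothing in the model is specific to COW: any operation that removes Write
from a dirty PTE runs into the same problem.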
Rick Edgecombe Feb. 21, 2023, 8:08 p.m. UTC | #4
On Tue, 2023-02-21 at 09:38 +0100, David Hildenbrand wrote:
> Again, I'd not specialize on COW in all patches too much (IMHO, it
> creates more confusion than it actually helps for understanding what's
> happening) and just call it a read-only PTE that is dirty. Simple as
> that. And it's easy to see why that's problematic, because read-only
> PTEs that are dirty would be identified as shadow stack PTEs, which we
> want to work around.
> 
> Again, just my 2 cents. I'm not an x86 maintainer ;)

Right, I see the point. Let's see if they have any opinion. There is a
bit of a historical reason for the focus on COW. As you well know the
dirty bit used to be important for that case. But I think it's still
not a terrible example. It covers some typical cases, but yes, we don't
want to mislead the reader into thinking it is a COW-only scenario.
Dave Hansen Feb. 21, 2023, 8:13 p.m. UTC | #5
On 2/21/23 00:38, David Hildenbrand wrote:
> Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
> specialized. But whatever x86 maintainers prefer.

At this point, I'm not going to be too nitpicky.  I personally think we
need to get _something_ merged.  We can then nitpick it to death once
it's in the tree.

So I prefer whatever will move the set along. ;)
Rick Edgecombe Feb. 22, 2023, 1:02 a.m. UTC | #6
On Tue, 2023-02-21 at 12:13 -0800, Dave Hansen wrote:
> On 2/21/23 00:38, David Hildenbrand wrote:
> > Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
> > specialized. But whatever x86 maintainers prefer.
> 
> At this point, I'm not going to be too nitpicky.  I personally think we
> need to get _something_ merged.  We can then nitpick it to death once
> it's in the tree.
> 
> So I prefer whatever will move the set along. ;)

Ok, David's general suggestion across these x86/mm patches is to make
things less COW specific. Sounds like you don't have a problem with
that. I'll just do that and hope I don't stir up any additional
concerns. Thanks all.
David Hildenbrand Feb. 22, 2023, 9:05 a.m. UTC | #7
On 21.02.23 21:13, Dave Hansen wrote:
> On 2/21/23 00:38, David Hildenbrand wrote:
>> Sure, for my taste this is (1) too repetitive (2) too verbose (3) too
>> specialized. But whatever x86 maintainers prefer.
> 
> At this point, I'm not going to be too nitpicky.  I personally think we
> need to get _something_ merged.  We can then nitpick it to death once
> it's in the tree.

Yes, but ... do we have to rush right now?

This series wasn't in -next and we're in the merge window. Is the plan 
to still include it into this merge window?

Also, I think concise patch descriptions and comments are not 
necessarily nitpicking like "please rename that variable".

> 
> So I prefer whatever will move the set along. ;)

If the plan is to merge it in the next merge window (which I suspect, 
but I might be wrong), I suggest including it in -next fairly soonish, 
and in the meantime, polish the remaining bits.

Knowing the plan would be good ;)
Dave Hansen Feb. 22, 2023, 5:23 p.m. UTC | #8
On 2/22/23 01:05, David Hildenbrand wrote:
> This series wasn't in -next and we're in the merge window. Is the plan
> to still include it into this merge window?

No way.  It's 6.4 material at the earliest.

I'm just saying to Rick not to worry _too_ much about earlier feedback
from me if folks have more recent review feedback.
David Hildenbrand Feb. 22, 2023, 5:27 p.m. UTC | #9
On 22.02.23 18:23, Dave Hansen wrote:
> On 2/22/23 01:05, David Hildenbrand wrote:
>> This series wasn't in -next and we're in the merge window. Is the plan
>> to still include it into this merge window?
> 
> No way.  It's 6.4 material at the earliest.
> 
> I'm just saying to Rick not to worry _too_ much about earlier feedback
> from me if folks have more recent review feedback.

Great. So I hope we can get this into -next soon and that we'll only get 
non-earth-shattering feedback so this can land in 6.4.
Kees Cook Feb. 22, 2023, 5:42 p.m. UTC | #10
On February 22, 2023 9:27:35 AM PST, David Hildenbrand <david@redhat.com> wrote:
>On 22.02.23 18:23, Dave Hansen wrote:
>> On 2/22/23 01:05, David Hildenbrand wrote:
>>> This series wasn't in -next and we're in the merge window. Is the plan
>>> to still include it into this merge window?
>> 
>> No way.  It's 6.4 material at the earliest.
>> 
>> I'm just saying to Rick not to worry _too_ much about earlier feedback
>> from me if folks have more recent review feedback.
>
>Great. So I hope we can get this into -next soon and that we'll only get non-earth-shattering feedback so this can land in 6.4.

Yes please. Who's going to take it? :)
Dave Hansen Feb. 22, 2023, 5:54 p.m. UTC | #11
On 2/22/23 09:42, Kees Cook wrote:
> On February 22, 2023 9:27:35 AM PST, David Hildenbrand <david@redhat.com> wrote:
>> On 22.02.23 18:23, Dave Hansen wrote:
>>> On 2/22/23 01:05, David Hildenbrand wrote:
>>>> This series wasn't in -next and we're in the merge window. Is the plan
>>>> to still include it into this merge window?
>>> No way.  It's 6.4 material at the earliest.
>>>
>>> I'm just saying to Rick not to worry _too_ much about earlier feedback
>>> from me if folks have more recent review feedback.
>> Great. So I hope we can get this into -next soon and that we'll only get non-earth-shattering feedback so this can land in 6.4.
> Yes please. Who's going to take it? 
Kees Cook Feb. 22, 2023, 7:39 p.m. UTC | #12
On Wed, Feb 22, 2023 at 09:54:36AM -0800, Dave Hansen wrote:
> On 2/22/23 09:42, Kees Cook wrote:
> > On February 22, 2023 9:27:35 AM PST, David Hildenbrand <david@redhat.com> wrote:
> >> On 22.02.23 18:23, Dave Hansen wrote:
> >>> On 2/22/23 01:05, David Hildenbrand wrote:
> >>>> This series wasn't in -next and we're in the merge window. Is the plan
> >>>> to still include it into this merge window?
> >>> No way.  It's 6.4 material at the earliest.
> >>>
> >>> I'm just saying to Rick not to worry _too_ much about earlier feedback
> >>> from me if folks have more recent review feedback.
> >> Great. So I hope we can get this into -next soon and that we'll only get non-earth-shattering feedback so this can land in 6.4.
> > Yes please. Who's going to take it? 

Patch
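
The Dirty/SavedDirty transitions performed by the helpers in this patch can be exercised in user space with a small model. The sketch below mirrors the logic of pte_mksaveddirty() and pte_clear_saveddirty() on a plain 64-bit value. It is an approximation for illustration only: `shstk_enabled` is a hypothetical stand-in for cpu_feature_enabled(X86_FEATURE_USER_SHSTK), the function names are shortened, and bit 58 is assumed from the _PAGE_BIT_SOFTW5 definition in this version of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of an x86-64 PTE value; illustrative only. */
typedef uint64_t pteval_t;

#define PAGE_DIRTY       ((pteval_t)1 << 6)   /* hardware Dirty bit */
#define PAGE_SAVED_DIRTY ((pteval_t)1 << 58)  /* software bit, assumed position */

/* Hypothetical stand-in for cpu_feature_enabled(X86_FEATURE_USER_SHSTK). */
static bool shstk_enabled = true;

/* Move the dirty state out of the hardware bit into the software bit. */
static inline pteval_t mksaveddirty(pteval_t pte)
{
	if (!shstk_enabled)
		return pte;

	pte &= ~PAGE_DIRTY;
	return pte | PAGE_SAVED_DIRTY;
}

/* Move the dirty state back from the software bit into the hardware bit. */
static inline pteval_t clear_saveddirty(pteval_t pte)
{
	if (!shstk_enabled)
		return pte;

	pte |= PAGE_DIRTY;
	return pte & ~PAGE_SAVED_DIRTY;
}

/* Round trip with the feature on, and the no-op behaviour with it off. */
static inline void saveddirty_selfcheck(void)
{
	shstk_enabled = true;
	assert(mksaveddirty(PAGE_DIRTY) == PAGE_SAVED_DIRTY);
	assert(clear_saveddirty(PAGE_SAVED_DIRTY) == PAGE_DIRTY);

	shstk_enabled = false;
	assert(mksaveddirty(PAGE_DIRTY) == PAGE_DIRTY);
	assert(clear_saveddirty(PAGE_DIRTY) == PAGE_DIRTY);

	shstk_enabled = true;
}
```

With the feature disabled both helpers return the PTE unchanged, which matches the early returns in the patch and leaves the hardware Dirty bit as the only dirty indicator.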

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2b423d697490..110e552eb602 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -301,6 +301,45 @@  static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+/*
+ * COW and other write protection operations can result in Dirty=1,Write=0
+ * PTEs. But in the case of X86_FEATURE_USER_SHSTK, the software SavedDirty bit
+ * is used, since Dirty=1,Write=0 will result in the memory being treated as
+ * shadow stack by the HW. So when creating dirty, write-protected memory, the
+ * software bit _PAGE_BIT_SAVED_DIRTY is used instead. pte_mksaveddirty() takes
+ * a conventional dirty, write-protected PTE (Write=0,Dirty=1) and transitions
+ * it to the shadow stack compatible version (Write=0,SavedDirty=1);
+ * pte_clear_saveddirty() performs the reverse transition.
+ */
+static inline pte_t pte_mksaveddirty(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_SAVED_DIRTY);
+}
+
+static inline pte_t pte_clear_saveddirty(pte_t pte)
+{
+	/*
+	 * _PAGE_SAVED_DIRTY is unnecessary on !X86_FEATURE_USER_SHSTK kernels,
+	 * since the HW dirty bit can be used without creating shadow stack
+	 * memory. See the _PAGE_SAVED_DIRTY definition for more details.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pte;
+
+	/*
+	 * PTE is getting copied-on-write, so it will be dirtied
+	 * if writable, or made shadow stack if shadow stack and
+	 * being copied on access. Set the dirty bit for both
+	 * cases.
+	 */
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_SAVED_DIRTY);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -420,6 +459,26 @@  static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+/* See comments above pte_mksaveddirty() */
+static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_SAVED_DIRTY);
+}
+
+/* See comments above pte_mksaveddirty() */
+static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_SAVED_DIRTY);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
@@ -491,6 +550,26 @@  static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+/* See comments above pte_mksaveddirty() */
+static inline pud_t pud_mksaveddirty(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_SAVED_DIRTY);
+}
+
+/* See comments above pte_mksaveddirty() */
+static inline pud_t pud_clear_saveddirty(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_SAVED_DIRTY);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0646ad00178b..3b420b6c0584 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@ 
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
-#define _PAGE_BIT_SOFTW4	58	/* available for programmer */
+#define _PAGE_BIT_SOFTW4	57	/* available for programmer */
+#define _PAGE_BIT_SOFTW5	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
 #define _PAGE_BIT_PKEY_BIT2	61	/* Protection Keys, bit 3/4 */
@@ -34,6 +35,15 @@ 
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
 
+/*
+ * Indicates a Saved Dirty bit page.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_BIT_SAVED_DIRTY		_PAGE_BIT_SOFTW5 /* Saved Dirty bit */
+#else
+#define _PAGE_BIT_SAVED_DIRTY		0
+#endif
+
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
 /* - if the user mapped it with PROT_NONE; pte_present gives true */
 #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL
@@ -117,6 +127,40 @@ 
 #define _PAGE_SOFTW4	(_AT(pteval_t, 0))
 #endif
 
+/*
+ * The hardware requires shadow stack to be read-only and Dirty.
+ * _PAGE_SAVED_DIRTY is a software-only bit used to separate copy-on-write
+ * PTEs from shadow stack PTEs:
+ *
+ * (Write=0,SavedDirty=1,Dirty=0):
+ *  - A modified, copy-on-write (COW) page. Previously when a typical
+ *    anonymous writable mapping was made COW via fork(), the kernel would
+ *    mark it Write=0,Dirty=1. Now it will instead use the SavedDirty bit.
+ *    This happens in copy_present_pte().
+ *  - A R/O page that has been COW'ed. The user page is in a R/O VMA,
+ *    and get_user_pages(FOLL_FORCE) needs a writable copy. The page fault
+ *    handler creates a copy of the page and sets the new copy's PTE as
+ *    Write=0 and SavedDirty=1.
+ *  - A shared shadow stack PTE. When a shadow stack page is being shared
+ *    among processes (this happens at fork()), its PTE is made Dirty=0, so
+ *    the next shadow stack access causes a fault, and the page is
+ *    duplicated and Dirty=1 is set again. This is the COW equivalent for
+ *    shadow stack pages, even though it's copy-on-access rather than
+ *    copy-on-write.
+ *
+ * (Write=0,SavedDirty=0,Dirty=1):
+ *  - A shadow stack PTE.
+ *  - A COW PTE created when a processor without shadow stack support set
+ *    Dirty=1.
+ */
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+#define _PAGE_SAVED_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_SAVED_DIRTY)
+#else
+#define _PAGE_SAVED_DIRTY	(_AT(pteval_t, 0))
+#endif
+
+#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_SAVED_DIRTY)
+
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
 /*
@@ -125,9 +169,9 @@ 
  * instance, and is *not* included in this mask since
  * pte_modify() does modify it.
  */
-#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
-			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
-			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |  \
+#define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		     \
+			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY_BITS | \
+			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |	     \
 			 _PAGE_UFFD_WP)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
@@ -186,12 +230,17 @@  enum page_cache_mode {
 #define PAGE_READONLY	     __pg(__PP|   0|_USR|___A|__NX|   0|   0|   0)
 #define PAGE_READONLY_EXEC   __pg(__PP|   0|_USR|___A|   0|   0|   0|   0)
 
-#define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
-#define _KERNPG_TABLE_NOENC	 (__PP|__RW|   0|___A|   0|___D|   0|   0)
-#define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
+/*
+ * Page tables need to have Write=1 in order for any lower PTEs to be
+ * writable. This includes shadow stack memory (Write=0, Dirty=1).
+ */
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
+#define _KERNPG_TABLE_NOENC	 (__PP|__RW|   0|___A|   0|___D|   0|   0)
+#define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
+
+#define __PAGE_KERNEL		 (__PP|__RW|   0|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_EXEC	 (__PP|__RW|   0|___A|   0|___D|   0|___G)
 #define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..6c5ef14060a8 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -273,7 +273,8 @@  static inline bool pte_flags_need_flush(unsigned long oldflags,
 	const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
 					_PAGE_ACCESSED;
 	const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
-					_PAGE_SOFTW3 | _PAGE_SOFTW4;
+					_PAGE_SOFTW3 | _PAGE_SOFTW4 |
+					_PAGE_SAVED_DIRTY;
 	const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
 			  _PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
 			  _PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |