Message ID | 20220929222936.14584-11-rick.p.edgecombe@intel.com (mailing list archive) |
---|---|
State | New |
Series | Shadowstacks for userspace |
On Fri, Sep 30, 2022 at 12:30 AM Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
> stacks will no longer exhibit this oddity.

Stupid question, since I just recently learned that IOMMUv2 is a thing: I assume this also holds for IOMMUs that implement IOMMUv2/SVA, where the IOMMU directly walks the userspace page tables, and not just for the CPU core?
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> +/*
> + * Normally the Dirty bit is used to denote COW memory on x86. But
> + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> + * since the Dirty=1,Write=0 will result in the memory being treated
> + * as shaodw stack by the HW. So when creating COW memory, a software
> + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
> + * pte_clear_cow() take a PTE marked conventially COW (Dirty=1) and
> + * transition it to the shadow stack compatible version of COW (Cow=1).
> + */
> +
> +static inline pte_t pte_mkcow(pte_t pte)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return pte;
> +
> +	pte = pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_set_flags(pte, _PAGE_COW);
> +}
> +
> +static inline pte_t pte_clear_cow(pte_t pte)
> +{
> +	/*
> +	 * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
> +	 * See the _PAGE_COW definition for more details.
> +	 */
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return pte;
> +
> +	/*
> +	 * PTE is getting copied-on-write, so it will be dirtied
> +	 * if writable, or made shadow stack if shadow stack and
> +	 * being copied on access. Set they dirty bit for both
> +	 * cases.
> +	 */
> +	pte = pte_set_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_COW);
> +}

These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the _PAGE_COW logic for all machines with 64-bit entries. It will get you much more coverage and more universal rules.

> +
>  #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
>  static inline int pte_uffd_wp(pte_t pte)
>  {
> @@ -319,7 +381,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
>
>  static inline pte_t pte_mkclean(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_DIRTY);
> +	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
>  }
>
>  static inline pte_t pte_mkold(pte_t pte)
> @@ -329,7 +391,16 @@ static inline pte_t pte_mkold(pte_t pte)
>
>  static inline pte_t pte_wrprotect(pte_t pte)
>  {
> -	return pte_clear_flags(pte, _PAGE_RW);
> +	pte = pte_clear_flags(pte, _PAGE_RW);
> +
> +	/*
> +	 * Blindly clearing _PAGE_RW might accidentally create
> +	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
> +	 * dirty value to the software bit.
> +	 */
> +	if (pte_dirty(pte))
> +		pte = pte_mkcow(pte);
> +	return pte;
>  }

Hm. What about ptep/pmdp_set_wrprotect()? They clear _PAGE_RW blindly.
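
The hazard Kirill raises can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the `model_` names are hypothetical, PTEs are plain `uint64_t` values, bits 1 (R/W) and 6 (Dirty) follow the x86 layout, the position of the software Cow bit is assumed for illustration, and the shadow stack feature is taken to be enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not the kernel's definitions. */
#define MODEL_PAGE_RW    (UINT64_C(1) << 1)   /* x86 R/W bit */
#define MODEL_PAGE_DIRTY (UINT64_C(1) << 6)   /* x86 Dirty bit */
#define MODEL_PAGE_COW   (UINT64_C(1) << 58)  /* assumed software bit */

/* True when the flags match the hardware shadow stack encoding. */
static bool model_is_shstk_encoding(uint64_t pte)
{
	return (pte & (MODEL_PAGE_RW | MODEL_PAGE_DIRTY)) == MODEL_PAGE_DIRTY;
}

/* A write-protect that only clears R/W, with no regard for Dirty. */
static uint64_t model_wrprotect_blind(uint64_t pte)
{
	return pte & ~MODEL_PAGE_RW;
}

/* Follows the shape of the patch's pte_wrprotect(): Dirty moves to Cow. */
static uint64_t model_wrprotect_cow(uint64_t pte)
{
	pte &= ~MODEL_PAGE_RW;
	if (pte & (MODEL_PAGE_DIRTY | MODEL_PAGE_COW)) {
		pte &= ~MODEL_PAGE_DIRTY;
		pte |= MODEL_PAGE_COW;
	}
	return pte;
}
```

In this model a dirty, writable entry passed through `model_wrprotect_blind()` lands on the shadow stack encoding, while `model_wrprotect_cow()` leaves it as Write=0,Dirty=0,Cow=1. Any helper that clears R/W without the second step would need the same treatment.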
On Mon, 2022-10-03 at 19:26 +0300, Kirill A . Shutemov wrote: > On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote: > > +/* > > + * Normally the Dirty bit is used to denote COW memory on x86. But > > + * in the case of X86_FEATURE_SHSTK, the software COW bit is used, > > + * since the Dirty=1,Write=0 will result in the memory being > > treated > > + * as shaodw stack by the HW. So when creating COW memory, a > > software > > + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() > > and > > + * pte_clear_cow() take a PTE marked conventially COW (Dirty=1) > > and > > + * transition it to the shadow stack compatible version of COW > > (Cow=1). > > + */ > > + > > +static inline pte_t pte_mkcow(pte_t pte) > > +{ > > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > > + return pte; > > + > > + pte = pte_clear_flags(pte, _PAGE_DIRTY); > > + return pte_set_flags(pte, _PAGE_COW); > > +} > > + > > +static inline pte_t pte_clear_cow(pte_t pte) > > +{ > > + /* > > + * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels. > > + * See the _PAGE_COW definition for more details. > > + */ > > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > > + return pte; > > + > > + /* > > + * PTE is getting copied-on-write, so it will be dirtied > > + * if writable, or made shadow stack if shadow stack and > > + * being copied on access. Set they dirty bit for both > > + * cases. > > + */ > > + pte = pte_set_flags(pte, _PAGE_DIRTY); > > + return pte_clear_flags(pte, _PAGE_COW); > > +} > > These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the > _PAGE_COW > logic for all machines with 64-bit entries. It will get you much more > coverage and more universal rules. Yes, I didn't like them either at first. The reasoning originally was that _PAGE_COW is a bit more work and it might show up for some benchmark. Looking at this again though, it is just a few more operations on memory that is already getting touched either way. 
It must be a very tiny amount of impact if any. I'm fine removing them. Having just one set of logic around this would make it easier to reason about. Dave, any thoughts on this?
On Mon, Oct 3, 2022 at 11:36 PM Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote: > On Mon, 2022-10-03 at 19:26 +0300, Kirill A . Shutemov wrote: > > On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote: > > > +/* > > > + * Normally the Dirty bit is used to denote COW memory on x86. But > > > + * in the case of X86_FEATURE_SHSTK, the software COW bit is used, > > > + * since the Dirty=1,Write=0 will result in the memory being > > > treated > > > + * as shaodw stack by the HW. So when creating COW memory, a > > > software > > > + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() > > > and > > > + * pte_clear_cow() take a PTE marked conventially COW (Dirty=1) > > > and > > > + * transition it to the shadow stack compatible version of COW > > > (Cow=1). > > > + */ > > > + > > > +static inline pte_t pte_mkcow(pte_t pte) > > > +{ > > > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > > > + return pte; > > > + > > > + pte = pte_clear_flags(pte, _PAGE_DIRTY); > > > + return pte_set_flags(pte, _PAGE_COW); > > > +} > > > + > > > +static inline pte_t pte_clear_cow(pte_t pte) > > > +{ > > > + /* > > > + * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels. > > > + * See the _PAGE_COW definition for more details. > > > + */ > > > + if (!cpu_feature_enabled(X86_FEATURE_SHSTK)) > > > + return pte; > > > + > > > + /* > > > + * PTE is getting copied-on-write, so it will be dirtied > > > + * if writable, or made shadow stack if shadow stack and > > > + * being copied on access. Set they dirty bit for both > > > + * cases. > > > + */ > > > + pte = pte_set_flags(pte, _PAGE_DIRTY); > > > + return pte_clear_flags(pte, _PAGE_COW); > > > +} > > > > These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the > > _PAGE_COW > > logic for all machines with 64-bit entries. It will get you much more > > coverage and more universal rules. > > Yes, I didn't like them either at first. 
The reasoning originally was > that _PAGE_COW is a bit more work and it might show up for some > benchmark. > > Looking at this again though, it is just a few more operations on > memory that is already getting touched either way. It must be a very > tiny amount of impact if any. I'm fine removing them. Having just one > set of logic around this would make it easier to reason about. > > Dave, any thoughts on this? But the rules wouldn't actually be universal - you'd still have to look at X86_FEATURE_SHSTK in code that wants to figure out whether a PTE is shadow stack (on a newer CPU) or readonly dirty (on an older CPU that can set dirty bits on non-present PTEs), right?
On 10/3/22 14:36, Edgecombe, Rick P wrote:
>>> +static inline pte_t pte_clear_cow(pte_t pte)
>>> +{
>>> +	/*
>>> +	 * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
>>> +	 * See the _PAGE_COW definition for more details.
>>> +	 */
>>> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
>>> +		return pte;
>>> +
>>> +	/*
>>> +	 * PTE is getting copied-on-write, so it will be dirtied
>>> +	 * if writable, or made shadow stack if shadow stack and
>>> +	 * being copied on access. Set they dirty bit for both
>>> +	 * cases.
>>> +	 */
>>> +	pte = pte_set_flags(pte, _PAGE_DIRTY);
>>> +	return pte_clear_flags(pte, _PAGE_COW);
>>> +}
>> These X86_FEATURE_SHSTK checks make me uneasy. Maybe use the
>> _PAGE_COW
>> logic for all machines with 64-bit entries. It will get you much more
>> coverage and more universal rules.
> Yes, I didn't like them either at first. The reasoning originally was
> that _PAGE_COW is a bit more work and it might show up for some
> benchmark.
>
> Looking at this again though, it is just a few more operations on
> memory that is already getting touched either way. It must be a very
> tiny amount of impact if any. I'm fine removing them. Having just one
> set of logic around this would make it easier to reason about.
>
> Dave, any thoughts on this?

The cpu_feature_enabled(X86_FEATURE_SHSTK) checks enable both compile-time and runtime optimization.

What makes this even more fun is:

+#ifdef CONFIG_X86_SHADOW_STACK
+#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW)
+#else
+#define _PAGE_COW (_AT(pteval_t, 0))
+#endif

which I think means that the pte_clear_flags() goes away if CONFIG_X86_SHADOW_STACK is disabled.

So, what Rick posted here ends up doing the following:

          | X86_FEATURE_SHSTK=1 | X86_FEATURE_SHSTK=0
==========+=====================+========================
CONFIG=n  | compiled out        | compiled out
CONFIG=y  | set/clear           | boot-time patched out

If we pull the cpu_feature_enabled() out, I think we end up getting behavior like this:

          | X86_FEATURE_SHSTK=1 | X86_FEATURE_SHSTK=0
==========+=====================+========================
CONFIG=n  | set _PAGE_DIRTY     | set _PAGE_DIRTY
CONFIG=y  | set/clear           | set/clear

It ends up adding instruction overhead (set _PAGE_DIRTY) to two cases where it was completely compiled out before. It also adds runtime overhead (the "tiny amount of impact" you mentioned) to set/clear where it would have been runtime patched out before.

None of this is a deal breaker in terms of runtime overhead. But, I do think the benefits of the cpu_feature_enabled() are worth it, even if it's just an optimization. You could move it to the end of the series and we can debate it on its own merits if you want.
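
The CONFIG=n rows of the two tables can be sketched in a user-space model. The names below are hypothetical, `MODEL_SHADOW_STACK` stands in for CONFIG_X86_SHADOW_STACK, the Cow bit position is assumed, and the `shstk_enabled` parameter is only a runtime stand-in for cpu_feature_enabled(), which in the kernel can also be resolved at build time or patched at boot.

```c
#include <stdint.h>

/* Hypothetical model. Build with -DMODEL_SHADOW_STACK to model CONFIG=y. */
#define MODEL_PAGE_DIRTY (UINT64_C(1) << 6)   /* x86 Dirty bit */
#ifdef MODEL_SHADOW_STACK
#define MODEL_PAGE_COW   (UINT64_C(1) << 58)  /* assumed software bit */
#else
#define MODEL_PAGE_COW   (UINT64_C(0))        /* flag vanishes when configured out */
#endif

/* pte_clear_cow() with the feature check pulled out. */
static uint64_t model_clear_cow_unconditional(uint64_t pte)
{
	pte |= MODEL_PAGE_DIRTY;       /* still executed when CONFIG=n */
	return pte & ~MODEL_PAGE_COW;  /* "& ~0" changes nothing when CONFIG=n */
}

/* pte_clear_cow() in the shape that was posted, check included. */
static uint64_t model_clear_cow_checked(uint64_t pte, int shstk_enabled)
{
	if (!shstk_enabled)
		return pte;
	return model_clear_cow_unconditional(pte);
}
```

Built without `MODEL_SHADOW_STACK`, the unconditional variant still sets Dirty on every call, which is the "set _PAGE_DIRTY" cell in the second table, while the checked variant returns the entry untouched when the feature is off.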
On 29/09/2022 23:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0,Dirty=1.

How does "Some OSes have a greater dependence on software available bits in PTEs than Linux" sound?

> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
> stacks will no longer exhibit this oddity.

Again, an interesting anecdote but not salient information here.

> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 6496ec84b953..ad201dae7316 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -134,9 +142,17 @@ static inline int pte_young(pte_t pte)
>  	return pte_flags(pte) & _PAGE_ACCESSED;
>  }
>
> -static inline int pmd_dirty(pmd_t pmd)
> +static inline bool pmd_dirty(pmd_t pmd)
>  {
> -	return pmd_flags(pmd) & _PAGE_DIRTY;
> +	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
> +}
> +
> +static inline bool pmd_shstk(pmd_t pmd)
> +{
> +	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
> +		return false;
> +
> +	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;

(flags & PSE|RW|D) == PSE|D;

R/O+D can exist higher in the paging structures and does not convey type=shstk-ness to later stages of the walk.

However, there is a further complication which is bound to rear its head sooner or later, and warrants discussing.

type=shstk isn't actually only R/O+D on the leaf PTE; it's also R/W on the accumulated access rights on non-leaf PTEs.

Specifically, if you clear the RW bit on any higher level in the pagetable, then everything mapped by that PTE ceases to be of type shstk, even if the leaf has the R/O+D bit combination.

This is allegedly a feature for the database folks, where they can create R/O and R/W aliases of the same memory, sharing intermediate pagetables, where the R/W alias will set D bits per usual and the R/O alias need not transmogrify itself into a shadow stack.

~Andrew
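
Both of Andrew's points can be written out as a user-space sketch. The `model_` names and the flat array of upper-level flags are hypothetical simplifications; bits 1 (R/W), 6 (Dirty) and 7 (PSE) follow the x86 layout, and the shorthand `(flags & PSE|RW|D) == PSE|D` is fully parenthesized here because C would otherwise bind `&` before `|`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not the kernel's definitions. */
#define MODEL_PAGE_RW    (UINT64_C(1) << 1)  /* x86 R/W bit */
#define MODEL_PAGE_DIRTY (UINT64_C(1) << 6)  /* x86 Dirty bit */
#define MODEL_PAGE_PSE   (UINT64_C(1) << 7)  /* x86 page-size bit on a PMD/PUD leaf */

/* Leaf test for a huge-page PMD: a leaf that is read-only and dirty. */
static bool model_pmd_leaf_is_shstk(uint64_t pmd_flags)
{
	const uint64_t mask = MODEL_PAGE_PSE | MODEL_PAGE_RW | MODEL_PAGE_DIRTY;

	return (pmd_flags & mask) == (MODEL_PAGE_PSE | MODEL_PAGE_DIRTY);
}

/*
 * Whole-walk test: upper[] holds the flags of the non-leaf entries.
 * One non-leaf entry with R/W clear makes the mapping plain read-only.
 */
static bool model_walk_is_shstk(const uint64_t *upper, size_t n_upper,
				uint64_t leaf_flags)
{
	for (size_t i = 0; i < n_upper; i++) {
		if (!(upper[i] & MODEL_PAGE_RW))
			return false;
	}
	return (leaf_flags & (MODEL_PAGE_RW | MODEL_PAGE_DIRTY)) == MODEL_PAGE_DIRTY;
}
```

A PMD with Dirty set but PSE clear points at a page table and fails the leaf test, and a walk with R/W clear at any upper level fails the whole-walk test even though its leaf is R/O+D.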
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:

Mucho confusion here:

> (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
> (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
> (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack

are all identical cases;

> (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without

as are these.
On 10/4/22 19:17, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
>> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>
>> There is essentially no room left in the x86 hardware PTEs on some OSes
>> (not Linux). That left the hardware architects looking for a way to
>> represent a new memory type (shadow stack) within the existing bits.
>> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> How does "Some OSes have a greater dependence on software available bits
> in PTEs than Linux" sound?
>
>> The reason it's lightly used is that Dirty=1 is normally set _before_ a
>> write. A write with a Write=0 PTE would typically only generate a fault,
>> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
>> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
>> stacks will no longer exhibit this oddity.
> Again, an interesting anecdote but not salient information here.

As much as I like the sound of my own voice (and anecdotes), I agree that this is a bit oblique for the patch. Maybe this anecdote should get banished elsewhere.

The changelog here could definitely get to the point faster.
On Wed, 2022-10-05 at 02:17 +0000, Andrew Cooper wrote: > (flags & PSE|RW|D) == PSE|D; > > R/O+D can exist higher in the paging structures and does not convey > type=shstk-ness to later stages of the walk. Hmm, yes. I guess it would be more correct to check if it's a leaf as well. > > > However, there is a further complication which is bound rear its head > sooner or later, and warrants discussing. > > type=shstk isn't actually only R/O+D on the leaf PTE; its also R/W on > the accumulated access rights on non-leaf PTEs. > > Specifically, if you clear the RW bit on any higher level in the > pagetable, then everything mapped by that PTE ceases to be of type > shstk, even if the leaf has the R/O+D bit combination. > > This is allegedly a feature for the database folks, where they can > create R/O and R/W aliases of the same memory, sharing intermediate > pagetables, where the R/W alias will set D bits per usual and the R/O > alias needs not to transmogrify itself into a shadow stack. Thanks, I somehow missed this corner of the architecture. It looks like this is not an issue for Linux at the moment because non-leaf PTEs should have Write=1. I guess we need to keep this in mind if we ever have Write=0 upper level PTEs though. Maybe a comment around _PAGE_TABLE would be useful.
On Wed, 2022-10-05 at 07:08 -0700, Dave Hansen wrote:
> On 10/4/22 19:17, Andrew Cooper wrote:
> > On 29/09/2022 23:29, Rick Edgecombe wrote:
> > > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> > >
> > > There is essentially no room left in the x86 hardware PTEs on some OSes
> > > (not Linux). That left the hardware architects looking for a way to
> > > represent a new memory type (shadow stack) within the existing bits.
> > > They chose to repurpose a lightly-used state: Write=0,Dirty=1.
> >
> > How does "Some OSes have a greater dependence on software available bits
> > in PTEs than Linux" sound?
> >
> > > The reason it's lightly used is that Dirty=1 is normally set _before_ a
> > > write. A write with a Write=0 PTE would typically only generate a fault,
> > > not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
> > > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
> > > stacks will no longer exhibit this oddity.
> >
> > Again, an interesting anecdote but not salient information here.
>
> As much as I like the sound of my own voice (and anecdotes), I agree
> that this is a bit oblique for the patch. Maybe this anecdote should
> get banished elsewhere.
>
> The changelog here could definitely get to the point faster.

Although this text was inherited, I thought it was useful to disperse any "huh, I wonder why" thoughts that may be lingering in the reader's head as they try to grok the rest of the text. I'll shorten it as suggested. Thanks all.
On Fri, 2022-09-30 at 17:16 +0200, Jann Horn wrote:
> On Fri, Sep 30, 2022 at 12:30 AM Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> > The reason it's lightly used is that Dirty=1 is normally set _before_ a
> > write. A write with a Write=0 PTE would typically only generate a fault,
> > not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the
> > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow
> > stacks will no longer exhibit this oddity.
>
> Stupid question, since I just recently learned that IOMMUv2 is a
> thing: I assume this also holds for IOMMUs that implement IOMMUv2/SVA,
> where the IOMMU directly walks the userspace page tables, and not just
> for the CPU core?

Sorry for the delay, I had to go find out. IOMMU behaves similarly to the CET CPUs in this regard. Thanks for the question.
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> There is essentially no room left in the x86 hardware PTEs on some OSes
> (not Linux). That left the hardware architects looking for a way to
> represent a new memory type (shadow stack) within the existing bits.
> They chose to repurpose a lightly-used state: Write=0,Dirty=1.
>
> The reason it's lightly used is that Dirty=1 is normally set _before_ a
> write. A write with a Write=0 PTE would typically only generate a fault,
> not set Dirty=1. Hardware can (rarely) both set Write=1 *and* generate the

s/Write/Dirty/

> fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports shadow

s/Dirty=0,Write=1/Write=0,Dirty=1/

> stacks will no longer exhibit this oddity.
>
> The kernel should avoid inadvertently creating shadow stack memory because
> it is security sensitive. So given the above, all it needs to do is avoid
> manually crating Write=0,Dirty=1 PTEs in software.

Whichever way around you choose, please be consistent.

> In places where Linux normally creates Write=0,Dirty=1, it can use the
> software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In other
> words, whenever Linux needs to create Write=0,Dirty=1, it instead creates
> Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1. This
> clearly separates shadow stack from other data, and results in the
> following:
>
> (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page.
>     Previously when a typical anonymous writable mapping was made COW via
>     fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead
>     use the Cow bit.
> (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page
>     is in a R/O VMA, and get_user_pages() needs a writable copy. The page
>     fault handler creates a copy of the page and sets the new copy's PTE
>     as Write=0 and Cow=1.
> (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE.
> (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack
>     page is being shared among processes (this happens at fork()), its PTE
>     is made Dirty=0, so the next shadow stack access causes a fault, and
>     the page is duplicated and Dirty=1 is set again. This is the COW
>     equivalent for shadow stack pages, even though it's copy-on-access
>     rather than copy-on-write.
> (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without
>     shadow stack support set Dirty=1.

Please restructure this (and the comment) something like:

(Write=0,Dirty=0,Cow=1):

 - copy_present_pte(): A modified copy-on-write page.
 - ...

(Write=0,Dirty=1,Cow=0):

 - FEATURE_CET: Shadow Stack entry
 - !FEATURE_CET: see the above Cow=1 cases
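
The two-state grouping Peter asks for can be expressed as a small classifier. This is a sketch only: the enum and function names are hypothetical and do not appear in the patch, and `has_shstk` stands in for the FEATURE_CET distinction in the list above.

```c
#include <stdbool.h>

/* Hypothetical names; the thread does not define such an enum. */
enum model_pte_kind {
	MODEL_KIND_OTHER,        /* any combination not covered by the list */
	MODEL_KIND_COW,          /* Write=0,Dirty=0,Cow=1 and its legacy equivalent */
	MODEL_KIND_SHADOW_STACK  /* Write=0,Dirty=1,Cow=0 with shadow stack support */
};

/* Classify a present PTE from its three bits and the CPU feature. */
static enum model_pte_kind model_classify(bool write, bool dirty, bool cow,
					  bool has_shstk)
{
	if (write)
		return MODEL_KIND_OTHER;
	if (!dirty && cow)
		return MODEL_KIND_COW;
	if (dirty && !cow)
		return has_shstk ? MODEL_KIND_SHADOW_STACK : MODEL_KIND_COW;
	return MODEL_KIND_OTHER;
}
```

Cases (a), (b) and (d) from the changelog all take the `!dirty && cow` branch, and cases (c) and (e) differ only in `has_shstk`, which is why the feature flag cannot be dropped from the reading side.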
On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> @@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
>  	return native_make_pte(v & ~clear);
>  }
>
> +/*
> + * Normally the Dirty bit is used to denote COW memory on x86. But

This is misleading; this isn't an x86 specific thing. The core-mm code does this.

> + * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
> + * since the Dirty=1,Write=0 will result in the memory being treated
> + * as shaodw stack by the HW. So when creating COW memory, a software
> + * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
> + * pte_clear_cow() take a PTE marked conventially COW (Dirty=1) and
> + * transition it to the shadow stack compatible version of COW (Cow=1).
> + */
On Fri, 2022-10-14 at 11:41 +0200, Peter Zijlstra wrote: > On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote: > > From: Yu-cheng Yu <yu-cheng.yu@intel.com> > > > > There is essentially no room left in the x86 hardware PTEs on some > > OSes > > (not Linux). That left the hardware architects looking for a way to > > represent a new memory type (shadow stack) within the existing > > bits. > > They chose to repurpose a lightly-used state: Write=0,Dirty=1. > > > > The reason it's lightly used is that Dirty=1 is normally set > > _before_ a > > write. A write with a Write=0 PTE would typically only generate a > > fault, > > not set Dirty=1. Hardware can (rarely) both set Write=1 *and* > > generate the > > s/Write/Dirty/ Oops, yes. > > > fault, resulting in a Dirty=0,Write=1 PTE. Hardware which supports > > shadow > > s/Dirty=0,Write=1/Write=0,Dirty=1/ Ok, I'll scrub the series for the order. > > > stacks will no longer exhibit this oddity. > > > > The kernel should avoid inadvertently creating shadow stack memory > > because > > it is security sensitive. So given the above, all it needs to do is > > avoid > > manually crating Write=0,Dirty=1 PTEs in software. > > Whichever way around you choose, please be consistent. > > > In places where Linux normally creates Write=0,Dirty=1, it can use > > the > > software-defined _PAGE_COW in place of the hardware _PAGE_DIRTY. In > > other > > words, whenever Linux needs to create Write=0,Dirty=1, it instead > > creates > > Write=0,Cow=1 except for shadow stack, which is Write=0,Dirty=1. > > This > > clearly separates shadow stack from other data, and results in the > > following: > > > > (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page. > > Previously when a typical anonymous writable mapping was made > > COW via > > fork(), the kernel would mark it Write=0,Dirty=1. Now it will > > instead > > use the Cow bit. > > (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. 
The > > user page > > is in a R/O VMA, and get_user_pages() needs a writable copy. > > The page > > fault handler creates a copy of the page and sets the new > > copy's PTE > > as Write=0 and Cow=1. > > (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE. > > (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a > > shadow stack > > page is being shared among processes (this happens at fork()), > > its PTE > > is made Dirty=0, so the next shadow stack access causes a > > fault, and > > the page is duplicated and Dirty=1 is set again. This is the > > COW > > equivalent for shadow stack pages, even though it's copy-on- > > access > > rather than copy-on-write. > > (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor > > without > > shadow stack support set Dirty=1. > > Please restureture this (and the comment) something like: > > > (Write=0,Dirty=0,Cow=1): > > - copy_present_pte(): A modified copy-on-write page. > - ... > > > (Write=0,Dirty=1,Cow=0): > > - FEATURE_CET: Shadow Stack entry > - !FEATURE_CET: see the above Cow=1 cases Yes, I incorporated feedback from your earlier comment. Sorry for bad communication.
On Fri, 2022-10-14 at 11:42 +0200, Peter Zijlstra wrote:
> On Thu, Sep 29, 2022 at 03:29:07PM -0700, Rick Edgecombe wrote:
> > @@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
> >  	return native_make_pte(v & ~clear);
> >  }
> >
> > +/*
> > + * Normally the Dirty bit is used to denote COW memory on x86. But
>
> This is misleading; this isn't an x86 specific thing. The core-mm code
> does this.

Well pte_mkdirty() does map to other HW bits on different architectures. But yeah, it's confusing.

Hmm, is this comment a bit stale either way now though? In the past it was probably more accurate to say core MM code used it to "detect" cowed memory. But the GUP pte_dirty() check was changed recently:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5535be3099717646781ce1540cf725965d680e7b

I don't think any code is looking specifically for COWed memory using the PTE dirty bit anymore, it just happens to coincide with it. Double checking my understanding...

Maybe this would be more accurate?

/*
 * Normally COW memory can result in Dirty=1,Write=0 PTEs. But in the
 * case of X86_FEATURE_SHSTK, the software COW bit is used, since the
 * Dirty=1,Write=0 will result in the memory being treated as shadow
 * stack by the HW. So when creating COW memory, a software bit is used
 * _PAGE_BIT_COW. The following functions pte_mkcow() and
 * pte_clear_cow() take a PTE marked conventionally COW (Dirty=1) and
 * transition it to the shadow stack compatible version of COW (Cow=1).
 */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6496ec84b953..ad201dae7316 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
  * The following only work if pte_present() is true.
  * Undefined behaviour if not..
  */
-static inline int pte_dirty(pte_t pte)
+static inline bool pte_dirty(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_DIRTY;
+	return pte_flags(pte) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pte_shstk(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pte_young(pte_t pte)
@@ -134,9 +142,17 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
-static inline int pmd_dirty(pmd_t pmd)
+static inline bool pmd_dirty(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_DIRTY;
+	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
+static inline bool pmd_shstk(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return false;
+
+	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
 }
 
 static inline int pmd_young(pmd_t pmd)
@@ -144,9 +160,9 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
-static inline int pud_dirty(pud_t pud)
+static inline bool pud_dirty(pud_t pud)
 {
-	return pud_flags(pud) & _PAGE_DIRTY;
+	return pud_flags(pud) & _PAGE_DIRTY_BITS;
 }
 
 static inline int pud_young(pud_t pud)
@@ -156,13 +172,21 @@ static inline int pud_young(pud_t pud)
 
 static inline int pte_write(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW. Check for them separately from _PAGE_RW itself.
+	 */
+	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
 }
 
 #define pmd_write pmd_write
 static inline int pmd_write(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_RW;
+	/*
+	 * Shadow stack pages are logically writable, but do not have
+	 * _PAGE_RW. Check for them separately from _PAGE_RW itself.
+	 */
+	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
 }
 
 #define pud_write pud_write
@@ -300,6 +324,44 @@ static inline pte_t pte_clear_flags(pte_t pte, pteval_t clear)
 	return native_make_pte(v & ~clear);
 }
 
+/*
+ * Normally the Dirty bit is used to denote COW memory on x86. But
+ * in the case of X86_FEATURE_SHSTK, the software COW bit is used,
+ * since the Dirty=1,Write=0 will result in the memory being treated
+ * as shaodw stack by the HW. So when creating COW memory, a software
+ * bit is used _PAGE_BIT_COW. The following functions pte_mkcow() and
+ * pte_clear_cow() take a PTE marked conventially COW (Dirty=1) and
+ * transition it to the shadow stack compatible version of COW (Cow=1).
+ */
+
+static inline pte_t pte_mkcow(pte_t pte)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	pte = pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_set_flags(pte, _PAGE_COW);
+}
+
+static inline pte_t pte_clear_cow(pte_t pte)
+{
+	/*
+	 * _PAGE_COW is unnecessary on !X86_FEATURE_SHSTK kernels.
+	 * See the _PAGE_COW definition for more details.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pte;
+
+	/*
+	 * PTE is getting copied-on-write, so it will be dirtied
+	 * if writable, or made shadow stack if shadow stack and
+	 * being copied on access. Set they dirty bit for both
+	 * cases.
+	 */
+	pte = pte_set_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
@@ -319,7 +381,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_DIRTY);
+	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
 }
 
 static inline pte_t pte_mkold(pte_t pte)
@@ -329,7 +391,16 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_wrprotect(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_RW);
+	pte = pte_clear_flags(pte, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pte_dirty(pte))
+		pte = pte_mkcow(pte);
+	return pte;
 }
 
 static inline pte_t pte_mkexec(pte_t pte)
@@ -339,7 +410,19 @@ static inline pte_t pte_mkexec(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pteval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating Dirty=1,Write=0 PTEs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pte_write(pte))
+		dirty = _PAGE_COW;
+
+	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pte_t pte_mkwrite_shstk(pte_t pte)
+{
+	/* pte_clear_cow() also sets Dirty=1 */
+	return pte_clear_cow(pte);
 }
 
 static inline pte_t pte_mkyoung(pte_t pte)
@@ -349,7 +432,12 @@ static inline pte_t pte_mkyoung(pte_t pte)
 
 static inline pte_t pte_mkwrite(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_RW);
+	pte = pte_set_flags(pte, _PAGE_RW);
+
+	if (pte_dirty(pte))
+		pte = pte_clear_cow(pte);
+
+	return pte;
 }
 
 static inline pte_t pte_mkhuge(pte_t pte)
@@ -396,6 +484,26 @@ static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
 	return native_make_pmd(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_mkcow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_set_flags(pmd, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pmd_t pmd_clear_cow(pmd_t pmd)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pmd;
+
+	pmd = pmd_set_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_COW);
+}
+
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
@@ -420,17 +528,36 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 
 static inline pmd_t pmd_mkclean(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
 }
 
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_RW);
+	pmd = pmd_clear_flags(pmd, _PAGE_RW);
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pmd_dirty(pmd))
+		pmd = pmd_mkcow(pmd);
+	return pmd;
 }
 
 static inline pmd_t pmd_mkdirty(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pmdval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pmd_write(pmd))
+		dirty = _PAGE_COW;
+
+	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
+}
+
+static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
+{
+	return pmd_clear_cow(pmd);
 }
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
@@ -450,7 +577,11 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
 
 static inline pmd_t pmd_mkwrite(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_RW);
+	pmd = pmd_set_flags(pmd, _PAGE_RW);
+
+	if (pmd_dirty(pmd))
+		pmd = pmd_clear_cow(pmd);
+	return pmd;
 }
 
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
@@ -467,6 +598,26 @@ static inline pud_t pud_clear_flags(pud_t pud, pudval_t clear)
 	return native_make_pud(v & ~clear);
 }
 
+/* See comments above pte_mkcow() */
+static inline pud_t pud_mkcow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_set_flags(pud, _PAGE_COW);
+}
+
+/* See comments above pte_mkcow() */
+static inline pud_t pud_clear_cow(pud_t pud)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		return pud;
+
+	pud = pud_set_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_COW);
+}
+
 static inline pud_t pud_mkold(pud_t pud)
 {
 	return pud_clear_flags(pud, _PAGE_ACCESSED);
@@ -474,17 +625,32 @@ static inline pud_t pud_mkold(pud_t pud)
 
 static inline pud_t pud_mkclean(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_DIRTY);
+	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
 }
 
 static inline pud_t pud_wrprotect(pud_t pud)
 {
-	return pud_clear_flags(pud, _PAGE_RW);
+	pud = pud_clear_flags(pud, _PAGE_RW);
+
+	/*
+	 * Blindly clearing _PAGE_RW might accidentally create
+	 * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
+	 * dirty value to the software bit.
+	 */
+	if (pud_dirty(pud))
+		pud = pud_mkcow(pud);
+	return pud;
 }
 
 static inline pud_t pud_mkdirty(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+	pudval_t dirty = _PAGE_DIRTY;
+
+	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
+	if (cpu_feature_enabled(X86_FEATURE_SHSTK) && !pud_write(pud))
+		dirty = _PAGE_COW;
+
+	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
 }
 
 static inline pud_t pud_mkdevmap(pud_t pud)
@@ -504,7 +670,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
 
 static inline pud_t pud_mkwrite(pud_t pud)
 {
-	return pud_set_flags(pud, _PAGE_RW);
+	pud = pud_set_flags(pud, _PAGE_RW);
+
+	if (pud_dirty(pud))
+		pud = pud_clear_cow(pud);
+	return pud;
 }
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ff82237e7b6b..85d88c0f9618 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -21,7 +21,8 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ -#define _PAGE_BIT_SOFTW4 58 /* available for programmer */ +#define _PAGE_BIT_SOFTW4 57 /* available for programmer */ +#define _PAGE_BIT_SOFTW5 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ #define _PAGE_BIT_PKEY_BIT2 61 /* Protection Keys, bit 3/4 */ @@ -34,6 +35,15 @@ #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */ #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 +/* + * Indicates a copy-on-write page. + */ +#ifdef CONFIG_X86_SHADOW_STACK +#define _PAGE_BIT_COW _PAGE_BIT_SOFTW5 /* copy-on-write */ +#else +#define _PAGE_BIT_COW 0 +#endif + /* If _PAGE_BIT_PRESENT is clear, we use these: */ /* - if the user mapped it with PROT_NONE; pte_present gives true */ #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL @@ -117,6 +127,36 @@ #define _PAGE_SOFTW4 (_AT(pteval_t, 0)) #endif +/* + * The hardware requires shadow stack to be read-only and Dirty. + * _PAGE_COW is a software-only bit used to separate copy-on-write PTEs + * from shadow stack PTEs: + * (a) (Write=0,Cow=1,Dirty=0) A modified, copy-on-write (COW) page. + * Previously when a typical anonymous writable mapping was made COW via + * fork(), the kernel would mark it Write=0,Dirty=1. Now it will instead + * use the Cow bit. + * (b) (Write=0,Cow=1,Dirty=0) A R/O page that has been COW'ed. The user page + * is in a R/O VMA, and get_user_pages() needs a writable copy. The page + * fault handler creates a copy of the page and sets the new copy's PTE + * as Write=0 and Cow=1. + * (c) (Write=0,Cow=0,Dirty=1) A shadow stack PTE. + * (d) (Write=0,Cow=1,Dirty=0) A shared shadow stack PTE. When a shadow stack + * page is being shared among processes (this happens at fork()), its PTE + * is made Dirty=0, so the next shadow stack access causes a fault, and + * the page is duplicated and Dirty=1 is set again. 
This is the COW + * equivalent for shadow stack pages, even though it's copy-on-access + * rather than copy-on-write. + * (e) (Write=0,Cow=0,Dirty=1) A Cow PTE created when a processor without + * shadow stack support set Dirty=1. + */ +#ifdef CONFIG_X86_SHADOW_STACK +#define _PAGE_COW (_AT(pteval_t, 1) << _PAGE_BIT_COW) +#else +#define _PAGE_COW (_AT(pteval_t, 0)) +#endif + +#define _PAGE_DIRTY_BITS (_PAGE_DIRTY | _PAGE_COW) + #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) /*