diff mbox series

[v2,08/39] x86/mm: Remove _PAGE_DIRTY from kernel RO pages

Message ID 20220929222936.14584-9-rick.p.edgecombe@intel.com (mailing list archive)
State New
Headers show
Series Shadowstacks for userspace | expand

Commit Message

Rick Edgecombe Sept. 29, 2022, 10:29 p.m. UTC
From: Yu-cheng Yu <yu-cheng.yu@intel.com>

Processors sometimes directly create Write=0,Dirty=1 PTEs. These PTEs are
created by software. One such case is that kernel read-only pages are
historically set up as Dirty.

New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
shadow stack pages. When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some
instructions can write to such supervisor memory. The kernel does not set
IA32_S_CET.SH_STK_EN, but to reduce ambiguity between shadow stack and
regular Write=0 pages, removed Dirty=1 from any kernel Write=0 PTEs.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>

---

v2:
 - Normalize PTE bit descriptions between patches

 arch/x86/include/asm/pgtable_types.h | 6 +++---
 arch/x86/mm/pat/set_memory.c         | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

Comments

Kirill A . Shutemov Oct. 3, 2022, 2:17 p.m. UTC | #1
On Thu, Sep 29, 2022 at 03:29:05PM -0700, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> 
> Processors sometimes directly create Write=0,Dirty=1 PTEs. These PTEs are
> created by software. One such case is that kernel read-only pages are
> historically set up as Dirty.
> 
> New processors that support Shadow Stack regard Write=0,Dirty=1 PTEs as
> shadow stack pages. When CR4.CET=1 and IA32_S_CET.SH_STK_EN=1, some
> instructions can write to such supervisor memory. The kernel does not set
> IA32_S_CET.SH_STK_EN, but to reduce ambiguity between shadow stack and
> regular Write=0 pages, removed Dirty=1 from any kernel Write=0 PTEs.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> 
> ---
> 
> v2:
>  - Normalize PTE bit descriptions between patches
> 
>  arch/x86/include/asm/pgtable_types.h | 6 +++---
>  arch/x86/mm/pat/set_memory.c         | 2 +-
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index aa174fed3a71..ff82237e7b6b 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -192,10 +192,10 @@ enum page_cache_mode {
>  #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
>  #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
>  #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
> -#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
> -#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
> +#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
> +#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
>  #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
> -#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
> +#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
>  #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
>  #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
>  #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 1abd5438f126..ed9193b469ba 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -1977,7 +1977,7 @@ int set_memory_nx(unsigned long addr, int numpages)
>  
>  int set_memory_ro(unsigned long addr, int numpages)
>  {
> -	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
> +	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
>  }

Hm. Do we need to modify also *_wrprotect() helpers to clear dirty bit?

I guess not (at least without a lot of audit), as we risk loosing dirty
bit on page cache pages. But why is it safe? Do we only care about about
kernel PTEs here? Userspace Write=0,Dirty=1 PTEs handled as before?

>  int set_memory_rw(unsigned long addr, int numpages)
> -- 
> 2.17.1
>
Andrew Cooper Oct. 5, 2022, 1:31 a.m. UTC | #2
On 29/09/2022 23:29, Rick Edgecombe wrote:
> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>
> Processors sometimes directly create Write=0,Dirty=1 PTEs.

Do they? (Rhetorical)

Yes, this is a relevant anecdote for why CET isn't available on pre-TGL
parts, but it one of the more wrong things to have as the first sentence
of this commit message.

The point you want to express is that under the CET-SS spec, R/O+Dirty
has a new meaning as type=shstk, so stop using this bit combination for
existing mappings.

I'm not even sure it's relevant to note that CET capable processors can
set D on a R/O mapping, because that depends on !CR0.WP which in turn
prohibits CR4.CET being enabled.

~Andrew
Peter Zijlstra Oct. 5, 2022, 11:16 a.m. UTC | #3
On Wed, Oct 05, 2022 at 01:31:28AM +0000, Andrew Cooper wrote:
> On 29/09/2022 23:29, Rick Edgecombe wrote:
> > From: Yu-cheng Yu <yu-cheng.yu@intel.com>
> >
> > Processors sometimes directly create Write=0,Dirty=1 PTEs.
> 
> Do they? (Rhetorical)
> 
> Yes, this is a relevant anecdote for why CET isn't available on pre-TGL
> parts, but it one of the more wrong things to have as the first sentence
> of this commit message.
> 
> The point you want to express is that under the CET-SS spec, R/O+Dirty
> has a new meaning as type=shstk, so stop using this bit combination for
> existing mappings.
> 
> I'm not even sure it's relevant to note that CET capable processors can
> set D on a R/O mapping, because that depends on !CR0.WP which in turn
> prohibits CR4.CET being enabled.

Whilst I agree that the Changelog is 'suboptimal' -- I do think it might
be good to mention how we ended up at the current state where we
explicitly set this non-sensical W=0,D=1 state.

Looking at the git history this seems to be a bit of a hysterical
accident, not something done on purpose to 'optimize' for these weird
CPUs.
Andrew Cooper Oct. 5, 2022, 12:34 p.m. UTC | #4
On 05/10/2022 12:16, Peter Zijlstra wrote:
> On Wed, Oct 05, 2022 at 01:31:28AM +0000, Andrew Cooper wrote:
>> On 29/09/2022 23:29, Rick Edgecombe wrote:
>>> From: Yu-cheng Yu <yu-cheng.yu@intel.com>
>>>
>>> Processors sometimes directly create Write=0,Dirty=1 PTEs.
>> Do they? (Rhetorical)
>>
>> Yes, this is a relevant anecdote for why CET isn't available on pre-TGL
>> parts, but it one of the more wrong things to have as the first sentence
>> of this commit message.
>>
>> The point you want to express is that under the CET-SS spec, R/O+Dirty
>> has a new meaning as type=shstk, so stop using this bit combination for
>> existing mappings.
>>
>> I'm not even sure it's relevant to note that CET capable processors can
>> set D on a R/O mapping, because that depends on !CR0.WP which in turn
>> prohibits CR4.CET being enabled.
> Whilst I agree that the Changelog is 'suboptimal' -- I do think it might
> be good to mention how we ended up at the current state where we
> explicitly set this non-sensical W=0,D=1 state.

Sure, but that's got nothing to do with hardware errata.

Having hardware set A/D bits is expensive.  Being a locked operation,
it's roughly a smp_mb() behind the scenes.

Therefore, when A/D tracking doesn't matter, traditional wisdom says set
both of them when creating the PTE.

It's only now that R/O+Dirty has a meaning (other than being a slightly
weird but safe bit combination), and we've got to be more careful about
using it.

~Andrew
diff mbox series

Patch

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index aa174fed3a71..ff82237e7b6b 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -192,10 +192,10 @@  enum page_cache_mode {
 #define _KERNPG_TABLE		 (__PP|__RW|   0|___A|   0|___D|   0|   0| _ENC)
 #define _PAGE_TABLE_NOENC	 (__PP|__RW|_USR|___A|   0|___D|   0|   0)
 #define _PAGE_TABLE		 (__PP|__RW|_USR|___A|   0|___D|   0|   0| _ENC)
-#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|___D|   0|___G)
-#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|___D|   0|___G)
+#define __PAGE_KERNEL_RO	 (__PP|   0|   0|___A|__NX|   0|   0|___G)
+#define __PAGE_KERNEL_ROX	 (__PP|   0|   0|___A|   0|   0|   0|___G)
 #define __PAGE_KERNEL_NOCACHE	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __NC)
-#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|___D|   0|___G)
+#define __PAGE_KERNEL_VVAR	 (__PP|   0|_USR|___A|__NX|   0|   0|___G)
 #define __PAGE_KERNEL_LARGE	 (__PP|__RW|   0|___A|__NX|___D|_PSE|___G)
 #define __PAGE_KERNEL_LARGE_EXEC (__PP|__RW|   0|___A|   0|___D|_PSE|___G)
 #define __PAGE_KERNEL_WP	 (__PP|__RW|   0|___A|__NX|___D|   0|___G| __WP)
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1abd5438f126..ed9193b469ba 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1977,7 +1977,7 @@  int set_memory_nx(unsigned long addr, int numpages)
 
 int set_memory_ro(unsigned long addr, int numpages)
 {
-	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
+	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW | _PAGE_DIRTY), 0);
 }
 
 int set_memory_rw(unsigned long addr, int numpages)