mbox series

[RFC,0/9] PKS write protected page tables

Message ID 20210505003032.489164-1-rick.p.edgecombe@intel.com (mailing list archive)
Headers show
Series PKS write protected page tables | expand

Message

Rick Edgecombe May 5, 2021, 12:30 a.m. UTC
This is a POC for write protecting page tables with PKS (Protection Keys for 
Supervisor) [1]. The basic idea is to make the page tables read only, except 
temporarily on a per-cpu basis when they need to be modified. I’m looking for 
opinions on whether people like the general direction of this in terms of 
value and implementation.

Why would people want this?
===========================
Page tables are the basis for many types of protections and as such, are a 
juicy target for attackers. Mapping them read-only will make them harder to 
use in attacks.

This protects against an attacker that has acquired the ability to write to 
the page tables. It's not foolproof because an attacker who can execute 
arbitrary code can either disable PKS directly, or simply call the same 
functions that the kernel uses for legitimate page table writes.

Why use PKS for this?
=====================
PKS is an upcoming CPU feature that allows supervisor virtual memory 
permissions to be changed without flushing the TLB, like PKU does for user 
memory. Protecting page tables would normally be really expensive because you 
would have to do it with paging itself. PKS helps by providing a way to toggle 
the writability of the page tables with just a per-cpu MSR.

Performance impacts
===================
Setting direct map permissions on whatever random page gets allocated for a 
page table would result in a lot of kernel range shootdowns and direct map 
large page shattering. So the way the PKS page table memory is created is 
similar to this module page clustering series[2], where a cache of pages is 
replenished from 2MB pages such that the direct map permissions and associated 
breakage is localized on the direct map. In the PKS page tables case, a PKS 
key is pre-applied to the direct map for pages in the cache.

There would be some costs of memory overhead in order to protect the direct 
map page tables. There would also be some extra kernel range shootdowns to 
replenish the cache on occasion, from setting the PKS key on the direct map of 
the new pages. I don’t have any actual performance data yet.

This is based on V6 [1] of the core PKS infrastructure patches. PKS 
infrastructure follow-on’s are planned to enable keys to be set to the same 
permissions globally. Since this usage needs a key to be set globally 
read-only by default, a small temporary solution is hacked up in patch 8. Long 
term, PKS protected page tables would use a better and more generic solution 
to achieve this.

[1]
https://lore.kernel.org/lkml/20210401225833.566238-1-ira.weiny@intel.com/
[2]
https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com
/

Thanks,

Rick


Rick Edgecombe (9):
  list: Support getting most recent element in list_lru
  list: Support list head not in object for list_lru
  x86/mm/cpa: Add grouped page allocations
  mm: Explicitly zero page table lock ptr
  x86, mm: Use cache of page tables
  x86/mm/cpa: Add set_memory_pks()
  x86/mm/cpa: Add perm callbacks to grouped pages
  x86, mm: Protect page tables with PKS
  x86, cpa: PKS protect direct map page tables

 arch/x86/boot/compressed/ident_map_64.c |   5 +
 arch/x86/include/asm/pgalloc.h          |   6 +
 arch/x86/include/asm/pgtable.h          |  26 +-
 arch/x86/include/asm/pgtable_64.h       |  33 ++-
 arch/x86/include/asm/pkeys_common.h     |   8 +-
 arch/x86/include/asm/set_memory.h       |  23 ++
 arch/x86/mm/init.c                      |  40 +++
 arch/x86/mm/pat/set_memory.c            | 312 +++++++++++++++++++++++-
 arch/x86/mm/pgtable.c                   | 144 ++++++++++-
 include/asm-generic/pgalloc.h           |  42 +++-
 include/linux/list_lru.h                |  26 ++
 include/linux/mm.h                      |   7 +
 mm/Kconfig                              |   6 +-
 mm/list_lru.c                           |  38 ++-
 mm/memory.c                             |   1 +
 mm/swap.c                               |   7 +
 mm/swap_state.c                         |   6 +
 17 files changed, 705 insertions(+), 25 deletions(-)

Comments

Ira Weiny May 5, 2021, 2:03 a.m. UTC | #1
On Tue, May 04, 2021 at 05:30:23PM -0700, Rick Edgecombe wrote:
> 
> This is based on V6 [1] of the core PKS infrastructure patches. PKS 
> infrastructure follow-on’s are planned to enable keys to be set to the same 
> permissions globally. Since this usage needs a key to be set globally 
> read-only by default, a small temporary solution is hacked up in patch 8. Long 
> term, PKS protected page tables would use a better and more generic solution 
> to achieve this.

Before you send this out I've been thinking about this more and I think I would
prefer you not call this 'globally' setting the key.  Because you don't really
want to be able to update the key globally like I originally suggested for
kmap().  What is required is to set a different default for the key which gets
used by all threads by 'default'.

What is really missing is how to get the default changed after it may have been
used by some threads...  thus the 'global' nature...  Perhaps I am picking nits
here but I think it may go over better with Thomas and the maintainers.  Or
maybe not...  :-)

Would it be too much trouble to call this a 'default' change?  Because that is
really what you implement?

Ira
Kees Cook May 5, 2021, 6:25 a.m. UTC | #2
On Tue, May 04, 2021 at 05:30:23PM -0700, Rick Edgecombe wrote:
> This is a POC for write protecting page tables with PKS (Protection Keys for 
> Supervisor) [1]. The basic idea is to make the page tables read only, except 
> temporarily on a per-cpu basis when they need to be modified. I’m looking for 
> opinions on whether people like the general direction of this in terms of 
> value and implementation.

Yay!

> Why would people want this?
> ===========================
> Page tables are the basis for many types of protections and as such, are a 
> juicy target for attackers. Mapping them read-only will make them harder to 
> use in attacks.
> 
> This protects against an attacker that has acquired the ability to write to 
> the page tables. It's not foolproof because an attacker who can execute 
> arbitrary code can either disable PKS directly, or simply call the same 
> functions that the kernel uses for legitimate page table writes.

I think it absolutely has value. The exploit techniques I'm aware of that
target the page table are usually attempting to upgrade an arbitrary
write into execution (e.g. write to kernel text after setting kernel
text writable in the page table) or similar "data only" attacks (make
sensitive page writable).

It looks like PKS-protected page tables would be much like the
RO-protected text pages in the sense that there is already code in
the kernel to do things to make it writable, change text, and set it
read-only again (alternatives, ftrace, etc). That said, making the PKS
manipulation code be inline to page-writing code would make it less
available for ROP/JOP, if an attack DID want to go that route.

> Why use PKS for this?
> =====================
> PKS is an upcoming CPU feature that allows supervisor virtual memory 
> permissions to be changed without flushing the TLB, like PKU does for user 
> memory. Protecting page tables would normally be really expensive because you 
> would have to do it with paging itself. PKS helps by providing a way to toggle 
> the writability of the page tables with just a per-cpu MSR.

The per-cpu-ness is really important for both performance and for avoiding
temporal attacks where an arbitrary write in one CPU is timed against
a page table write in another CPU.

> Performance impacts
> ===================
> Setting direct map permissions on whatever random page gets allocated for a 
> page table would result in a lot of kernel range shootdowns and direct map 
> large page shattering. So the way the PKS page table memory is created is 
> similar to this module page clustering series[2], where a cache of pages is 
> replenished from 2MB pages such that the direct map permissions and associated 
> breakage is localized on the direct map. In the PKS page tables case, a PKS 
> key is pre-applied to the direct map for pages in the cache.
> 
> There would be some costs of memory overhead in order to protect the direct 
> map page tables. There would also be some extra kernel range shootdowns to 
> replenish the cache on occasion, from setting the PKS key on the direct map of 
> the new pages. I don’t have any actual performance data yet.

What CPU models are expected to have PKS?

> This is based on V6 [1] of the core PKS infrastructure patches. PKS 
> infrastructure follow-on’s are planned to enable keys to be set to the same 
> permissions globally. Since this usage needs a key to be set globally 
> read-only by default, a small temporary solution is hacked up in patch 8. Long 
> term, PKS protected page tables would use a better and more generic solution 
> to achieve this.
> 
> [1] https://lore.kernel.org/lkml/20210401225833.566238-1-ira.weiny@intel.com/

Ah, neat!

> [2] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/

Ooh. What does this do for performance? It sounds like less TLB
pressure, IIUC?
Peter Zijlstra May 5, 2021, 8:37 a.m. UTC | #3
On Tue, May 04, 2021 at 11:25:31PM -0700, Kees Cook wrote:

> It looks like PKS-protected page tables would be much like the
> RO-protected text pages in the sense that there is already code in
> the kernel to do things to make it writable, change text, and set it
> read-only again (alternatives, ftrace, etc).

We don't actually modify text by changing the mapping at all. We modify
through a writable (but not executable) temporary alias on the page (on
x86).

Once a mapping is RX it will *never* be writable again (until we tear it
all down).
Vlastimil Babka May 5, 2021, 11:08 a.m. UTC | #4
On 5/5/21 2:30 AM, Rick Edgecombe wrote:
> This is a POC for write protecting page tables with PKS (Protection Keys for 
> Supervisor) [1]. The basic idea is to make the page tables read only, except 
> temporarily on a per-cpu basis when they need to be modified. I’m looking for 
> opinions on whether people like the general direction of this in terms of 
> value and implementation.
> 
> Why would people want this?
> ===========================
> Page tables are the basis for many types of protections and as such, are a 
> juicy target for attackers. Mapping them read-only will make them harder to 
> use in attacks.
> 
> This protects against an attacker that has acquired the ability to write to 
> the page tables. It's not foolproof because an attacker who can execute 
> arbitrary code can either disable PKS directly, or simply call the same 
> functions that the kernel uses for legitimate page table writes.

Yeah, it's a good idea. I've once used a similar approach locally during
debugging a problem that appeared to be stray writes hitting page tables, and
without PKS I indeed made the whole pages read-only when not touched by the
designated code.

> Why use PKS for this?
> =====================
> PKS is an upcoming CPU feature that allows supervisor virtual memory 
> permissions to be changed without flushing the TLB, like PKU does for user 
> memory. Protecting page tables would normally be really expensive because you 
> would have to do it with paging itself. PKS helps by providing a way to toggle 
> the writability of the page tables with just a per-cpu MSR.

I can see in patch 8/9 that you are flipping the MSR around individual
operations on page table entries. In my patch I hooked making the page table
writable to obtaining the page table lock (IIRC I had only the PTE level fully
handled though). Wonder if that would be better tradeoff even for your MSR approach?

Vlastimil

> Performance impacts
> ===================
> Setting direct map permissions on whatever random page gets allocated for a 
> page table would result in a lot of kernel range shootdowns and direct map 
> large page shattering. So the way the PKS page table memory is created is 
> similar to this module page clustering series[2], where a cache of pages is 
> replenished from 2MB pages such that the direct map permissions and associated 
> breakage is localized on the direct map. In the PKS page tables case, a PKS 
> key is pre-applied to the direct map for pages in the cache.
> 
> There would be some costs of memory overhead in order to protect the direct 
> map page tables. There would also be some extra kernel range shootdowns to 
> replenish the cache on occasion, from setting the PKS key on the direct map of 
> the new pages. I don’t have any actual performance data yet.
> 
> This is based on V6 [1] of the core PKS infrastructure patches. PKS 
> infrastructure follow-on’s are planned to enable keys to be set to the same 
> permissions globally. Since this usage needs a key to be set globally 
> read-only by default, a small temporary solution is hacked up in patch 8. Long 
> term, PKS protected page tables would use a better and more generic solution 
> to achieve this.
> 
> [1]
> https://lore.kernel.org/lkml/20210401225833.566238-1-ira.weiny@intel.com/
> [2]
> https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com
> /
> 
> Thanks,
> 
> Rick
> 
> 
> Rick Edgecombe (9):
>   list: Support getting most recent element in list_lru
>   list: Support list head not in object for list_lru
>   x86/mm/cpa: Add grouped page allocations
>   mm: Explicitly zero page table lock ptr
>   x86, mm: Use cache of page tables
>   x86/mm/cpa: Add set_memory_pks()
>   x86/mm/cpa: Add perm callbacks to grouped pages
>   x86, mm: Protect page tables with PKS
>   x86, cpa: PKS protect direct map page tables
> 
>  arch/x86/boot/compressed/ident_map_64.c |   5 +
>  arch/x86/include/asm/pgalloc.h          |   6 +
>  arch/x86/include/asm/pgtable.h          |  26 +-
>  arch/x86/include/asm/pgtable_64.h       |  33 ++-
>  arch/x86/include/asm/pkeys_common.h     |   8 +-
>  arch/x86/include/asm/set_memory.h       |  23 ++
>  arch/x86/mm/init.c                      |  40 +++
>  arch/x86/mm/pat/set_memory.c            | 312 +++++++++++++++++++++++-
>  arch/x86/mm/pgtable.c                   | 144 ++++++++++-
>  include/asm-generic/pgalloc.h           |  42 +++-
>  include/linux/list_lru.h                |  26 ++
>  include/linux/mm.h                      |   7 +
>  mm/Kconfig                              |   6 +-
>  mm/list_lru.c                           |  38 ++-
>  mm/memory.c                             |   1 +
>  mm/swap.c                               |   7 +
>  mm/swap_state.c                         |   6 +
>  17 files changed, 705 insertions(+), 25 deletions(-)
>
Peter Zijlstra May 5, 2021, 11:56 a.m. UTC | #5
On Wed, May 05, 2021 at 01:08:35PM +0200, Vlastimil Babka wrote:
> On 5/5/21 2:30 AM, Rick Edgecombe wrote:

> > Why use PKS for this?
> > =====================
> > PKS is an upcoming CPU feature that allows supervisor virtual memory 
> > permissions to be changed without flushing the TLB, like PKU does for user 
> > memory. Protecting page tables would normally be really expensive because you 
> > would have to do it with paging itself. PKS helps by providing a way to toggle 
> > the writability of the page tables with just a per-cpu MSR.
> 
> I can see in patch 8/9 that you are flipping the MSR around individual
> operations on page table entries. In my patch I hooked making the page table
> writable to obtaining the page table lock (IIRC I had only the PTE level fully
> handled though). Wonder if that would be better tradeoff even for your MSR approach?

There's also the HIGHPTE code we could abuse to kmap an alias while
we're at it.
Kees Cook May 5, 2021, 6:38 p.m. UTC | #6
On Wed, May 05, 2021 at 10:37:29AM +0200, Peter Zijlstra wrote:
> On Tue, May 04, 2021 at 11:25:31PM -0700, Kees Cook wrote:
> 
> > It looks like PKS-protected page tables would be much like the
> > RO-protected text pages in the sense that there is already code in
> > the kernel to do things to make it writable, change text, and set it
> > read-only again (alternatives, ftrace, etc).
> 
> We don't actually modify text by changing the mapping at all. We modify
> through a writable (but not executable) temporary alias on the page (on
> x86).
> 
> Once a mapping is RX it will *never* be writable again (until we tear it
> all down).

Yes, quite true. I was trying to answer the concern about "is it okay
that there is a routine in the kernel that can write to page tables
(via temporary disabling of PKS)?" by saying "yes, this is fine -- we
already have similar routines in the kernel that bypass memory
protections, and that's okay because the defense is primarily about
blocking flaws that allow attacker-controlled writes to be used to
leverage greater control over kernel state, of which the page tables are
pretty central. :)
Rick Edgecombe May 5, 2021, 7:46 p.m. UTC | #7
On Wed, 2021-05-05 at 13:56 +0200, Peter Zijlstra wrote:
> On Wed, May 05, 2021 at 01:08:35PM +0200, Vlastimil Babka wrote:
> > On 5/5/21 2:30 AM, Rick Edgecombe wrote:
> 
> > > Why use PKS for this?
> > > =====================
> > > PKS is an upcoming CPU feature that allows supervisor virtual
> > > memory 
> > > permissions to be changed without flushing the TLB, like PKU does
> > > for user 
> > > memory. Protecting page tables would normally be really expensive
> > > because you 
> > > would have to do it with paging itself. PKS helps by providing a
> > > way to toggle 
> > > the writability of the page tables with just a per-cpu MSR.
> > 
> > I can see in patch 8/9 that you are flipping the MSR around
> > individual
> > operations on page table entries. In my patch I hooked making the
> > page table
> > writable to obtaining the page table lock (IIRC I had only the PTE
> > level fully
> > handled though). Wonder if that would be better tradeoff even for
> > your MSR approach?
> 
Hmm, I see, that could reduce the sprinkling of the enable/disable
calls. It seems some(most?) of the kernel address space page table
modification don't take the page table locks though so those callers
would have to call something else to be able to write.

> There's also the HIGHPTE code we could abuse to kmap an alias while
> we're at it.

For a non-PKS debug feature?

It might fit pretty easily into CONFIG_DEBUG_PAGEALLOC on top of this
series. enable/disable_pgtable_write() could take a pointer that is
ignored in PKS mode, but triggers a cpa call in a
CONFIG_DEBUG_PAGETABLE_WRITE mode.
Rick Edgecombe May 5, 2021, 7:51 p.m. UTC | #8
On Tue, 2021-05-04 at 23:25 -0700, Kees Cook wrote:
> > infrastructure follow-on’s are planned to enable keys to be set to
> > the same 
> > permissions globally. Since this usage needs a key to be set globally
> > read-only by default, a small temporary solution is hacked up in
> > patch 8. Long 
> > term, PKS protected page tables would use a better and more generic
> > solution 
> > to achieve this.
> > 
> > [1]
> > https://lore.kernel.org/lkml/20210401225833.566238-1-ira.weiny@intel.com/
> 
> Ah, neat!
> 
> > [2]
> > https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/
> 
> Ooh. What does this do for performance? It sounds like less TLB
> pressure, IIUC?

Yea, less TLB pressure, faster page table walks in theory. There was
some testing that showed having all 4k pages was bad for performance:
https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

I'm not sure exactly how much breakage is needed before problems start
to show up, but there was also someone posting that large amounts of
tracing was noticeable for their workload.
Ira Weiny May 6, 2021, midnight UTC | #9
On Tue, May 04, 2021 at 11:25:31PM -0700, Kees Cook wrote:
> On Tue, May 04, 2021 at 05:30:23PM -0700, Rick Edgecombe wrote:
> 
> > Performance impacts
> > ===================
> > Setting direct map permissions on whatever random page gets allocated for a 
> > page table would result in a lot of kernel range shootdowns and direct map 
> > large page shattering. So the way the PKS page table memory is created is 
> > similar to this module page clustering series[2], where a cache of pages is 
> > replenished from 2MB pages such that the direct map permissions and associated 
> > breakage is localized on the direct map. In the PKS page tables case, a PKS 
> > key is pre-applied to the direct map for pages in the cache.
> > 
> > There would be some costs of memory overhead in order to protect the direct 
> > map page tables. There would also be some extra kernel range shootdowns to 
> > replenish the cache on occasion, from setting the PKS key on the direct map of 
> > the new pages. I don’t have any actual performance data yet.
> 
> What CPU models are expected to have PKS?


Supervisor Memory Protection Keys (PKS) is a feature which is found on Intel’s
Sapphire Rapids (and later) “Scalable Processor” Server CPUs.  It will also be
available in future non-server Intel parts.

Also qemu has some support as well.

https://www.qemu.org/2021/04/30/qemu-6-0-0/

Ira