
[RFC,02/17] arm64: place kernel in its own L0 page table entry

Message ID E1q3zxo-00A5Js-9E@rmk-PC.armlinux.org.uk (mailing list archive)
State New, archived
Series arm64 kernel text replication

Commit Message

Russell King (Oracle) May 30, 2023, 2:04 p.m. UTC
Kernel text replication needs to maintain separate per-node page
tables for the kernel text. In order to do this without affecting
other kernel memory mappings, it is desirable to place the kernel
such that it does not share an L0 page table entry with any other
mapping.

Prior to this commit, the layout without KASLR was:

+----------+
|  vmalloc |
+----------+
|  Kernel  |
+----------+ MODULES_END, VMALLOC_START, KIMAGE_VADDR =
|  Modules |                 MODULES_VADDR + MODULES_VSIZE
+----------+ MODULES_VADDR = _PAGE_END(VA_BITS_MIN)
| VA space |
+----------+ 0

This becomes:

+----------+
|  vmalloc |
+----------+ VMALLOC_START = MODULES_END + PGDIR_SIZE
|  Kernel  |
+----------+ MODULES_END, KIMAGE_VADDR = _PAGE_END(VA_BITS_MIN) + PGDIR_SIZE
|  Modules |
+----------+ MODULES_VADDR = MODULES_END - MODULES_VSIZE
| VA space |
+----------+ 0

This assumes MODULES_VSIZE (128M) <= PGDIR_SIZE.
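
For example, with 4K pages and VA_BITS_MIN = 48 (so PGDIR_SIZE = 512GiB),
the new layout works out to the following illustrative values:

	_PAGE_END(48) = 0xffff800000000000
	KIMAGE_VADDR  = 0xffff800000000000 + 0x0000008000000000 = 0xffff808000000000
	MODULES_END   = 0xffff808000000000
	MODULES_VADDR = 0xffff808000000000 - SZ_128M = 0xffff807ff8000000

i.e. the kernel image starts exactly one PGD entry above the end of the
linear map, with the modules tucked into the top of that reserved entry.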

One side effect of this change is that KIMAGE_VADDR's definition now
includes PGDIR_SIZE (to leave room for the modules), but PGDIR_SIZE is
not yet defined when asm/memory.h is included. This means KIMAGE_VADDR
cannot be used in inline functions within this file, so we convert
kaslr_offset() and kaslr_enabled() to macros instead.
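
To illustrate the problem (sketch only), the old inline form would now
fail to build, because PGDIR_SIZE comes from asm/pgtable-hwdef.h and is
not visible at this point in asm/memory.h:

static inline unsigned long kaslr_offset(void)
{
	/*
	 * KIMAGE_VADDR now expands to _PAGE_END(VA_BITS_MIN) + PGDIR_SIZE,
	 * and PGDIR_SIZE is not declared here, so this fails to compile.
	 */
	return kimage_vaddr - KIMAGE_VADDR;
}

A macro, by contrast, only expands at its points of use, where
PGDIR_SIZE is in scope:

#define kaslr_offset()	((unsigned long)(kimage_vaddr - KIMAGE_VADDR))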

Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
---
 arch/arm64/include/asm/memory.h  | 26 ++++++++++----------------
 arch/arm64/include/asm/pgtable.h |  2 +-
 arch/arm64/mm/mmu.c              |  3 ++-
 3 files changed, 13 insertions(+), 18 deletions(-)

Comments

Russell King (Oracle) June 12, 2023, 3:04 p.m. UTC | #1
On Mon, Jun 12, 2023 at 12:14:54PM +0100, Mark Rutland wrote:
> Hi Russell,
> 
> On Tue, May 30, 2023 at 03:04:40PM +0100, Russell King (Oracle) wrote:
> > Kernel text replication needs to maintain separate per-node page
> > tables for the kernel text. In order to do this without affecting
> > other kernel memory mappings, placing the kernel such that it does
> > not share a L0 page table entry with any other mapping is desirable.
> > 
> > Prior to this commit, the layout without KASLR was:
> > 
> > +----------+
> > |  vmalloc |
> > +----------+
> > |  Kernel  |
> > +----------+ MODULES_END, VMALLOC_START, KIMAGE_VADDR =
> > |  Modules |                 MODULES_VADDR + MODULES_VSIZE
> > +----------+ MODULES_VADDR = _PAGE_END(VA_BITS_MIN)
> > | VA space |
> > +----------+ 0
> > 
> > This becomes:
> > 
> > +----------+
> > |  vmalloc |
> > +----------+ VMALLOC_START = MODULES_END + PGDIR_SIZE
> > |  Kernel  |
> > +----------+ MODULES_END, KIMAGE_VADDR = _PAGE_END(VA_BITS_MIN) + PGDIR_SIZE
> > |  Modules |
> > +----------+ MODULES_VADDR = MODULES_END - MODULES_VSIZE
> > | VA space |
> > +----------+ 0
> 
> With KASLR we may randomize the kernel and module space over a substantial
> portion of the vmalloc range. Are you expecting that text replication is going
> to restrict that range, or that we'd make it mutually exclusive with KASLR?

In the patch that adds the REPLICATE_KTEXT config option, I've made it
mutually exclusive with RANDOMIZE_BASE, but this change in layout isn't
dependent on REPLICATE_KTEXT.

I've tested it with RANDOMIZE_BASE=y, and nothing seems to get upset,
so I believe this patch doesn't introduce any regressions.

> I also note that the L0 table could have as few as two entries (with 16K pages
> and 4 levels). So either we'd need to also mess with an L1 table, or make text
> replication mutually exclusive with such configurations.

Ah, thanks for pointing that out - I was hoping to avoid needing
to touch anything but L0 tables.

However, it brings up a question: are there any NUMA systems that would
have just two entries in the L0 table? I suspect NUMA systems have lots
of RAM, and so would want a page table layout that results in multiple
L0 entries.
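
For reference, a rough sketch of the arithmetic for that worst case,
assuming I have the shifts right:

	/* 16K pages, 4 levels, VA_BITS = 48 */
	PGDIR_SHIFT  = (PAGE_SHIFT - 3) * 4 + 3 = (14 - 3) * 4 + 3 = 47
	PTRS_PER_PGD = 1 << (VA_BITS - PGDIR_SHIFT) = 1 << 1 = 2

so each L0 entry covers 128TiB, and giving the kernel its own entry
would consume half of the TTBR1 address space.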

> > This assumes MODULES_VSIZE (128M) <= PGDIR_SIZE.
> 
> As a heads-up, we've just changed MODULES_VSIZE to be 2G in
> 
>   https://lore.kernel.org/linux-arm-kernel/20230530110328.2213762-1-mark.rutland@arm.com/
> 
> .. which is queued in the arm64 for-next/module-alloc branch:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/module-alloc

Ok - so I need to get a bit more clever about calculating MODULES_END
and KIMAGE_VADDR.
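
Maybe something along these lines would do (untested sketch, and it
assumes ALIGN() is usable at that point in asm/memory.h):

#define KIMAGE_VADDR	(_PAGE_END(VA_BITS_MIN) + ALIGN(MODULES_VSIZE, PGDIR_SIZE))
#define MODULES_END	(KIMAGE_VADDR)
#define MODULES_VADDR	(MODULES_END - MODULES_VSIZE)

so the gap below the kernel is always a whole number of PGD entries,
however large MODULES_VSIZE grows.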

> > One side effect of this change is that KIMAGE_VADDR's definition now
> > includes PGDIR_SIZE (to leave room for the modules) but this is not
> > defined when asm/memory.h is included. This means KIMAGE_VADDR can
> > not be used in inline functions within this file, so we convert
> > kaslr_offset() and kaslr_enabled() to be macros instead.
> 
> That series above also decoupled kaslr_enabled() from kaslr_offset(), 
> so we'd only need to change kaslr_offset().

Ok, I'll take a look to see how my changes are impacted.

> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 4829abe017e9..baf74d0c43c9 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -478,7 +478,8 @@ void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
> >  static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
> >  				phys_addr_t size, pgprot_t prot)
> >  {
> > -	if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
> > +	if ((virt >= PAGE_END) && (virt < VMALLOC_START) &&
> > +	    !is_kernel(virt)) {
> >  		pr_warn("BUG: not updating mapping for %pa at 0x%016lx - outside kernel range\n",
> >  			&phys, virt);
> >  		return;
> 
> I think the existing conditions here aren't quite right, and have become bogus
> over time, and I don't think that the is_kernel() check is necessary here.
> 
> Originally, back in commit:
> 
>   c1cc1552616d0f35 ("arm64: MMU initialisation")
> 
> We had:
> 
> 	if (virt < VMALLOC_START) {
> 		pr_warning("BUG: not creating mapping for 0x%016llx at 0x%016lx - outside kernel range\n",
> 			   phys, virt);
> 		return;
> 	}
> 
> ... which checked that the VA range we were manipulating was in the TTBR1 VA
> range, as at the time, VMALLOC_START happened to be the lowest TTBR1 address.
> 
> That didn't substantially change until commit:
> 
>   14c127c957c1c607 ("arm64: mm: Flip kernel VA space")
> 
> ... when the test was changed to:
> 
> 	if ((virt >= VA_START) && (virt < VMALLOC_START)) {
> 		pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
> 			&phys, virt);
> 		return;
> 	}
> 
> Note: in that commit, VA_START was actually the end of the linear map (which
> was itself at the start of the TTBR1 address space), so this is just checking if
> we're poking a small portion of the TTBR1 address space, rather than if we're
> poking *outside* of the TTBR1 address space.
> 
> That doesn't make much sense, and I'm pretty sure that was a thinko rather than
> an intentional change of semantics.
> 
> I "fixed" that without thinking in commit:
> 
>   77ad4ce69321abbe ("arm64: memory: rename VA_START to PAGE_END")
> 
> ... making that:
> 
> 	if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
> 		pr_warn("BUG: not creating mapping for %pa at 0x%016lx - outside kernel range\n",
> 			&phys, virt);
> 		return;
> 	}
> 
> ... but clearly it has lost the original semantic and doesn't make much sense.
> 
> I think the test should actually be something like:
> 
> 	/* Must be a TTBR1 address */
> 	if (virt < PAGE_OFFSET ) {
> 		...
> 	}
> 
> ... and then we won't randomly trip for kernel mappings if those fall between
> the linear map and vmalloc range.

Okay, so that sounds like if this is fixed, then I won't need to patch
it! Yay!

Patch

diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c735afdf639b..089f556b7387 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -43,9 +43,9 @@ 
 #define VA_BITS			(CONFIG_ARM64_VA_BITS)
 #define _PAGE_OFFSET(va)	(-(UL(1) << (va)))
 #define PAGE_OFFSET		(_PAGE_OFFSET(VA_BITS))
-#define KIMAGE_VADDR		(MODULES_END)
-#define MODULES_END		(MODULES_VADDR + MODULES_VSIZE)
-#define MODULES_VADDR		(_PAGE_END(VA_BITS_MIN))
+#define KIMAGE_VADDR		(_PAGE_END(VA_BITS_MIN) + PGDIR_SIZE)
+#define MODULES_END		(KIMAGE_VADDR)
+#define MODULES_VADDR		(MODULES_END - MODULES_VSIZE)
 #define MODULES_VSIZE		(SZ_128M)
 #define VMEMMAP_START		(-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
 #define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
@@ -199,20 +199,14 @@  extern u64			kimage_vaddr;
 /* the offset between the kernel virtual and physical mappings */
 extern u64			kimage_voffset;
 
-static inline unsigned long kaslr_offset(void)
-{
-	return kimage_vaddr - KIMAGE_VADDR;
-}
+#define kaslr_offset()	((unsigned long)(kimage_vaddr - KIMAGE_VADDR))
 
-static inline bool kaslr_enabled(void)
-{
-	/*
-	 * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
-	 * placement of the image rather than from the seed, so a displacement
-	 * of less than MIN_KIMG_ALIGN means that no seed was provided.
-	 */
-	return kaslr_offset() >= MIN_KIMG_ALIGN;
-}
+/*
+ * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
+ * placement of the image rather than from the seed, so a displacement
+ * of less than MIN_KIMG_ALIGN means that no seed was provided.
+ */
+#define kaslr_enabled()	(kaslr_offset() >= MIN_KIMG_ALIGN)
 
 /*
  * Allow all memory at the discovery stage. We will clip it later.
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0bd18de9fd97..cb526e69299d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -21,7 +21,7 @@ 
  * VMALLOC_END: extends to the available space below vmemmap, PCI I/O space
  *	and fixed mappings
  */
-#define VMALLOC_START		(MODULES_END)
+#define VMALLOC_START		(MODULES_END + PGDIR_SIZE)
 #define VMALLOC_END		(VMEMMAP_START - SZ_256M)
 
 #define vmemmap			((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 4829abe017e9..baf74d0c43c9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -478,7 +478,8 @@  void __init create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
 static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
 				phys_addr_t size, pgprot_t prot)
 {
-	if ((virt >= PAGE_END) && (virt < VMALLOC_START)) {
+	if ((virt >= PAGE_END) && (virt < VMALLOC_START) &&
+	    !is_kernel(virt)) {
 		pr_warn("BUG: not updating mapping for %pa at 0x%016lx - outside kernel range\n",
 			&phys, virt);
 		return;