
[RFC,00/17] arm64 kernel text replication

Message ID: ZHYCUVa8fzmB4XZV@shell.armlinux.org.uk

Message

Russell King (Oracle) May 30, 2023, 2:04 p.m. UTC
Problem
-------

NUMA systems have greater latency when accessing data and instructions
across nodes, which can lead to a reduction in performance on CPU cores
that mainly perform accesses beyond their local node.

Normally, when an ARM64 system boots, the kernel ends up placed in
memory on one node, and each CPU core has to fetch instructions and
data from whichever NUMA node the kernel was placed in. This means
that, while executing kernel code, CPUs local to that node will run
faster than CPUs in remote nodes.

The higher the latency to access remote NUMA node memory, the more the
kernel performance suffers on those nodes.

If there is a local copy of the kernel text in each node's RAM, and
each node runs the kernel using its local copy of the kernel text,
then it stands to reason that the kernel will run faster due to fewer
stalls while instructions are fetched from remote memory.

The question then arises how to achieve this.

Background
----------

An important issue to contend with is what happens when a thread
migrates between nodes. Essentially, the thread's state (including
instruction pointer) is saved to memory, and the scheduler on that CPU
loads some other thread's state and that CPU resumes executing that
new thread.

The CPU gaining the migrating thread loads the saved state, again
including the instruction pointer, and the gaining CPU resumes fetching
instructions at the virtual address where the original CPU left off.

The key point is that the virtual address is what matters here, and
this gives us a way to implement kernel text replication fairly easily.
At a practical level, all we need to do is ensure that the virtual
addresses which contain the kernel text point to a local copy of that
text.

This is exactly how this proposal achieves the replication. We can go
a little further and include most of the read-only data in this
replication, as that will never be written to by the kernel (and thus
remains constant).

Solution
--------

So, what we need to achieve is:

1. multiple identical copies of the kernel text (and read-only data)
2. point the virtual mappings to the appropriate copy of kernel text
   for the NUMA node.

(1) is fairly easy to achieve - we just need to allocate some memory
in the appropriate node and copy the parts of the kernel we want to
replicate. However, we also need to deal with ARM64's kernel patching.
There are two functions that patch the kernel text,
__apply_alternatives() and aarch64_insn_patch_text_nosync(). Both of
these need to be modified to update all copies of the kernel text.
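
To illustrate the idea only - nothing below is from this series, and
the helper names are invented - patching conceptually becomes "apply
the same change to every node's copy at the same offset":

/*
 * Standalone C sketch of replicated-text patching. The real series
 * hooks __apply_alternatives() and aarch64_insn_patch_text_nosync();
 * none of the names below are real kernel APIs.
 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES       4

/* one replicated copy of the kernel text per NUMA node */
static uint8_t *ktext_copy[MAX_NODES];
static int nr_text_copies;

/* write 'len' bytes of new instructions at 'offset' into every copy */
static void ktext_patch_all(size_t offset, const void *insn, size_t len)
{
        for (int node = 0; node < nr_text_copies; node++) {
                memcpy(ktext_copy[node] + offset, insn, len);
                /* the real code would also do I/D cache maintenance here */
        }
}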

(2) is slightly harder.

Firstly, the aarch64 architecture has a very useful feature here - the
kernel page tables are entirely separate from the user page tables.
The hardware contains two page table pointers, one is used for user
mappings, the other is used for kernel mappings.

Therefore, we only have one page table to be concerned with: the table
which maps kernel space. We do not need to be concerned with each
user process's page tables.

The approach taken here is to ensure that the kernel is located in an
area of kernel virtual address space covered by a level-0 page table
entry which is not shared with any other user. We can then maintain
separate per-node level-0 page tables for kernel space where the only
difference between them is this level-0 page table entry.

This gives a couple of benefits. Firstly, when updates to the level-0
page table happen (e.g. when establishing new mappings), these updates
can simply be copied to the other level-0 page tables, provided they
are not for the kernel image. Secondly, we don't need complexity at
lower levels of the page table code to figure out whether a level-1 or
lower update needs to be propagated to other nodes.

The level-0 page table entry for the kernel can then be used to point
at a node-unique set of level 1..N page tables which map the
appropriate copy of the kernel text (and read-only data) into kernel
space, while keeping the kernel read-write data shared between nodes.

Performance Analysis
--------------------

Needless to say, the performance results from kernel text replication
are workload specific, but appear to show a gain of between 6% and
17% for database-centric workloads. When combined with userspace
awareness of NUMA, this can result in a gain of over 50%.

Problems
--------

There are a few areas that are a problem for kernel text replication:
1) As this series changes the kernel virtual address space layout, it
   breaks KASAN - and I have zero knowledge of KASAN, so I have no
   idea how to fix it. I would be grateful for suggestions from the
   KASAN folk on how to fix this.

2) KASLR can not be used with kernel text replication, since we need
   to place the kernel in its own L0 page table entry, not in vmalloc
   space. KASLR is disabled when support for kernel text replication
   is enabled.

3) Changing the kernel virtual address space layout also means that
   kaslr_offset() and kaslr_enabled() need to become macros rather
   than inline functions, because PGDIR_SIZE is now used in the
   calculation of KIMAGE_VADDR. asm/pgtable.h defines that constant,
   but asm/memory.h is included by asm/pgtable.h, so making it
   available in asm/memory.h would create a circular include
   dependency; I don't think there is any choice here. (A standalone
   illustration follows the list below.)

4) read-only protection for replicated kernel images is not yet
   implemented.
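
The include problem behind (3) in a nutshell - purely illustrative,
not the actual arm64 headers; the names and constants are invented:

/*
 * A macro is only expanded at its point of use, after all headers have
 * been included, so it may reference PGDIR_SIZE even though "memory.h"
 * cannot include the header that defines it. A static inline function
 * defined in "memory.h" would need PGDIR_SIZE at its definition point
 * and would fail to compile.
 */

/* --- would live in "memory.h" (cannot see PGDIR_SIZE yet) --- */
#define KIMAGE_VADDR    (0xffff800000000000UL + PGDIR_SIZE)

/* --- would live in "pgtable.h", which includes "memory.h" --- */
#define PGDIR_SIZE      (1UL << 39)

/* --- code that has included "pgtable.h": both names are visible, so
 *     the macro expands cleanly at the point of use --- */
unsigned long kimage_vaddr(void)
{
        return KIMAGE_VADDR;
}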

Patch overview:

Patch 1 cleans up the rox page protection logic.
Patch 2 reorganises the kernel virtual address space layout (causing
  problems 1 and 3 above).
Patch 3 provides a version of cpu_replace_ttbr1 that takes physical
  addresses.
Patch 4 makes a needed cache flushing function visible.
Patches 5 through 16 are the guts of kernel text replication.
Patch 17 adds the Kconfig entry for it.

Further patches, not included in this set, add a Kconfig option for
the default state, a test module, and code to verify that the
replicated kernel text matches the node 0 text after the kernel has
completed most of its boot.

 Documentation/admin-guide/kernel-parameters.txt |   5 +
 arch/arm64/Kconfig                              |  10 +-
 arch/arm64/include/asm/cacheflush.h             |   2 +
 arch/arm64/include/asm/ktext.h                  |  45 ++++++
 arch/arm64/include/asm/memory.h                 |  26 ++--
 arch/arm64/include/asm/mmu_context.h            |  12 +-
 arch/arm64/include/asm/pgtable.h                |  35 ++++-
 arch/arm64/include/asm/smp.h                    |   1 +
 arch/arm64/kernel/alternative.c                 |   4 +-
 arch/arm64/kernel/asm-offsets.c                 |   1 +
 arch/arm64/kernel/cpufeature.c                  |   2 +-
 arch/arm64/kernel/head.S                        |   3 +-
 arch/arm64/kernel/hibernate.c                   |   2 +-
 arch/arm64/kernel/patching.c                    |   7 +-
 arch/arm64/kernel/smp.c                         |   3 +
 arch/arm64/kernel/suspend.c                     |   3 +-
 arch/arm64/kernel/vmlinux.lds.S                 |   3 +
 arch/arm64/mm/Makefile                          |   2 +
 arch/arm64/mm/init.c                            |   3 +
 arch/arm64/mm/ktext.c                           | 198 ++++++++++++++++++++++++
 arch/arm64/mm/mmu.c                             |  85 ++++++++--
 21 files changed, 413 insertions(+), 39 deletions(-)
 create mode 100644 arch/arm64/include/asm/ktext.h
 create mode 100644 arch/arm64/mm/ktext.c

Comments

Russell King (Oracle) June 5, 2023, 9:05 a.m. UTC | #1
Hi,

Are there any comments on this?

Thanks.

Mark Rutland June 5, 2023, 1:46 p.m. UTC | #2
On Mon, Jun 05, 2023 at 10:05:22AM +0100, Russell King (Oracle) wrote:
> Hi,
> 
> Are there any comments on this?

This is on my queue of things to review, but I haven't had the chance to give
more than a cursory look so far. I'm hoping to get to it in the next few days.

Thanks,
Mark.

Ard Biesheuvel June 23, 2023, 3:24 p.m. UTC | #3
(cc Marc and Quentin)

On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> Hi,
>
> Are there any comments on this?
>

Hi Russell,

I think the proposed approach is sound, but it is rather intrusive, as
you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
work gets merged (which uses root level -1 when booted on LPA2 capable
hardware, and level 0 otherwise), we'll have yet another combination
that is either fully incompatible, or cumbersome to support at the
very least.

I wonder if it would be worthwhile to explore an alternative approach,
using pKVM and the host stage2:

- all stage1 kernel mappings remain as they are, and the kernel code
running at EL1 has no awareness of the replication beyond being
involved in allocating the memory;
- host is booted in protected KVM mode, which means that the host
kernel executes under a stage 2 mapping;
- each NUMA node has its own set of stage 2 page tables, and maps the
kernel's code/rodata IPA range to a NUMA local PA range
- the kernel's code and rodata are mapped read-only in the primary
stage-2 mapping so updates trap to EL2, permitting the hypervisor to
replicate those updates to all clones.

Note that pKVM retains the capabilities of ordinary KVM, so as long as
you boot at EL2, the only downside compared to your approach would be
the increased TLB footprint due to the stage 2 mappings for the host
kernel.
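
To make the intended division of labour concrete, here is a standalone
sketch of the stage-2 view (nothing here is real pKVM code; all names
and addresses are invented):

/*
 * Toy model: every node shares the same stage-1 (kernel VA -> IPA)
 * mappings, but each node's stage-2 maps the kernel text IPA range to
 * a node-local physical copy. Everything else is identity-mapped.
 */
#define NR_NODES        4
#define TEXT_IPA        0x40200000UL    /* IPA of kernel text (invented) */
#define TEXT_SIZE       0x01000000UL    /* size of text/rodata (invented) */

/* node-local physical copies of the text (invented addresses) */
static const unsigned long text_pa[NR_NODES] = {
        0x080200000UL, 0x880200000UL, 0x1080200000UL, 0x1880200000UL,
};

/* the IPA -> PA translation a CPU on 'node' would observe */
static unsigned long s2_translate(int node, unsigned long ipa)
{
        if (ipa >= TEXT_IPA && ipa - TEXT_IPA < TEXT_SIZE)
                return text_pa[node] + (ipa - TEXT_IPA);

        return ipa;     /* identity map for everything else */
}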

Marc, Quentin, Will: any thoughts?
Russell King (Oracle) June 23, 2023, 3:34 p.m. UTC | #4
On Fri, Jun 23, 2023 at 05:24:20PM +0200, Ard Biesheuvel wrote:
> (cc Marc and Quentin)
> 
> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Hi,
> >
> > Are there any comments on this?
> >
> 
> Hi Russell,
> 
> I think the proposed approach is sound, but it is rather intrusive, as
> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
> work gets merged (which uses root level -1 when booted on LPA2 capable
> hardware, and level 0 otherwise), we'll have yet another combination
> that is either fully incompatible, or cumbersome to support at the
> very least.
> 
> I wonder if it would be worthwhile to explore an alternative approach,
> using pKVM and the host stage2:
> 
> - all stage1 kernel mappings remain as they are, and the kernel code
> running at EL1 has no awareness of the replication beyond being
> involved in allocating the memory;
> - host is booted in protected KVM mode, which means that the host
> kernel executes under a stage 2 mapping;
> - each NUMA node has its own set of stage 2 page tables, and maps the
> kernel's code/rodata IPA range to a NUMA local PA range
> - the kernel's code and rodata are mapped read-only in the primary
> stage-2 mapping so updates trap to EL2, permitting the hypervisor to
> replicate those updates to all clones.
> 
> Note that pKVM retains the capabilities of ordinary KVM, so as long as
> you boot at EL2, the only downside compared to your approach would be
> the increased TLB footprint due to the stage 2 mappings for the host
> kernel.
> 
> Marc, Quentin, Will: any thoughts?

Thanks for taking a look.

That sounds great, but my initial question would be whether, with such a
setup, one could then run VMs under such a kernel without hardware that
supports nested virtualisation? I suspect the answer would be no.
Marc Zyngier June 23, 2023, 3:54 p.m. UTC | #5
On 2023-06-23 16:34, Russell King (Oracle) wrote:
> On Fri, Jun 23, 2023 at 05:24:20PM +0200, Ard Biesheuvel wrote:
>> (cc Marc and Quentin)
>> 
>> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
>> <linux@armlinux.org.uk> wrote:
>> >
>> > Hi,
>> >
>> > Are there any comments on this?
>> >
>> 
>> Hi Russell,
>> 
>> I think the proposed approach is sound, but it is rather intrusive, as
>> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
>> work gets merged (which uses root level -1 when booted on LPA2 capable
>> hardware, and level 0 otherwise), we'll have yet another combination
>> that is either fully incompatible, or cumbersome to support at the
>> very least.
>> 
>> I wonder if it would be worthwhile to explore an alternative approach,
>> using pKVM and the host stage2:
>> 
>> - all stage1 kernel mappings remain as they are, and the kernel code
>> running at EL1 has no awareness of the replication beyond being
>> involved in allocating the memory;
>> - host is booted in protected KVM mode, which means that the host
>> kernel executes under a stage 2 mapping;
>> - each NUMA node has its own set of stage 2 page tables, and maps the
>> kernel's code/rodata IPA range to a NUMA local PA range
>> - the kernel's code and rodata are mapped read-only in the primary
>> stage-2 mapping so updates trap to EL2, permitting the hypervisor to
>> replicate those updates to all clones.
>> 
>> Note that pKVM retains the capabilities of ordinary KVM, so as long as
>> you boot at EL2, the only downside compared to your approach would be
>> the increased TLB footprint due to the stage 2 mappings for the host
>> kernel.
>> 
>> Marc, Quentin, Will: any thoughts?
> 
> Thanks for taking a look.
> 
> That sounds great, but my initial question would be whether, with such a
> setup, one could then run VMs under such a kernel without hardware that
> supports nested virtualisation? I suspect the answer would be no.

The answer is yes. All you need to do is to switch between the host
and guest stage-2s in the hypervisor, which is what KVM running in
protected mode does.

         M.
Marc Zyngier June 23, 2023, 4:37 p.m. UTC | #6
On Fri, 23 Jun 2023 16:24:20 +0100,
Ard Biesheuvel <ardb@kernel.org> wrote:
> 
> (cc Marc and Quentin)
> 
> On Mon, 5 Jun 2023 at 11:05, Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
> >
> > Hi,
> >
> > Are there any comments on this?
> >
> 
> Hi Russell,
> 
> I think the proposed approach is sound, but it is rather intrusive, as
> you've pointed out already (wrt KASLR and KASAN etc). And once my LPA2
> work gets merged (which uses root level -1 when booted on LPA2 capable
> hardware, and level 0 otherwise), we'll have yet another combination
> that is either fully incompatible, or cumbersome to support at the
> very least.
> 
> I wonder if it would be worthwhile to explore an alternative approach,
> using pKVM and the host stage2:
> 
> - all stage1 kernel mappings remain as they are, and the kernel code
> running at EL1 has no awareness of the replication beyond being
> involved in allocating the memory;
> - host is booted in protected KVM mode, which means that the host
> kernel executes under a stage 2 mapping;
> - each NUMA node has its own set of stage 2 page tables, and maps the
> kernel's code/rodata IPA range to a NUMA local PA range
> - the kernel's code and rodata are mapped read-only in the primary
> stage-2 mapping so updates trap to EL2, permitting the hypervisor to
> replicate those updates to all clones.
> 
> Note that pKVM retains the capabilities of ordinary KVM, so as long as
> you boot at EL2, the only downside compared to your approach would be
> the increased TLB footprint due to the stage 2 mappings for the host
> kernel.
> 
> Marc, Quentin, Will: any thoughts?

I like the idea, though there are a couple of 'interesting' corner
cases:

- you have to give up VHE, which means that if your workload is to
  mainly run VMs, you pay an extra cost on each guest entry/exit

- the EL2 code doesn't have the luxury of a stage-2, meaning that
  either you accept the fact that this code is going to suffer from
  uneven performance, or you keep the complexity of the kernel-visible
  replication for the EL2 code only

- memory allocation for the stage-2 is tricky (Quentin can talk about
  that), and relies on being able to steal enough memory to cover the
  whole of the host's memory-map, including I/O. Having a set of S2
  PTs per node is going to increase that pressure/complexity

- I'm not too worried about the TLB aspect. Cores tend to cache VA/PA,
  not VA/IPA+IPA/PA. What is going to cost is the walk itself. This
  could be mitigated if S2 uses large mappings (possibly using 64k
  pages).

The last point makes me think that what we want here may not be pKVM
itself, but something that builds on top of what pKVM has (host S2)
and the nVHE/hVHE behaviour.

Thanks,

	M.
Lameter, Christopher June 26, 2023, 11:42 p.m. UTC | #7
On Fri, 23 Jun 2023, Marc Zyngier wrote:

>> That sounds great, but my initial question would be whether, with such a
>> setup, one could then run VMs under such a kernel without hardware that
>> supports nested virtualisation? I suspect the answer would be no.
>
> The answer is yes. All you need to do is to switch between the host
> and guest stage-2s in the hypervisor, which is what KVM running in
> protected mode does.

Well I think his point was that there are machines running without a 
hypervisor and kernel replication needs to work on that. We certainly 
benefit a lot from kernel replication and our customers may elect to run 
ARM64 kernels without hypervisors on bare metal.
Marc Zyngier June 27, 2023, 8:02 a.m. UTC | #8
On Tue, 27 Jun 2023 00:42:53 +0100,
"Lameter, Christopher" <cl@os.amperecomputing.com> wrote:
> 
> On Fri, 23 Jun 2023, Marc Zyngier wrote:
> 
> >> That sounds great, but my initial question would be whether, with such a
> >> setup, one could then run VMs under such a kernel without hardware that
> >> supports nested virtualisation? I suspect the answer would be no.
> > 
> > The answer is yes. All you need to do is to switch between the host
> > and guest stage-2s in the hypervisor, which is what KVM running in
> > protected mode does.
> 
> Well I think his point was that there are machines running without a
> hypervisor and kernel replication needs to work on that. We certainly
> benefit a lot from kernel replication and our customers may elect to
> run ARM64 kernels without hypervisors on bare metal.

These are not incompatible goals.

The hypervisor is a function that the user may want to enable or not.
Irrespective of that, the HW that underpins the virtualisation
functionality is available and allows you to solve this particular
problem in a different way. This doesn't preclude from running
bare-metal at all.

There is even precedent for using stage-2 to work around critical bugs
(the Socionext PCIe fiasco springs to mind).

	M.