Message ID: 20231228131056.602411-1-artem.kuzin@huawei.com (mailing list archive)
Series: x86 NUMA-aware kernel replication
On Thu, Dec 28, 2023 at 09:10:44PM +0800, artem.kuzin@huawei.com wrote:
> There was work previously published for the ARM64 platform
> by Russell King (arm64 kernel text replication).
> We hope that it will be possible to push this technology forward together.

Honestly, I don't think the arm64 kernel text replication is going to progress. I had zero feedback on the last posting, which suggests that there is very little interest in it.

With Ard's suggestion to use pKVM for it, that's totally and utterly outside my realm of knowledge: I have no idea how to implement it that way, nor what the implications of doing so would be. Would it prevent being able to run KVM guests? How does it interact with the KVM hypervisor? Does it require nested virtualisation (which isn't supported on the platforms that Oracle are interested in)?

Then there is the question of how the pKVM hypervisor would grab the memory it needs to replicate the kernel. Having been through all the different layers of firmware, boot loader, etc., the conclusion was that this is something the kernel should be doing - and the code that Ard pointed me towards was firmware-based.

So, right now I think arm64 kernel text replication is rather dead in the water. Honestly, I now utterly regret bringing up this idea inside Oracle. It has become something of a millstone around my neck.
Hi Artem,

> Preliminary performance evaluation results:
> Processor Intel(R) Xeon(R) CPU E5-2690
> 2 nodes with 12 CPU cores each
>
> fork/1 - Time measurements include only one invocation of this system call.
> Measurements are made between entering and exiting the system call.
>
> fork/1024 - The system call is invoked in a loop 1024 times.
> The time between entering the loop and exiting it was measured.
>
> mmap/munmap - A set of 1024 pages (if PAGE_SIZE is not defined, it is equal to 4096)
> was mapped using the mmap syscall and unmapped using the munmap one.
> Every page is mapped/unmapped per loop iteration.
>
> mmap/lock - The same as above, but in this case the MAP_LOCKED flag was added.
>
> open/close - The /dev/null pseudo-file was opened and closed in a loop 1024 times.
> It was opened and closed once per iteration.
>
> mount - The procfs pseudo-filesystem was mounted to a temporary directory inside /tmp only once.
> The time between entering and exiting the system call was measured.
>
> kill - A signal handler for SIGUSR1 was set up. The signal was sent to a child process,
> which was created using glibc's fork wrapper. The time between sending and receiving
> the SIGUSR1 signal was measured.
>
> Hot caches:
>
> fork-1       2.3%
> fork-1024   10.8%
> mmap/munmap  0.4%
> mmap/lock    4.2%
> open/close   3.2%
> kill         4%
> mount        8.7%
>
> Cold caches:
>
> fork-1      42.7%
> fork-1024   17.1%
> mmap/munmap  0.4%
> mmap/lock    1.5%
> open/close   0.4%
> kill        26.1%
> mount        4.1%

I've conducted some testing on an AMD EPYC 7713 64-core processor (dual socket, 2 NUMA nodes, 64 CPUs on each node) to evaluate the performance of this patchset.
I've implemented the syscall-based testcases as suggested in your previous mail. I'm shielding the 2nd NUMA node using isolcpus and nohz_full, and executing the tests on CPUs belonging to this node.

Performance evaluation results (% gain over base kernel 6.5.0-rc5):

Hot caches:
fork-1        1.1%
fork-1024    -3.8%
mmap/munmap  -1.5%
mmap/lock    -4.7%
open/close   -6.8%
kill          3.3%
mount       -13.0%

Cold caches:
fork-1        1.2%
fork-1024    -7.2%
mmap/munmap  -1.6%
mmap/lock    -1.0%
open/close    4.6%
kill        -54.2%
mount        -8.5%

Thanks,
Shivank
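[Editor's note: for reference, a minimal userspace sketch of the mmap/munmap testcase as described above. The actual harness was not posted, so the CLOCK_MONOTONIC-based timing, the NPAGES constant and all names here are assumptions, not the code behind the numbers above.]

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define NPAGES 1024

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	long long ns;

	if (page < 0)
		page = 4096;	/* fallback, as in the test description */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < NPAGES; i++) {
		/* one page mapped and unmapped per loop iteration;
		 * add MAP_LOCKED to the flags for the mmap/lock variant */
		void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		if (munmap(p, page))
			return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	     (t1.tv_nsec - t0.tv_nsec);
	printf("mmap/munmap: %lld ns for %d iterations\n", ns, NPAGES);
	return 0;
}

For the shielded-node setup, booting with something like isolcpus=64-127 nohz_full=64-127 and running the binary under numactl --cpunodebind=1 --membind=1 would match the description; the exact CPU ranges depend on the machine's enumeration.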
On 1/25/2024 7:30 AM, Garg, Shivank wrote:
> Hi Artem,
>
> [...]
>
> I've conducted some testing on an AMD EPYC 7713 64-core processor (dual socket, 2 NUMA nodes, 64 CPUs on each node) to evaluate the performance of this patchset.
> I've implemented the syscall-based testcases as suggested in your previous mail. I'm shielding the 2nd NUMA node using isolcpus and nohz_full, and executing the tests on CPUs belonging to this node.
>
> Performance evaluation results (% gain over base kernel 6.5.0-rc5):
>
> Hot caches:
> fork-1        1.1%
> fork-1024    -3.8%
> mmap/munmap  -1.5%
> mmap/lock    -4.7%
> open/close   -6.8%
> kill          3.3%
> mount       -13.0%
>
> Cold caches:
> fork-1        1.2%
> fork-1024    -7.2%
> mmap/munmap  -1.6%
> mmap/lock    -1.0%
> open/close    4.6%
> kill        -54.2%
> mount        -8.5%
>
> Thanks,
> Shivank

Hi Shivank,

Thank you for the performance evaluation. Unfortunately, we don't have an AMD EPYC machine right now; I'll try to find a way to perform measurements and clarify why there is such a difference. We are currently trying to evaluate performance using database-related benchmarks, and will return with the results after clarification.

BR
From: Artem Kuzin <artem.kuzin@huawei.com>

This patchset implements initial support for kernel text and rodata
replication on the x86_64 platform. Linux kernel 6.5.5 is used as the
baseline.

There was work previously published for the ARM64 platform by Russell
King (arm64 kernel text replication). We hope that it will be possible
to push this technology forward together.

The current implementation supports the following functionality:
1. Kernel text and rodata are replicated per NUMA node.
2. Vmalloc is able to work with replicated areas, so kernel module
   text and rodata are also replicated during the module loading stage.
3. BPF handlers are not replicated by default, but this can easily be
   done using existing APIs.
4. KASAN works, except for the 5-level translation table case.
5. KPROBES, KGDB and all functionality that depends on kernel text
   patching work without any limitations.
6. KPTI and KASLR are fully supported.
7. The parts of the translation tables related to replicated text and
   rodata are themselves replicated.

Translation table synchronization is necessary only in several special
cases:
1. Kernel boot
2. Module deployment
3. Any allocation in user space that requires a new PUD/P4D

In the current design, modifications of mutable kernel data don't
require synchronization between translation tables, because on 64-bit
platforms all physical memory is already mapped in kernel space and
this mapping is persistent.

In user space, translation table synchronizations are quite rare, since
the only case is a new PUD/P4D allocation. Currently, only the PGD
layer is replicated for user space. Please refer to the diagrams below.

TT overview:

              NODE 0                        NODE 1
         USER      KERNEL              USER      KERNEL
      ---------------------         ---------------------
 PGD  | | | | | | | | |*|           | | | | | | | | |*|
      ---------------------         ---------------------
                        |                             |
        -------------------         -------------------
        |                           |
      ---------------------         ---------------------
 PUD  | | | | | | | |*|*|           | | | | | | | |*|*|
      ---------------------         ---------------------
                        |                             |
        -------------------         -------------------
        |                           |
      ---------------------         ---------------------
 PMD  |READ-ONLY|MUTABLE |          |READ-ONLY|MUTABLE |
      ---------------------         ---------------------
          |        |                    |        |
          ----------------------------------------
             |                 |                |
         --------           -------         --------
 PHYS    |      |           |     |         |      |
 MEM     --------           -------         --------
         <------>                           <------>
          NODE 0            Shared           NODE 1
                         between nodes

 * - entries unique in each table

TT synchronization:

              NODE 0                        NODE 1
         USER      KERNEL              USER      KERNEL
      ---------------------         ---------------------
 PGD  | | |0| | | | | | |           | | |0| | | | | | |
      ---------------------         ---------------------
             |
             |  PUD_ALLOC / P4D_ALLOC
             |  IN USERSPACE
             \/
      ---------------------         ---------------------
 PGD  | | |p| | | | | | |           | | |p| | | | | | |
      ---------------------         ---------------------
             |                             |
             -------------------------------
                            |
                 ---------------------
 PUD/P4D         | | | | | | | | | |
                 ---------------------

Known problems:
1. KASAN does not work in the case of 5-level translation tables.
2. Replication support in vmalloc can possibly be optimized in the
   future.
3. Module APIs currently lack memory policy support. This will be
   fixed in the future.
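[Editor's note: to make the user-space synchronization case pictured above more concrete, here is a rough sketch of propagating a newly populated user PGD entry to every per-node replica. It is illustrative only, not the patchset's actual code: per_node_pgd() is a hypothetical accessor standing in for the bookkeeping in mm/numa_replication.c, while pgd_offset(), pgd_offset_pgd() and set_pgd() are existing kernel helpers.]

#include <linux/mm.h>
#include <linux/nodemask.h>
#include <linux/pgtable.h>

/*
 * Illustrative sketch: after a new PUD/P4D has been installed in the
 * source PGD for a user address, copy that entry into the PGD replica
 * of every node.  User entries are identical across replicas; only the
 * kernel text/rodata entries differ (marked '*' in the diagram above).
 */
static void replicate_user_pgd_entry(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *src = pgd_offset(mm, addr);
	int nid;

	for_each_online_node(nid) {
		/* per_node_pgd() is hypothetical: node nid's PGD replica */
		pgd_t *dst = pgd_offset_pgd(per_node_pgd(mm, nid), addr);

		if (dst != src)
			set_pgd(dst, *src);
	}
}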
Preliminary performance evaluation results:
Processor: Intel(R) Xeon(R) CPU E5-2690
2 nodes with 12 CPU cores each

fork/1 - Time measurements include only one invocation of this system call.
Measurements are made between entering and exiting the system call.

fork/1024 - The system call is invoked in a loop 1024 times.
The time between entering the loop and exiting it was measured.

mmap/munmap - A set of 1024 pages (if PAGE_SIZE is not defined, it is equal to 4096)
was mapped using the mmap syscall and unmapped using the munmap one.
Every page is mapped/unmapped per loop iteration.

mmap/lock - The same as above, but in this case the MAP_LOCKED flag was added.

open/close - The /dev/null pseudo-file was opened and closed in a loop 1024 times.
It was opened and closed once per iteration.

mount - The procfs pseudo-filesystem was mounted to a temporary directory inside /tmp only once.
The time between entering and exiting the system call was measured.

kill - A signal handler for SIGUSR1 was set up. The signal was sent to a child process,
which was created using glibc's fork wrapper. The time between sending and receiving
the SIGUSR1 signal was measured.

Hot caches:

fork-1       2.3%
fork-1024   10.8%
mmap/munmap  0.4%
mmap/lock    4.2%
open/close   3.2%
kill         4%
mount        8.7%

Cold caches:

fork-1      42.7%
fork-1024   17.1%
mmap/munmap  0.4%
mmap/lock    1.5%
open/close   0.4%
kill        26.1%
mount        4.1%

Artem Kuzin (12):
  mm: allow per-NUMA node local PUD/PMD allocation
  mm: add config option and per-NUMA node VMS support
  mm: per-NUMA node replication core infrastructure
  x86: add support of memory protection for NUMA replicas
  x86: enable memory protection for replicated memory
  x86: align kernel text and rodata using HUGE_PAGE boundary
  x86: enable per-NUMA node kernel text and rodata replication
  x86: make kernel text patching aware about replicas
  x86: add support of NUMA replication for efi page tables
  mm: add replicas allocation support for vmalloc
  x86: add kernel modules text and rodata replication support
  mm: set memory permissions for BPF handlers replicas

 arch/x86/include/asm/numa_replication.h |  42 ++
 arch/x86/include/asm/pgalloc.h          |  10 +
 arch/x86/include/asm/set_memory.h       |  14 +
 arch/x86/kernel/alternative.c           | 116 ++---
 arch/x86/kernel/kprobes/core.c          |   2 +-
 arch/x86/kernel/module.c                |  35 +-
 arch/x86/kernel/smpboot.c               |   2 +
 arch/x86/kernel/vmlinux.lds.S           |   4 +-
 arch/x86/mm/dump_pagetables.c           |   9 +
 arch/x86/mm/fault.c                     |   4 +-
 arch/x86/mm/init.c                      |   8 +-
 arch/x86/mm/init_64.c                   |   4 +-
 arch/x86/mm/pat/set_memory.c            | 150 ++++++-
 arch/x86/mm/pgtable.c                   |  76 +++-
 arch/x86/mm/pti.c                       |   2 +-
 arch/x86/mm/tlb.c                       |  30 +-
 arch/x86/platform/efi/efi_64.c          |   9 +
 include/asm-generic/pgalloc.h           |  34 ++
 include/asm-generic/set_memory.h        |  12 +
 include/linux/gfp.h                     |   2 +
 include/linux/mm.h                      |  79 +++-
 include/linux/mm_types.h                |  11 +-
 include/linux/moduleloader.h            |  10 +
 include/linux/numa_replication.h        |  85 ++++
 include/linux/set_memory.h              |  10 +
 include/linux/vmalloc.h                 |  24 +
 init/main.c                             |   5 +
 kernel/bpf/bpf_struct_ops.c             |   8 +-
 kernel/bpf/core.c                       |   4 +-
 kernel/bpf/trampoline.c                 |   6 +-
 kernel/module/main.c                    |   8 +
 kernel/module/strict_rwx.c              |  14 +-
 mm/Kconfig                              |  10 +
 mm/Makefile                             |   1 +
 mm/memory.c                             | 251 ++++++++++-
 mm/numa_replication.c                   | 564 ++++++++++++++++++++++++
 mm/page_alloc.c                         |  18 +
 mm/vmalloc.c                            | 469 ++++++++++++++++----
 net/bpf/bpf_dummy_struct_ops.c          |   2 +-
 39 files changed, 1919 insertions(+), 225 deletions(-)
 create mode 100644 arch/x86/include/asm/numa_replication.h
 create mode 100644 include/linux/numa_replication.h
 create mode 100644 mm/numa_replication.c