Message ID | 20231012024842.99703-1-rongwei.wang@linux.alibaba.com
Series     | support NUMA emulation for arm64
Hello Rongwei,

On 10/12/23 04:48, Rongwei Wang wrote:
> A brief introduction
> ====================
>
> NUMA emulation can fake more nodes based on a single-node system,
> e.g.
>
> one node system:
>
> [root@localhost ~]# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 31788 MB
> node 0 free: 31446 MB
> node distances:
> node   0
>   0:  10
>
> add numa=fake=2 (fake 2 nodes on each original node):
>
> [root@localhost ~]# numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 15806 MB
> node 0 free: 15451 MB
> node 1 cpus: 0 1 2 3 4 5 6 7
> node 1 size: 16029 MB
> node 1 free: 15989 MB
> node distances:
> node   0   1
>   0:  10  10
>   1:  10  10
>
> As shown above, a new node has been faked. For cpus, the behaviour of
> the x86 NUMA emulation is kept. Maybe giving each node 4 cores would
> be better (not sure; to do next if so).
>
> Why do this
> ===========
>
> There are the following reasons:
> (1) On an x86 host, NUMA emulation can fake a multi-node environment
>     for testing or verifying performance work, but on arm64 the only
>     way to do this is to modify the ACPI table, which is more or less
>     troublesome.
> (2) It reduces contention on some locks. Here is an example we found:
>     will-it-scale/tlb_flush1_processes -t 96 -s 10 shows an obvious
>     hotspot on lruvec->lock when run on a single-node system, and
>     performance improves greatly when the same test runs on a system
>     with two or more nodes. The data is below (higher is better):
>
> ----------------------------------------------------------------------
> threads/process |      1 |      12 |      24 |      48 |      96
> ----------------------------------------------------------------------
> one node        | 141122 | 1105372 | 1112615 |  797084 |  724516
> ----------------------------------------------------------------------
> numa=fake=2     | 141168 | 1444848 | 2159070 | 1570412 | 1423968
> ----------------------------------------------------------------------
>                 | At concurrency 12, there is no lruvec->lock hotspot.
> hotspot         | At 24, the one-node case has a 24% hotspot on
>                 | lruvec->lock; the two-node environment has none.
> ----------------------------------------------------------------------
>
> As for risks (e.g. NUMA balancing...), they need to be discussed here.
>
> Lastly, this is just a draft; I can improve it if it's acceptable.

I'm not engaging on the utility/relevance of the patch-set, but I tried
them on an arm64 system with the 'numa=fake=2' parameter and could not
see 2 nodes being created under:
/sys/devices/system/node/

Indeed it seems that even though numa_emulation() is moved to a generic
mm/numa.c file, the function is only called from:
arch/x86/mm/numa.c:numa_init()
(or maybe I'm misinterpreting the intent of the patches).

Also I had the following errors when building (still for arm64):

mm/numa.c:862:8: error: implicit declaration of function
'early_cpu_to_node' is invalid in C99
[-Werror,-Wimplicit-function-declaration]
        nid = early_cpu_to_node(cpu);
              ^
mm/numa.c:862:8: note: did you mean 'early_map_cpu_to_node'?
./include/asm-generic/numa.h:37:13: note: 'early_map_cpu_to_node'
declared here
void __init early_map_cpu_to_node(unsigned int cpu, int nid);
            ^
mm/numa.c:874:3: error: implicit declaration of function
'debug_cpumask_set_cpu' is invalid in C99
[-Werror,-Wimplicit-function-declaration]
                debug_cpumask_set_cpu(cpu, nid, enable);
                ^
mm/numa.c:874:3: note: did you mean '__cpumask_set_cpu'?
./include/linux/cpumask.h:474:29: note: '__cpumask_set_cpu' declared
here
static __always_inline void __cpumask_set_cpu(unsigned int cpu,
struct cpumask *dstp)
                            ^
2 errors generated.

Regards,
Pierre

> Thanks!
>
> Rongwei Wang (5):
>   mm/numa: move numa emulation APIs into generic files
>   mm: percpu: fix variable type of cpu
>   arch_numa: remove __init in early_cpu_to_node()
>   mm/numa: support CONFIG_NUMA_EMU for arm64
>   mm/numa: migrate leftover numa emulation into mm/numa.c
>
>  arch/x86/Kconfig                          |   8 -
>  arch/x86/include/asm/numa.h               |   3 -
>  arch/x86/mm/Makefile                      |   1 -
>  arch/x86/mm/numa.c                        | 216 +-------------
>  arch/x86/mm/numa_internal.h               |  14 +-
>  drivers/base/arch_numa.c                  |   7 +-
>  include/asm-generic/numa.h                |  33 +++
>  include/linux/percpu.h                    |   2 +-
>  mm/Kconfig                                |   8 +
>  mm/Makefile                               |   1 +
>  arch/x86/mm/numa_emulation.c => mm/numa.c | 333 +++++++++++++++++++++-
>  11 files changed, 373 insertions(+), 253 deletions(-)
>  rename arch/x86/mm/numa_emulation.c => mm/numa.c (63%)
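[The node sizes in the numactl output above come from splitting one
physical node's memory roughly in half. The following minimal userspace
sketch illustrates only that splitting arithmetic; it is not the kernel
implementation, and the base address, total size, and remainder handling
are assumptions made for the example.]

/* Illustration of the numa=fake=N splitting arithmetic (userspace). */
#include <stdio.h>

int main(void)
{
	unsigned long long base = 0;            /* assumed start of node 0 */
	unsigned long long size = 32ULL << 30;  /* ~32 GB, as in the example */
	int nr_fake = 2;                        /* numa=fake=2 */
	unsigned long long chunk = size / nr_fake;

	for (int i = 0; i < nr_fake; i++) {
		unsigned long long start = base + (unsigned long long)i * chunk;
		/* the last emulated node absorbs any rounding remainder */
		unsigned long long end = (i == nr_fake - 1) ? base + size
							    : start + chunk;
		printf("fake node %d: [%#llx - %#llx]\n", i, start, end);
	}
	return 0;
}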
On 2023/10/12 20:37, Pierre Gondois wrote:
> Hello Rongwei,
>
> On 10/12/23 04:48, Rongwei Wang wrote:
>> A brief introduction
>> ====================
>> [...]
>
> I'm not engaging on the utility/relevance of the patch-set, but I tried
> them on an arm64 system with the 'numa=fake=2' parameter and could not

Sorry, my fault. I should have mentioned this in the brief introduction:
acpi=on numa=fake=2.

The default path of arm64 NUMA initialization is numa_init() ->
dummy_numa_init() when ACPI is turned off (this path has not been taken
into account yet in this patch set; it is next to do).

What's more, if you test this patch set in qemu-kvm, you should add the
parameters below to the script:

-object memory-backend-ram,id=mem0,size=32G \
-numa node,memdev=mem0,cpus=0-7,nodeid=0 \

(The above parameters just make sure the SRAT table has a NUMA
configuration, avoiding the numa_init() -> dummy_numa_init() path.)

> see 2 nodes being created under:
> /sys/devices/system/node/
> Indeed it seems that even though numa_emulation() is moved to a generic
> mm/numa.c file, the function is only called from:
> arch/x86/mm/numa.c:numa_init()
> (or maybe I'm misinterpreting the intent of the patches).

Here drivers/base/arch_numa.c:numa_init() has called numa_emulation()
(I guess it works if you add acpi=on :-)).

> Also I had the following errors when building (still for arm64):
> mm/numa.c:862:8: error: implicit declaration of function
> 'early_cpu_to_node' is invalid in C99
> [-Werror,-Wimplicit-function-declaration]
>         nid = early_cpu_to_node(cpu);
> [...]

It seems CONFIG_DEBUG_PER_CPU_MAPS is enabled in your environment? You
can disable CONFIG_DEBUG_PER_CPU_MAPS and test again; I have not tested
with CONFIG_DEBUG_PER_CPU_MAPS enabled. This is very helpful, I will
fix it next time.

If you have any questions, please let me know.

Regards,
-wrw

> [...]
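[After booting with those QEMU parameters and acpi=on, a quick check
like the one below shows whether the emulated nodes actually appeared
under /sys/devices/system/node/, the location Pierre inspected. This is
a minimal userspace sketch assuming the standard sysfs layout; numactl -H
reports the same information.]

/* Count node<N> directories under /sys/devices/system/node/. */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
	DIR *dir = opendir("/sys/devices/system/node");
	struct dirent *ent;
	int nodes = 0;

	if (!dir) {
		perror("opendir");
		return 1;
	}
	while ((ent = readdir(dir)) != NULL) {
		unsigned int id;
		/* entries of interest look like "node0", "node1", ... */
		if (sscanf(ent->d_name, "node%u", &id) == 1) {
			printf("found node%u\n", id);
			nodes++;
		}
	}
	closedir(dir);
	printf("%d NUMA node(s)\n", nodes);
	return 0;
}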
Hello Rongwei,

On 10/12/23 15:30, Rongwei Wang wrote:
> On 2023/10/12 20:37, Pierre Gondois wrote:
>> Hello Rongwei,
>>
>> On 10/12/23 04:48, Rongwei Wang wrote:
>>> A brief introduction
>>> ====================
>>> [...]
>>
>> I'm not engaging on the utility/relevance of the patch-set, but I tried
>> them on an arm64 system with the 'numa=fake=2' parameter and could not
>> see 2 nodes being created under:
>> /sys/devices/system/node/
>
> [...]
>
>> Indeed it seems that even though numa_emulation() is moved to a generic
>> mm/numa.c file, the function is only called from:
>> arch/x86/mm/numa.c:numa_init()
>> (or maybe I'm misinterpreting the intent of the patches).
>
> Here drivers/base/arch_numa.c:numa_init() has called numa_emulation()
> (I guess it works if you add acpi=on :-)).

I don't see numa_emulation() being called from
drivers/base/arch_numa.c:numa_init()

I have:

$ git grep numa_emulation
arch/x86/mm/numa.c:	numa_emulation(&numa_meminfo, numa_distance_cnt);
arch/x86/mm/numa_internal.h:extern void __init numa_emulation(struct numa_meminfo *numa_meminfo,
include/asm-generic/numa.h:void __init numa_emulation(struct numa_meminfo *numa_meminfo,
mm/numa.c:/* Most of this file comes from x86/numa_emulation.c */
mm/numa.c: * numa_emulation - Emulate NUMA nodes
mm/numa.c:void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)

so from this, an arm64-based platform should not be able to call
numa_emulation(). Is it possible to add a call to dump_stack() in
numa_emulation() to see the call stack?

The branch I'm using is based on v6.6-rc5 and has the 5 patches applied:

2af398a87cc7 mm/numa: migrate leftover numa emulation into mm/numa.c
c8e314fb23be mm/numa: support CONFIG_NUMA_EMU for arm64
335b7219d40e arch_numa: remove __init in early_cpu_to_node()
d9358adf1cdc mm: percpu: fix variable type of cpu
1ffbe40a00f5 mm/numa: move numa emulation APIs into generic files
94f6f0550c62 (tag: v6.6-rc5) Linux 6.6-rc5

Regards,
Pierre

> [...]
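[The temporary instrumentation Pierre asks for above could look like the
following sketch. The function signature is taken from the git grep
output earlier in the thread; the elided body is an assumption, and
dump_stack() is the standard kernel helper for printing the current call
chain to the kernel log.]

/*
 * Sketch only -- not part of the series.  Printing the call chain on
 * entry would show whether the arm64 init path ever reaches this
 * function.
 */
void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
{
	dump_stack();	/* temporary: show who called us */

	/* ... existing emulation logic ... */
}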