| Message ID | 20200428093836.27190-4-srikar@linux.vnet.ibm.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Offline memoryless cpuless node 0 |
On Tue, 28 Apr 2020 15:08:36 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot. However in practice,
> there are systems which have node 0 as memoryless and cpuless.
>
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
>
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>  	[N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,
> +#else
>  	[N_ONLINE] = { { [0] = 1UL } },
> -#ifndef CONFIG_NUMA
>  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
>  #ifdef CONFIG_HIGHMEM
>  	[N_HIGH_MEMORY] = { { [0] = 1UL } },

So on all other NUMA machines, when does node 0 get marked online?

This change means that for some time during boot, such machines will
now be running with node 0 marked as offline. What are the
implications of this? Will something break?
> >
> > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> > always online.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> >   */
> >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > +#ifdef CONFIG_NUMA
> > +	[N_ONLINE] = NODE_MASK_NONE,
> > +#else
> >  	[N_ONLINE] = { { [0] = 1UL } },
> > -#ifndef CONFIG_NUMA
> >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> >  #ifdef CONFIG_HIGHMEM
> >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
>
> So on all other NUMA machines, when does node 0 get marked online?
>
> This change means that for some time during boot, such machines will
> now be running with node 0 marked as offline. What are the
> implications of this? Will something break?

Till the nodes are detected, marking Node 0 as online tends to be redundant.
Because the system doesn't know if its a NUMA or a non-NUMA system.
Once we detect the nodes, we online them immediately. Hence I don't see any
side-effects or negative implications of this change.

However if I am missing anything, please do let me know.

From my part, I have tested this on
1. Non-NUMA Single node but CPUs and memory coming from zero node.
2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
3. NUMA Multi node but with CPUs and memory from node 0.
4. NUMA Multi node but with no CPUs and memory from node 0.
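As context for the onlining step described above ("once we detect the nodes, we online them immediately"): it goes through the nodemask helpers. A minimal sketch of what an arch's NUMA detection path does per discovered node follows (the wrapper function here is hypothetical; only node_set_online()/node_set_state() from <linux/nodemask.h> are real interfaces):

#include <linux/nodemask.h>

/*
 * Illustrative sketch, not an actual call site: when arch NUMA setup
 * discovers a node (device tree on powerpc, SRAT on x86), it marks the
 * node online.  Until this runs for node 0, that node now stays offline
 * with this patch, which is the boot window Andrew asks about above.
 */
static void __init example_mark_node_detected(int nid)	/* hypothetical */
{
	node_set_online(nid);	/* sets N_ONLINE in node_states[] */
}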
On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > >
> > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> > > always online.
> > >
> > > ...
> > >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > >   */
> > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > > +#ifdef CONFIG_NUMA
> > > +	[N_ONLINE] = NODE_MASK_NONE,
> > > +#else
> > >  	[N_ONLINE] = { { [0] = 1UL } },
> > > -#ifndef CONFIG_NUMA
> > >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > >  #ifdef CONFIG_HIGHMEM
> > >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> >
> > So on all other NUMA machines, when does node 0 get marked online?
> >
> > This change means that for some time during boot, such machines will
> > now be running with node 0 marked as offline. What are the
> > implications of this? Will something break?
>
> Till the nodes are detected, marking Node 0 as online tends to be redundant.
> Because the system doesn't know if its a NUMA or a non-NUMA system.
> Once we detect the nodes, we online them immediately. Hence I don't see any
> side-effects or negative implications of this change.
>
> However if I am missing anything, please do let me know.
>
> From my part, I have tested this on
> 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> 3. NUMA Multi node but with CPUs and memory from node 0.
> 4. NUMA Multi node but with no CPUs and memory from node 0.

Have you tested on something else than ppc? Each arch does the NUMA
setup separately and this is a big mess. E.g. x86 marks even memory less
nodes (see init_memory_less_node) as online.

Honestly I have hard time to evaluate the effect of this patch. It makes
some sense to assume all nodes offline before they get online but this
is a land mine territory.

I am also not sure what kind of problem this is going to address. You
have mentioned numa balancing without many details.
* Michal Hocko <mhocko@kernel.org> [2020-04-29 14:22:11]:

> On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > >
> > > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> > > > always online.
> > > >
> > > > ...
> > > >
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > > >   */
> > > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > > >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > > > +#ifdef CONFIG_NUMA
> > > > +	[N_ONLINE] = NODE_MASK_NONE,
> > > > +#else
> > > >  	[N_ONLINE] = { { [0] = 1UL } },
> > > > -#ifndef CONFIG_NUMA
> > > >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > > >  #ifdef CONFIG_HIGHMEM
> > > >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> > >
> > > So on all other NUMA machines, when does node 0 get marked online?
> > >
> > > This change means that for some time during boot, such machines will
> > > now be running with node 0 marked as offline. What are the
> > > implications of this? Will something break?
> >
> > Till the nodes are detected, marking Node 0 as online tends to be redundant.
> > Because the system doesn't know if its a NUMA or a non-NUMA system.
> > Once we detect the nodes, we online them immediately. Hence I don't see any
> > side-effects or negative implications of this change.
> >
> > However if I am missing anything, please do let me know.
> >
> > From my part, I have tested this on
> > 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> > 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> > 3. NUMA Multi node but with CPUs and memory from node 0.
> > 4. NUMA Multi node but with no CPUs and memory from node 0.
>
> Have you tested on something else than ppc? Each arch does the NUMA
> setup separately and this is a big mess. E.g. x86 marks even memory less
> nodes (see init_memory_less_node) as online.
>

while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
enabled/disabled on both single node and multi node machines.
However, I dont have a cpuless/memoryless x86 system.

> Honestly I have hard time to evaluate the effect of this patch. It makes
> some sense to assume all nodes offline before they get online but this
> is a land mine territory.
>
> I am also not sure what kind of problem this is going to address. You
> have mentioned numa balancing without many details.

1. On a machine with just one node with node number not being 0,
the current setup will end up showing 2 online nodes. And when there are
more than one online nodes, numa_balancing gets enabled.

Without patch
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973        <----------
numa_huge_pte_updates 4654      <----------
numa_hint_faults 19560          <----------
numa_hint_faults_local 19560    <----------
numa_pages_migrated 0

With patch
$ grep numa /proc/vmstat
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0              <----------
numa_huge_pte_updates 0         <----------
numa_hint_faults 0              <----------
numa_hint_faults_local 0        <----------
numa_pages_migrated 0

So we have a redundant page hinting numa faults which we can avoid.

2. Few people have complained about existence of this dummy node when
parsing lscpu and numactl o/p. They somehow start to think that the tools
are reporting incorrectly or the kernel is not able to recognize resources
connected to the node.
On Thu 30-04-20 12:48:20, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-04-29 14:22:11]:
>
> > On Wed 29-04-20 07:11:45, Srikar Dronamraju wrote:
> > > > >
> > > > > By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> > > > > always online.
> > > > >
> > > > > ...
> > > > >
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
> > > > >   */
> > > > >  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
> > > > >  	[N_POSSIBLE] = NODE_MASK_ALL,
> > > > > +#ifdef CONFIG_NUMA
> > > > > +	[N_ONLINE] = NODE_MASK_NONE,
> > > > > +#else
> > > > >  	[N_ONLINE] = { { [0] = 1UL } },
> > > > > -#ifndef CONFIG_NUMA
> > > > >  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
> > > > >  #ifdef CONFIG_HIGHMEM
> > > > >  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> > > >
> > > > So on all other NUMA machines, when does node 0 get marked online?
> > > >
> > > > This change means that for some time during boot, such machines will
> > > > now be running with node 0 marked as offline. What are the
> > > > implications of this? Will something break?
> > >
> > > Till the nodes are detected, marking Node 0 as online tends to be redundant.
> > > Because the system doesn't know if its a NUMA or a non-NUMA system.
> > > Once we detect the nodes, we online them immediately. Hence I don't see any
> > > side-effects or negative implications of this change.
> > >
> > > However if I am missing anything, please do let me know.
> > >
> > > From my part, I have tested this on
> > > 1. Non-NUMA Single node but CPUs and memory coming from zero node.
> > > 2. Non-NUMA Single node but CPUs and memory coming from non-zero node.
> > > 3. NUMA Multi node but with CPUs and memory from node 0.
> > > 4. NUMA Multi node but with no CPUs and memory from node 0.
> >
> > Have you tested on something else than ppc? Each arch does the NUMA
> > setup separately and this is a big mess. E.g. x86 marks even memory less
> > nodes (see init_memory_less_node) as online.
> >
>
> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> enabled/disabled on both single node and multi node machines.
> However, I dont have a cpuless/memoryless x86 system.

This should be able to emulate inside kvm, I believe.

> > Honestly I have hard time to evaluate the effect of this patch. It makes
> > some sense to assume all nodes offline before they get online but this
> > is a land mine territory.
> >
> > I am also not sure what kind of problem this is going to address. You
> > have mentioned numa balancing without many details.
>
> 1. On a machine with just one node with node number not being 0,
> the current setup will end up showing 2 online nodes. And when there are
> more than one online nodes, numa_balancing gets enabled.
>
> Without patch
> $ grep numa /proc/vmstat
> numa_hit 95179
> numa_miss 0
> numa_foreign 0
> numa_interleave 3764
> numa_local 95179
> numa_other 0
> numa_pte_updates 1206973        <----------
> numa_huge_pte_updates 4654      <----------
> numa_hint_faults 19560          <----------
> numa_hint_faults_local 19560    <----------
> numa_pages_migrated 0
>
>
> With patch
> $ grep numa /proc/vmstat
> numa_hit 322338756
> numa_miss 0
> numa_foreign 0
> numa_interleave 3790
> numa_local 322338756
> numa_other 0
> numa_pte_updates 0              <----------
> numa_huge_pte_updates 0         <----------
> numa_hint_faults 0              <----------
> numa_hint_faults_local 0        <----------
> numa_pages_migrated 0
>
> So we have a redundant page hinting numa faults which we can avoid.

interesting. Does this lead to any observable differences? Btw. it would
be really great to describe how the online state influences the numa
balancing.

> 2. Few people have complained about existence of this dummy node when
> parsing lscpu and numactl o/p. They somehow start to think that the tools
> are reporting incorrectly or the kernel is not able to recognize resources
> connected to the node.

Please be more specific.
* Michal Hocko <mhocko@kernel.org> [2020-05-04 11:37:12]:

> > >
> > > Have you tested on something else than ppc? Each arch does the NUMA
> > > setup separately and this is a big mess. E.g. x86 marks even memory less
> > > nodes (see init_memory_less_node) as online.
> > >
> >
> > while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
> > enabled/disabled on both single node and multi node machines.
> > However, I dont have a cpuless/memoryless x86 system.
>
> This should be able to emulate inside kvm, I believe.
>

I did try but somehow not able to get cpuless / memoryless node in a x86 kvm
guest. Also I am unable to see how to enable HAVE_MEMORYLESS_NODES on an x86
system.

# git grep -w HAVE_MEMORYLESS_NODES | cat
arch/ia64/Kconfig:config HAVE_MEMORYLESS_NODES
arch/powerpc/Kconfig:config HAVE_MEMORYLESS_NODES
#

I forced enabled it but it got disabled during the kernel build. Maybe I am
missing something.

> >
> > So we have a redundant page hinting numa faults which we can avoid.
>
> interesting. Does this lead to any observable differences? Btw. it would
> be really great to describe how the online state influences the numa
> balancing.
>

If numa_balancing is enabled, it has a check to see if the number of online
nodes is 1. If it is one, it disables numa_balancing; otherwise numa_balancing
stays as is.

In this case, the actual node (node nr > 0) and node 0 were marked online
without the patch.

Here are 2 sample numa programs.

numa01.sh is a set of 2 processes, each running as many threads as there are
cpus; each thread doing 50 loops on 3GB of process-shared memory operations.

numa02.sh is a single process with as many threads as there are cpus; each
thread doing 800 loops on 32MB of thread-local memory operations.

Testcase         Time:    Min      Max      Avg      StdDev
./numa01.sh      Real:    149.62   149.66   149.64   0.02
./numa01.sh      Sys:     3.21     3.71     3.46     0.25
./numa01.sh      User:    4755.13  4758.15  4756.64  1.51
./numa02.sh      Real:    24.98    25.02    25.00    0.02
./numa02.sh      Sys:     0.51     0.59     0.55     0.04
./numa02.sh      User:    790.28   790.88   790.58   0.30

Testcase         Time:    Min      Max      Avg      StdDev   %Change
./numa01.sh      Real:    149.44   149.46   149.45   0.01     0.127133%
./numa01.sh      Sys:     0.71     0.89     0.80     0.09     332.5%
./numa01.sh      User:    4754.19  4754.48  4754.33  0.15     0.0485873%
./numa02.sh      Real:    24.97    24.98    24.98    0.00     0.0800641%
./numa02.sh      Sys:     0.26     0.41     0.33     0.08     66.6667%
./numa02.sh      User:    789.75   790.28   790.01   0.27     0.072151%

numa01.sh
param                   no_patch   with_patch   %Change
-----                   --------   ----------   -------
numa_hint_faults        1131164    0            -100%
numa_hint_faults_local  1131164    0            -100%
numa_hit                213696     214244       0.256439%
numa_local              213696     214244       0.256439%
numa_pte_updates        1131294    0            -100%
pgfault                 1380845    241424       -82.5162%
pgmajfault              75         60           -20%

numa02.sh
param                   no_patch   with_patch   %Change
-----                   --------   ----------   -------
numa_hint_faults        111878     0            -100%
numa_hint_faults_local  111878     0            -100%
numa_hit                41854      43220        3.26373%
numa_local              41854      43220        3.26373%
numa_pte_updates        113926     0            -100%
pgfault                 163662     51210        -68.7099%
pgmajfault              56         52           -7.14286%

Observations:
The real time and user time actually don't change much. However the system
time changes to some extent. The reason being the number of numa hinting
faults. With the patch we are not seeing the numa hinting faults.

> > 2. Few people have complained about existence of this dummy node when
> > parsing lscpu and numactl o/p. They somehow start to think that the tools
> > are reporting incorrectly or the kernel is not able to recognize resources
> > connected to the node.
>
> Please be more specific.

Taking the below example of numactl output:

available: 2 nodes (0,7)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 7 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 7 size: 16238 MB
node 7 free: 15449 MB
node distances:
node   0   7
  0:  10  20
  7:  20  10

We know node 0 can be special, but users may not feel the same. When users
parse numactl/lscpu or the /sys directory, they find there are 2 online nodes.
They find none of the resources for a node (node 0) are available, yet it is
still online. However they find other nodes (nodes 1-6) which also don't have
resources but are not online. So they tend to think the kernel has been unable
to online some of the resources, or that the resources have gone bad.

Please do note that on hypervisors like PowerVM, the admins don't have
control over which nodes the resources are allocated.
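The auto-enable check described above (numa_balancing stays enabled only when more than one node is online) can be sketched roughly as below. This is an illustration of the described behaviour rather than the exact upstream call site; num_online_nodes() is the real nodemask helper and set_numabalancing_state() is the real scheduler helper under CONFIG_NUMA_BALANCING:

#include <linux/nodemask.h>

/*
 * set_numabalancing_state() is declared in the scheduler headers
 * (CONFIG_NUMA_BALANCING); repeated here only to keep the sketch
 * self-contained.
 */
extern void set_numabalancing_state(bool enabled);

/*
 * Hypothetical wrapper: automatic NUMA balancing is only worth running
 * when more than one node is online.  With the dummy node 0 kept online,
 * a machine with a single real node reports two online nodes, so the
 * balancer and its page-hinting faults stay enabled; once only the real
 * node is online, the check turns them off, which matches the vmstat
 * numbers quoted earlier in the thread.
 */
static void example_update_numabalancing(void)
{
	set_numabalancing_state(num_online_nodes() > 1);
}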
On 08.05.20 15:03, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-05-04 11:37:12]:
>
>>>>
>>>> Have you tested on something else than ppc? Each arch does the NUMA
>>>> setup separately and this is a big mess. E.g. x86 marks even memory less
>>>> nodes (see init_memory_less_node) as online.
>>>>
>>>
>>> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
>>> enabled/disabled on both single node and multi node machines.
>>> However, I dont have a cpuless/memoryless x86 system.
>>
>> This should be able to emulate inside kvm, I believe.
>>
>
> I did try but somehow not able to get cpuless / memoryless node in a x86 kvm
> guest.

I use the following

#! /bin/bash
sudo x86_64-softmmu/qemu-system-x86_64 \
--enable-kvm \
-m 4G,maxmem=20G,slots=2 \
-smp sockets=2,cores=2 \
-numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \
-kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
-append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
-initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
-machine pc,nvdimm \
-nographic \
-nodefaults \
-chardev stdio,id=serial \
-device isa-serial,chardev=serial \
-chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
-mon chardev=monitor,mode=readline

to get a cpu-less and memory-less node 1. Never tried with node 0.
On 08.05.20 15:39, David Hildenbrand wrote:
> On 08.05.20 15:03, Srikar Dronamraju wrote:
>> * Michal Hocko <mhocko@kernel.org> [2020-05-04 11:37:12]:
>>
>>>>>
>>>>> Have you tested on something else than ppc? Each arch does the NUMA
>>>>> setup separately and this is a big mess. E.g. x86 marks even memory less
>>>>> nodes (see init_memory_less_node) as online.
>>>>>
>>>>
>>>> while I have predominantly tested on ppc, I did test on X86 with CONFIG_NUMA
>>>> enabled/disabled on both single node and multi node machines.
>>>> However, I dont have a cpuless/memoryless x86 system.
>>>
>>> This should be able to emulate inside kvm, I believe.
>>>
>>
>> I did try but somehow not able to get cpuless / memoryless node in a x86 kvm
>> guest.
>
> I use the following
>
> #! /bin/bash
> sudo x86_64-softmmu/qemu-system-x86_64 \
> --enable-kvm \
> -m 4G,maxmem=20G,slots=2 \
> -smp sockets=2,cores=2 \
> -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \

Sorry, this line has to be

-numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G \

> -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
> -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
> -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
> -machine pc,nvdimm \
> -nographic \
> -nodefaults \
> -chardev stdio,id=serial \
> -device isa-serial,chardev=serial \
> -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
> -mon chardev=monitor,mode=readline
>
> to get a cpu-less and memory-less node 1. Never tried with node 0.
>
* David Hildenbrand <david@redhat.com> [2020-05-08 15:42:12]:

Hi David,

Thanks for the steps to tryout.

> >
> > #! /bin/bash
> > sudo x86_64-softmmu/qemu-system-x86_64 \
> > --enable-kvm \
> > -m 4G,maxmem=20G,slots=2 \
> > -smp sockets=2,cores=2 \
> > -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \
>
> Sorry, this line has to be
>
> -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G \
>
> > -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
> > -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
> > -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
> > -machine pc,nvdimm \
> > -nographic \
> > -nodefaults \
> > -chardev stdio,id=serial \
> > -device isa-serial,chardev=serial \
> > -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
> > -mon chardev=monitor,mode=readline
> >
> > to get a cpu-less and memory-less node 1. Never tried with node 0.
> >

I tried

qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2

and the resulting guest was.

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3927 MB
node 0 free: 3316 MB
node distances:
node   0
  0:  10

[root@localhost ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               46
Model name:          Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
Stepping:            6
CPU MHz:             2260.986
BogoMIPS:            4521.97
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities

[root@localhost ~]# cat /sys/devices/system/node/online
0
[root@localhost ~]# cat /sys/devices/system/node/possible
0-1

---------------------------------------------------------------------------------

I also tried

qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=1,cpus=0-3,mem=4G -numa node,nodeid=0,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2

and the resulting guest was.

[root@localhost ~]# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 3927 MB
node 0 free: 3316 MB
node distances:
node   0
  0:  10

[root@localhost ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  2
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               46
Model name:          Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
Stepping:            6
CPU MHz:             2260.986
BogoMIPS:            4521.97
Virtualization:      VT-x
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities

[root@localhost ~]# cat /sys/devices/system/node/online
0
[root@localhost ~]# cat /sys/devices/system/node/possible
0-1

Even without my patch, both the combinations, I am still unable to see a
cpuless, memoryless node being online. And the interesting part being even
if I mark node 0 as cpuless,memoryless and node 1 as actual node, the system
somewhere marks node 0 as the actual node.

>
> David / dhildenb
>
On 11.05.20 19:47, Srikar Dronamraju wrote:
> * David Hildenbrand <david@redhat.com> [2020-05-08 15:42:12]:
>
> Hi David,
>
> Thanks for the steps to tryout.
>
>>>
>>> #! /bin/bash
>>> sudo x86_64-softmmu/qemu-system-x86_64 \
>>> --enable-kvm \
>>> -m 4G,maxmem=20G,slots=2 \
>>> -smp sockets=2,cores=2 \
>>> -numa node,nodeid=0,cpus=0-1,mem=4G -numa node,nodeid=1,cpus=2-3,mem=0G \
>>
>> Sorry, this line has to be
>>
>> -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G \
>>
>>> -kernel /home/dhildenb/git/linux/arch/x86_64/boot/bzImage \
>>> -append "console=ttyS0 rd.shell rd.luks=0 rd.lvm=0 rd.md=0 rd.dm=0" \
>>> -initrd /boot/initramfs-5.2.8-200.fc30.x86_64.img \
>>> -machine pc,nvdimm \
>>> -nographic \
>>> -nodefaults \
>>> -chardev stdio,id=serial \
>>> -device isa-serial,chardev=serial \
>>> -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
>>> -mon chardev=monitor,mode=readline
>>>
>>> to get a cpu-less and memory-less node 1. Never tried with node 0.
>>>
>
> I tried
>
> qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=0,cpus=0-3,mem=4G -numa node,nodeid=1,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2
>
> and the resulting guest was.
>
> [root@localhost ~]# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3
> node 0 size: 3927 MB
> node 0 free: 3316 MB
> node distances:
> node   0
>   0:  10
>
> [root@localhost ~]# lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> Address sizes:       40 bits physical, 48 bits virtual
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  1
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               46
> Model name:          Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
> Stepping:            6
> CPU MHz:             2260.986
> BogoMIPS:            4521.97
> Virtualization:      VT-x
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            4096K
> L3 cache:            16384K
> NUMA node0 CPU(s):   0-3
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities
>
> [root@localhost ~]# cat /sys/devices/system/node/online
> 0
> [root@localhost ~]# cat /sys/devices/system/node/possible
> 0-1
>
> ---------------------------------------------------------------------------------
>
> I also tried
>
> qemu-system-x86_64 -enable-kvm -m 4G,maxmem=20G,slots=2 -smp sockets=2,cores=2 -cpu host -numa node,nodeid=1,cpus=0-3,mem=4G -numa node,nodeid=0,mem=0G -vga none -nographic -serial mon:stdio /home/srikar/fedora.qcow2
>
> and the resulting guest was.
>
> [root@localhost ~]# numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3
> node 0 size: 3927 MB
> node 0 free: 3316 MB
> node distances:
> node   0
>   0:  10
>
> [root@localhost ~]# lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> Address sizes:       40 bits physical, 48 bits virtual
> CPU(s):              4
> On-line CPU(s) list: 0-3
> Thread(s) per core:  1
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               46
> Model name:          Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
> Stepping:            6
> CPU MHz:             2260.986
> BogoMIPS:            4521.97
> Virtualization:      VT-x
> Hypervisor vendor:   KVM
> Virtualization type: full
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            4096K
> L3 cache:            16384K
> NUMA node0 CPU(s):   0-3
> Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni vmx ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid tsc_adjust arat umip arch_capabilities
>
> [root@localhost ~]# cat /sys/devices/system/node/online
> 0
> [root@localhost ~]# cat /sys/devices/system/node/possible
> 0-1
>
> Even without my patch, both the combinations, I am still unable to see a
> cpuless, memoryless node being online. And the interesting part being even

Yeah, I think on x86, all memory-less and cpu-less nodes are offline as
default. Especially when hotunplugging cpus/memory, we set them offline
as well.

But as Michal mentioned, the node handling code is complicated and
differs between various architectures.

> if I mark node 0 as cpuless,memoryless and node 1 as actual node, the system
> somewhere marks node 0 as the actual node.

Is the kernel maybe mapping PXM 1 to node 0 in that case, because it
always requires node 0 to be online/contain memory? Would be interesting
what happens if you hotplug a DIMM to (QEMU )node 0 - if PXM 0 will be
mapped to node 1 then as well.
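For reference, with the "-m 4G,maxmem=20G,slots=2" sizing and the monitor socket from the script earlier in the thread, the DIMM hotplug David suggests can be attempted from the QEMU monitor roughly like this (an illustrative sketch; the object/device IDs are arbitrary, and node=0 refers to QEMU's NUMA node 0):

(qemu) object_add memory-backend-ram,id=hotmem0,size=1G
(qemu) device_add pc-dimm,id=dimm0,memdev=hotmem0,node=0

If the guest ends up accounting that memory to a Linux node other than 0, that would support the PXM-remapping theory above.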
* David Hildenbrand <david@redhat.com> [2020-05-12 09:49:05]:

> On 11.05.20 19:47, Srikar Dronamraju wrote:
> > * David Hildenbrand <david@redhat.com> [2020-05-08 15:42:12]:
> >
> >
> > [root@localhost ~]# cat /sys/devices/system/node/online
> > 0
> > [root@localhost ~]# cat /sys/devices/system/node/possible
> > 0-1
> >
> > Even without my patch, both the combinations, I am still unable to see a
> > cpuless, memoryless node being online. And the interesting part being even
>
> Yeah, I think on x86, all memory-less and cpu-less nodes are offline as
> default. Especially when hotunplugging cpus/memory, we set them offline
> as well.

I also came to the same conclusion that we may not have a cpuless,memoryless
node on x86.

>
> But as Michal mentioned, the node handling code is complicated and
> differs between various architectures.
>

I do agree that node handling code differs across various architectures and
is quite complicated.

> > if I mark node 0 as cpuless,memoryless and node 1 as actual node, the system
> > somewhere marks node 0 as the actual node.
>
> Is the kernel maybe mapping PXM 1 to node 0 in that case, because it
> always requires node 0 to be online/contain memory? Would be interesting
> what happens if you hotplug a DIMM to (QEMU )node 0 - if PXM 0 will be
> mapped to node 1 then as well.
>

Satheesh Rajendra had tried with cpu hotplug on a similar setup and we found
that it crashes the x86 system.

reference: https://bugzilla.kernel.org/show_bug.cgi?id=202187

Even if we were able to hotplug 1 DIMM memory into node 1, that would no
more be a memoryless node.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69827d4fa052..03b89592af04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
Currently a Linux kernel with CONFIG_NUMA, on a system with multiple
possible nodes, marks node 0 as online at boot. However, in practice there
are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node, which is cpuless and
memoryless, can confuse users/scripts looking at the output of lscpu /
numactl.

By marking N_ONLINE as NODE_MASK_NONE, let us stop assuming that node 0 is
always online.

v5.7-rc3
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node   0   2
  0:  10  20
  2:  20  10

proc and sys files
------------------
/sys/devices/system/node/online:            0,2
/proc/sys/kernel/numa_balancing:            1
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

v5.7-rc3 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node   2
  2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa:
Set numa_node for all possible cpus") and commit ("powerpc/numa: Prefer
node id queried from vphn"). Without those 2 commits, a Powerpc system
might crash.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
- Rebased to v5.7-rc3
- Updated the changelog.

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
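The Note above is about stale cpu_to_node() results. A hedged sketch of why an offline node 0 combined with a possible-but-not-present CPU still reporting node 0 is dangerous (illustrative only, not a specific call site in the tree; cpu_to_node() and NODE_DATA() are the real interfaces):

#include <linux/topology.h>
#include <linux/mmzone.h>

/*
 * Illustrative only: if a possible-but-not-yet-present CPU still reports
 * node 0 via cpu_to_node(), per-node lookups done on its behalf end up
 * dereferencing NODE_DATA(0), which is never initialised once node 0
 * stays offline; hence the prerequisite powerpc/numa commits.
 */
static struct pglist_data *example_pgdat_for_cpu(int cpu)	/* hypothetical */
{
	int nid = cpu_to_node(cpu);	/* may still report 0 without those fixes */

	return NODE_DATA(nid);		/* bogus pgdat if node 0 was never onlined */
}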