Message ID | 20200311110237.5731-4-srikar@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Series | Offline memoryless cpuless node 0 |

On Wed, 11 Mar 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot. However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Would it not be better and simpler to require that node 0 always has
memory (and processors)? A minimum operational set?

We can dynamically number the nodes right? So just make sure that the
firmware properly creates memory on node 0?

On Sun 15-03-20 14:20:05, Cristopher Lameter wrote:
> On Wed, 11 Mar 2020, Srikar Dronamraju wrote:
>
> > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > possible nodes, marks node 0 as online at boot. However in practice,
> > there are systems which have node 0 as memoryless and cpuless.
>
> Would it not be better and simpler to require that node 0 always has
> memory (and processors)? A minimum operational set?

I do not think you can simply ignore the reality. I cannot say that I am
a fan of memoryless/cpuless numa configurations but they are a sad
reality of different LPAR configurations. We have to deal with them.
Besides that I do not really see any strong technical arguments to lack
a support for those crippled configurations. We do have zonelists that
allow to do reasonable decisions on memoryless nodes. So no, I do not
think that this is a viable approach.

> We can dynamically number the nodes right? So just make sure that the
> firmware properly creates memory on node 0?

Are you suggesting that the OS would renumber NUMA nodes coming
from FW just to satisfy node 0 existence? If yes then I believe this is
really a bad idea because it would make HW/LPAR configuration matching
to the resulting memory layout really hard to follow.

* Michal Hocko <mhocko@kernel.org> [2020-03-16 09:54:25]:

> On Sun 15-03-20 14:20:05, Cristopher Lameter wrote:
> > On Wed, 11 Mar 2020, Srikar Dronamraju wrote:
> >
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot. However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> >
> > Would it not be better and simpler to require that node 0 always has
> > memory (and processors)? A minimum operational set?
>
> I do not think you can simply ignore the reality. I cannot say that I am
> a fan of memoryless/cpuless numa configurations but they are a sad
> reality of different LPAR configurations. We have to deal with them.
> Besides that I do not really see any strong technical arguments to lack
> a support for those crippled configurations. We do have zonelists that
> allow to do reasonable decisions on memoryless nodes. So no, I do not
> think that this is a viable approach.
>

I agree with Michal: the kernel should accept the reality and work with
different LPAR configurations.

> > We can dynamically number the nodes right? So just make sure that the
> > firmware properly creates memory on node 0?
>
> Are you suggesting that the OS would renumber NUMA nodes coming
> from FW just to satisfy node 0 existence? If yes then I believe this is
> really a bad idea because it would make HW/LPAR configuration matching
> to the resulting memory layout really hard to follow.
>
> --
> Michal Hocko
> SUSE Labs

Michal, Vlastimil, Christoph and others, do you have any more comments,
suggestions or any other feedback? If not, can you please add your
Reviewed-by, Acked-by, etc.?

On Mon, 16 Mar 2020, Michal Hocko wrote:

> > We can dynamically number the nodes right? So just make sure that the
> > firmware properly creates memory on node 0?
>
> Are you suggesting that the OS would renumber NUMA nodes coming
> from FW just to satisfy node 0 existence? If yes then I believe this is
> really a bad idea because it would make HW/LPAR configuration matching
> to the resulting memory layout really hard to follow.

NUMA nodes are created by the OS based on information provided by the
firmware. Either the FW would need to ensure that a viable node 0 exists
or the bootstrap arch code could setup things to the same effect.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb75..68e635f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -116,8 +116,10 @@ struct pcpu_drain {
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },

Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node, which is both
cpuless and memoryless, can confuse users/scripts looking at the output of
lscpu / numactl.

Let's stop assuming that node 0 is always online.

v5.6-rc4
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node   0   2
  0:  10  20
  2:  20  10

proc and sys files
------------------
/sys/devices/system/node/online:            0,2
/proc/sys/kernel/numa_balancing:            1
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

v5.6-rc4 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node   2
  2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)