Message ID | 20200624092846.9194-4-srikar@linux.vnet.ibm.com
---|---
State | New, archived
Series | Offline memoryless cpuless node 0
On Wed, 24 Jun 2020, Srikar Dronamraju wrote: > Currently Linux kernel with CONFIG_NUMA on a system with multiple > possible nodes, marks node 0 as online at boot. However in practice, > there are systems which have node 0 as memoryless and cpuless. Maybe add something to explain why you are not simply mapping the existing memory to NUMA node 0, which is after all just a numbering scheme used by the kernel and can be used arbitrarily? This could be seen more as a bug in the arch code during the setup of NUMA nodes. The two nodes are created by the firmware / bootstrap code after all. Just do not do it?
* Christopher Lameter <cl@linux.com> [2020-06-29 14:58:40]: > On Wed, 24 Jun 2020, Srikar Dronamraju wrote: > > > Currently Linux kernel with CONFIG_NUMA on a system with multiple > > possible nodes, marks node 0 as online at boot. However in practice, > > there are systems which have node 0 as memoryless and cpuless. > > Maybe add something to explain why you are not simply mapping the > existing memory to NUMA node 0 which is after all just a numbering scheme > used by the kernel and can be used arbitrarily? > I thought Michal Hocko already gave a clear picture on why mapping is a bad idea. https://lore.kernel.org/lkml/20200316085425.GB11482@dhcp22.suse.cz/t/#u Are you suggesting that we add that as part of the changelog? > This could be seen more as a bug in the arch code during the setup of NUMA > nodes. The two nodes are created by the firmware / bootstrap code after > all. Just do not do it? > - The arch/setup code in powerpc is not onlining these nodes. - Later on cpus/memory in node 0 can be onlined. - Firmware, in this case PHYP, is an independent code base.
On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote:
> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot. However in practice,
> there are systems which have node 0 as memoryless and cpuless.
>
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
>
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
>
> v5.8-rc2
> available: 2 nodes (0,2)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1 2 3 4 5 6 7
> node 2 size: 32625 MB
> node 2 free: 31490 MB
> node distances:
> node 0 2
> 0: 10 20
> 2: 20 10
>
> proc and sys files
> ------------------
> /sys/devices/system/node/online: 0,2
> /proc/sys/kernel/numa_balancing: 1
> /sys/devices/system/node/has_cpu: 2
> /sys/devices/system/node/has_memory: 2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible: 0-31
>
> v5.8-rc2 + patch
> ------------------
> available: 1 nodes (2)
> node 2 cpus: 0 1 2 3 4 5 6 7
> node 2 size: 32625 MB
> node 2 free: 31487 MB
> node distances:
> node 2
> 2: 10
>
> proc and sys files
> ------------------
> /sys/devices/system/node/online: 2
> /proc/sys/kernel/numa_balancing: 0
> /sys/devices/system/node/has_cpu: 2
> /sys/devices/system/node/has_memory: 2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible: 0-31
>
> Note: On Powerpc, cpu_to_node of possible but not present cpus would
> previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
> numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
> queried from vphn"). Without the 2 commits, Powerpc system might crash.
>
> 1. User space applications like Numactl, lscpu, that parse the sysfs tend to
> believe there is an extra online node. This tends to confuse users and
> applications. Other user space applications start believing that system was
> not able to use all the resources (i.e missing resources) or the system was
> not setup correctly.
>
> 2. Also existence of dummy node also leads to inconsistent information. The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
>
> 3. When the dummy node is present, single node non-Numa systems end up showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

I have to say that I dislike the node online/offline state and directly
exporting that to the userspace. Users should only care whether the node
has memory/cpus. Numa nodes can be online without any memory. Just
offline all the present memory blocks but do not physically hot remove
them and you are in the same situation. If users are confused by an
output of tools like numactl -H then those could be updated and hide
nodes without any memory&cpus.

The autonuma problem sounds interesting but again this patch doesn't
really solve the underlying problem because I strongly suspect that the
problem is still there when a numa node gets all its memory offline as
mentioned above.

While I completely agree that making node 0 special is wrong, I have
still hard time to review this very simply looking patch because all the
numa initialization is so spread around that this might just blow up
at unexpected places. IIRC we have discussed testing in the previous
version and David has provided a way to emulate these configurations
on x86. Did you manage to use those instructions for additional testing
on other than ppc architectures?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v4:->v5:
> - rebased to v5.8-rc2
> link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u
>
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u
>
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..5187664558e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>  	[N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,
> +#else
>  	[N_ONLINE] = { { [0] = 1UL } },
> -#ifndef CONFIG_NUMA
>  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
>  #ifdef CONFIG_HIGHMEM
>  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> --
> 2.18.1
>
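For readers skimming the diff quoted above, here is a minimal kernel-context sketch of the semantic change (the helper below is illustrative, not part of the patch): with CONFIG_NUMA the online mask now starts empty, so a node becomes online only once platform/arch setup code explicitly registers it, rather than node 0 being unconditionally pre-set.

```c
#include <linux/init.h>
#include <linux/nodemask.h>

/*
 * Before the change: node_states[N_ONLINE] statically has bit 0 set,
 * so node_online(0) is true even when firmware never describes node 0.
 * After the change: the mask starts as NODE_MASK_NONE and only nodes
 * that setup code really discovers get onlined, e.g.:
 */
static void __init example_register_discovered_node(int nid)
{
	if (!node_online(nid))
		node_set_online(nid);	/* sets the node's bit in node_states[N_ONLINE] */
}
```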
* Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]: > > > > > 2. Also existence of dummy node also leads to inconsistent information. The > > number of online nodes is inconsistent with the information in the > > device-tree and resource-dump > > > > 3. When the dummy node is present, single node non-Numa systems end up showing > > up as NUMA systems and numa_balancing gets enabled. This will mean we take > > the hit from the unnecessary numa hinting faults. > > I have to say that I dislike the node online/offline state and directly > exporting that to the userspace. Users should only care whether the node > has memory/cpus. Numa nodes can be online without any memory. Just > offline all the present memory blocks but do not physically hot remove > them and you are in the same situation. If users are confused by an > output of tools like numactl -H then those could be updated and hide > nodes without any memory&cpus. > > The autonuma problem sounds interesting but again this patch doesn't > really solve the underlying problem because I strongly suspect that the > problem is still there when a numa node gets all its memory offline as > mentioned above. > > While I completely agree that making node 0 special is wrong, I have > still hard time to review this very simply looking patch because all the > numa initialization is so spread around that this might just blow up > at unexpected places. IIRC we have discussed testing in the previous > version and David has provided a way to emulate these configurations > on x86. Did you manage to use those instruction for additional testing > on other than ppc architectures? > I have tried all the steps that David mentioned and reported back at https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u As a summary, David's steps are still not creating a memoryless/cpuless on x86 VM. I have tried booting with Numa/non-numa on all the x86 machines that I could get to.
On 01.07.20 12:04, Srikar Dronamraju wrote: > * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]: > >> >>> >>> 2. Also existence of dummy node also leads to inconsistent information. The >>> number of online nodes is inconsistent with the information in the >>> device-tree and resource-dump >>> >>> 3. When the dummy node is present, single node non-Numa systems end up showing >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take >>> the hit from the unnecessary numa hinting faults. >> >> I have to say that I dislike the node online/offline state and directly >> exporting that to the userspace. Users should only care whether the node >> has memory/cpus. Numa nodes can be online without any memory. Just >> offline all the present memory blocks but do not physically hot remove >> them and you are in the same situation. If users are confused by an >> output of tools like numactl -H then those could be updated and hide >> nodes without any memory&cpus. >> >> The autonuma problem sounds interesting but again this patch doesn't >> really solve the underlying problem because I strongly suspect that the >> problem is still there when a numa node gets all its memory offline as >> mentioned above. >> >> While I completely agree that making node 0 special is wrong, I have >> still hard time to review this very simply looking patch because all the >> numa initialization is so spread around that this might just blow up >> at unexpected places. IIRC we have discussed testing in the previous >> version and David has provided a way to emulate these configurations >> on x86. Did you manage to use those instruction for additional testing >> on other than ppc architectures? >> > > I have tried all the steps that David mentioned and reported back at > https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u > > As a summary, David's steps are still not creating a memoryless/cpuless on > x86 VM. Now, that is wrong. You get a memoryless/cpuless node, which is *not online*. Once you hotplug some memory, it will switch online. Once you remove memory, it will switch back offline.
* David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]: > On 01.07.20 12:04, Srikar Dronamraju wrote: > > * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]: > > > >> > >>> > >>> 2. Also existence of dummy node also leads to inconsistent information. The > >>> number of online nodes is inconsistent with the information in the > >>> device-tree and resource-dump > >>> > >>> 3. When the dummy node is present, single node non-Numa systems end up showing > >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take > >>> the hit from the unnecessary numa hinting faults. > >> > >> I have to say that I dislike the node online/offline state and directly > >> exporting that to the userspace. Users should only care whether the node > >> has memory/cpus. Numa nodes can be online without any memory. Just > >> offline all the present memory blocks but do not physically hot remove > >> them and you are in the same situation. If users are confused by an > >> output of tools like numactl -H then those could be updated and hide > >> nodes without any memory&cpus. > >> > >> The autonuma problem sounds interesting but again this patch doesn't > >> really solve the underlying problem because I strongly suspect that the > >> problem is still there when a numa node gets all its memory offline as > >> mentioned above. > >> > >> While I completely agree that making node 0 special is wrong, I have > >> still hard time to review this very simply looking patch because all the > >> numa initialization is so spread around that this might just blow up > >> at unexpected places. IIRC we have discussed testing in the previous > >> version and David has provided a way to emulate these configurations > >> on x86. Did you manage to use those instruction for additional testing > >> on other than ppc architectures? > >> > > > > I have tried all the steps that David mentioned and reported back at > > https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u > > > > As a summary, David's steps are still not creating a memoryless/cpuless on > > x86 VM. > > Now, that is wrong. You get a memoryless/cpuless node, which is *not > online*. Once you hotplug some memory, it will switch online. Once you > remove memory, it will switch back offline. > Let me clarify, we are looking for a node 0 which is cpuless/memoryless at boot. The code in question tries to handle a cpuless/memoryless node 0 at boot. With the steps that you gave the node 0 was always populated, node 1 or some other node would be memoryless/cpuless and offline. But that should have no impact by patch. I don't see how adding/hotplugging/removing memory to a node after boot is going to affect the changes that I have made. Please do correct me if I have misunderstood.
On 01.07.20 13:01, Srikar Dronamraju wrote: > * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]: > >> On 01.07.20 12:04, Srikar Dronamraju wrote: >>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]: >>> >>>> >>>>> >>>>> 2. Also existence of dummy node also leads to inconsistent information. The >>>>> number of online nodes is inconsistent with the information in the >>>>> device-tree and resource-dump >>>>> >>>>> 3. When the dummy node is present, single node non-Numa systems end up showing >>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take >>>>> the hit from the unnecessary numa hinting faults. >>>> >>>> I have to say that I dislike the node online/offline state and directly >>>> exporting that to the userspace. Users should only care whether the node >>>> has memory/cpus. Numa nodes can be online without any memory. Just >>>> offline all the present memory blocks but do not physically hot remove >>>> them and you are in the same situation. If users are confused by an >>>> output of tools like numactl -H then those could be updated and hide >>>> nodes without any memory&cpus. >>>> >>>> The autonuma problem sounds interesting but again this patch doesn't >>>> really solve the underlying problem because I strongly suspect that the >>>> problem is still there when a numa node gets all its memory offline as >>>> mentioned above. >>>> >>>> While I completely agree that making node 0 special is wrong, I have >>>> still hard time to review this very simply looking patch because all the >>>> numa initialization is so spread around that this might just blow up >>>> at unexpected places. IIRC we have discussed testing in the previous >>>> version and David has provided a way to emulate these configurations >>>> on x86. Did you manage to use those instruction for additional testing >>>> on other than ppc architectures? >>>> >>> >>> I have tried all the steps that David mentioned and reported back at >>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u >>> >>> As a summary, David's steps are still not creating a memoryless/cpuless on >>> x86 VM. >> >> Now, that is wrong. You get a memoryless/cpuless node, which is *not >> online*. Once you hotplug some memory, it will switch online. Once you >> remove memory, it will switch back offline. >> > > Let me clarify, we are looking for a node 0 which is cpuless/memoryless at > boot. The code in question tries to handle a cpuless/memoryless node 0 at > boot. I was just correcting your statement, because it was wrong. Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither have CPUs nor memory. That would imply that we can, in fact, never have node 0 offline during boot.
On 01.07.20 13:06, David Hildenbrand wrote: > On 01.07.20 13:01, Srikar Dronamraju wrote: >> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]: >> >>> On 01.07.20 12:04, Srikar Dronamraju wrote: >>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]: >>>> >>>>> >>>>>> >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The >>>>>> number of online nodes is inconsistent with the information in the >>>>>> device-tree and resource-dump >>>>>> >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take >>>>>> the hit from the unnecessary numa hinting faults. >>>>> >>>>> I have to say that I dislike the node online/offline state and directly >>>>> exporting that to the userspace. Users should only care whether the node >>>>> has memory/cpus. Numa nodes can be online without any memory. Just >>>>> offline all the present memory blocks but do not physically hot remove >>>>> them and you are in the same situation. If users are confused by an >>>>> output of tools like numactl -H then those could be updated and hide >>>>> nodes without any memory&cpus. >>>>> >>>>> The autonuma problem sounds interesting but again this patch doesn't >>>>> really solve the underlying problem because I strongly suspect that the >>>>> problem is still there when a numa node gets all its memory offline as >>>>> mentioned above. >>>>> >>>>> While I completely agree that making node 0 special is wrong, I have >>>>> still hard time to review this very simply looking patch because all the >>>>> numa initialization is so spread around that this might just blow up >>>>> at unexpected places. IIRC we have discussed testing in the previous >>>>> version and David has provided a way to emulate these configurations >>>>> on x86. Did you manage to use those instruction for additional testing >>>>> on other than ppc architectures? >>>>> >>>> >>>> I have tried all the steps that David mentioned and reported back at >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u >>>> >>>> As a summary, David's steps are still not creating a memoryless/cpuless on >>>> x86 VM. >>> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not >>> online*. Once you hotplug some memory, it will switch online. Once you >>> remove memory, it will switch back offline. >>> >> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at >> boot. The code in question tries to handle a cpuless/memoryless node 0 at >> boot. > > I was just correcting your statement, because it was wrong. > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither > have CPUs nor memory. That would imply that we can, in fact, never have > node 0 offline during boot. > Yep, looks like it. [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > On 01.07.20 13:06, David Hildenbrand wrote: > > On 01.07.20 13:01, Srikar Dronamraju wrote: > >> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]: > >> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote: > >>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]: > >>>> > >>>>> > >>>>>> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The > >>>>>> number of online nodes is inconsistent with the information in the > >>>>>> device-tree and resource-dump > >>>>>> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take > >>>>>> the hit from the unnecessary numa hinting faults. > >>>>> > >>>>> I have to say that I dislike the node online/offline state and directly > >>>>> exporting that to the userspace. Users should only care whether the node > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just > >>>>> offline all the present memory blocks but do not physically hot remove > >>>>> them and you are in the same situation. If users are confused by an > >>>>> output of tools like numactl -H then those could be updated and hide > >>>>> nodes without any memory&cpus. > >>>>> > >>>>> The autonuma problem sounds interesting but again this patch doesn't > >>>>> really solve the underlying problem because I strongly suspect that the > >>>>> problem is still there when a numa node gets all its memory offline as > >>>>> mentioned above. I would really appreciate a feedback to these two as well. > >>>>> While I completely agree that making node 0 special is wrong, I have > >>>>> still hard time to review this very simply looking patch because all the > >>>>> numa initialization is so spread around that this might just blow up > >>>>> at unexpected places. IIRC we have discussed testing in the previous > >>>>> version and David has provided a way to emulate these configurations > >>>>> on x86. Did you manage to use those instruction for additional testing > >>>>> on other than ppc architectures? > >>>>> > >>>> > >>>> I have tried all the steps that David mentioned and reported back at > >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u > >>>> > >>>> As a summary, David's steps are still not creating a memoryless/cpuless on > >>>> x86 VM. > >>> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not > >>> online*. Once you hotplug some memory, it will switch online. Once you > >>> remove memory, it will switch back offline. > >>> > >> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at > >> boot. The code in question tries to handle a cpuless/memoryless node 0 at > >> boot. > > > > I was just correcting your statement, because it was wrong. > > > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither > > have CPUs nor memory. That would imply that we can, in fact, never have > > node 0 offline during boot. > > > > Yep, looks like it. > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] This begs a question whether ppc can do the same thing? 
I would swear that we've had an x86 system with node 0 but I cannot really find it and it is possible that it was not x86 after all...
On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote: > * Christopher Lameter <cl@linux.com> [2020-06-29 14:58:40]: > > > On Wed, 24 Jun 2020, Srikar Dronamraju wrote: > > > > > Currently Linux kernel with CONFIG_NUMA on a system with multiple > > > possible nodes, marks node 0 as online at boot. However in practice, > > > there are systems which have node 0 as memoryless and cpuless. > > > > Maybe add something to explain why you are not simply mapping the > > existing memory to NUMA node 0 which is after all just a numbering scheme > > used by the kernel and can be used arbitrarily? > > > > I thought Michal Hocko already gave a clear picture on why mapping is a bad > idea. https://lore.kernel.org/lkml/20200316085425.GB11482@dhcp22.suse.cz/t/#u > Are you suggesting that we add that as part of the changelog? Well, I was not aware x86 already does renumber. So there is a certain precedent. As I've said I do not really like that but this is what already is happening. If renumbering is not an option then just handle that in the ppc code explicitly. A generic solution would be preferable of course but as I've said it is really hard to check for correctness and potential subtle issues.
* Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]: > > >>>>>> > > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The > > >>>>>> number of online nodes is inconsistent with the information in the > > >>>>>> device-tree and resource-dump > > >>>>>> > > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take > > >>>>>> the hit from the unnecessary numa hinting faults. > > >>>>> > > >>>>> I have to say that I dislike the node online/offline state and directly > > >>>>> exporting that to the userspace. Users should only care whether the node > > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just > > >>>>> offline all the present memory blocks but do not physically hot remove > > >>>>> them and you are in the same situation. If users are confused by an > > >>>>> output of tools like numactl -H then those could be updated and hide > > >>>>> nodes without any memory&cpus. > > >>>>> > > >>>>> The autonuma problem sounds interesting but again this patch doesn't > > >>>>> really solve the underlying problem because I strongly suspect that the > > >>>>> problem is still there when a numa node gets all its memory offline as > > >>>>> mentioned above. > > I would really appreciate a feedback to these two as well. 1. It's not just numactl that's to be fixed but all tools/utilities that depend on /sys/devices/system/node/online. Are we saying to not rely/believe in the output given by the kernel but do further verification? Also how would the user space differentiate between the case where the Kernel missed marking a node as offline and the case where the memory was offlined on a cpuless node but the node wasn't offlined? 2. Regarding the autonuma, the case of offline memory is user/admin driven, so if there is a performance hit, it's something that's driven by user/admin actions. Also how often do we see users offline the complete memory of a cpuless node on a 2 node system? > > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] > > This begs a question whether ppc can do the same thing? Certainly ppc can be made to adapt to this situation but that would be a workaround. Do we have a reason why we think node 0 is unique and special? If yes, can we document it so that in the future people also know why we consider node 0 to be special. I do understand the *fear of the unknown* but when we are unable to theoretically or practically come up with a case, then it is probably better to hit the situation to understand what that unknown is. > I would swear that we've had an x86 system with node 0 but I cannot really find it and it is possible that it was not x86 after all...
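The user-space verification debated here only needs the sysfs files already quoted in the changelog; a minimal illustrative sketch (not a proposed numactl change) of reading the node masks a tool could key off instead of the online mask:

```c
#include <stdio.h>

/* Dump the /sys/devices/system/node masks quoted earlier in the thread. */
static void dump_node_mask(const char *name)
{
	char path[128], buf[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/devices/system/node/%s", name);
	f = fopen(path, "r");
	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("%-12s: %s", name, buf);	/* e.g. "has_memory  : 2" */
	fclose(f);
}

int main(void)
{
	dump_node_mask("online");
	dump_node_mask("has_cpu");
	dump_node_mask("has_memory");
	return 0;
}
```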
On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote: > * Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]: > > > > >>>>>> > > > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The > > > >>>>>> number of online nodes is inconsistent with the information in the > > > >>>>>> device-tree and resource-dump > > > >>>>>> > > > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing > > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take > > > >>>>>> the hit from the unnecessary numa hinting faults. > > > >>>>> > > > >>>>> I have to say that I dislike the node online/offline state and directly > > > >>>>> exporting that to the userspace. Users should only care whether the node > > > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just > > > >>>>> offline all the present memory blocks but do not physically hot remove > > > >>>>> them and you are in the same situation. If users are confused by an > > > >>>>> output of tools like numactl -H then those could be updated and hide > > > >>>>> nodes without any memory&cpus. > > > >>>>> > > > >>>>> The autonuma problem sounds interesting but again this patch doesn't > > > >>>>> really solve the underlying problem because I strongly suspect that the > > > >>>>> problem is still there when a numa node gets all its memory offline as > > > >>>>> mentioned above. > > > > I would really appreciate a feedback to these two as well. > > 1. Its not just numactl that's to be fixed but all tools/utilities that > depend on /sys/devices/system/node/online. Are we saying to not rely/believe > in the output given by the kernel but do further verification? No, what we are saying is that even an online node might have zero number of online pages/cpus. So the online status is not really something that matters. If people are confused by that output then user space tools can make their confusion go away. I really do not understand why the kernel should do any logic there. > Also how would the user space differentiate between the case where the > Kernel missed marking a node as offline to the case where the memory was > offlined on a cpuless node but node wasn't offline?. What I am arguing is that those two shouldn't be any different. Really! > 2. Regarding the autonuma, the case of offline memory is user/admin driven, > so if there is a performance hit, its something that's driven by his > user/admin actions. Also how often do we see users offline complete memory > of cpuless node on a 2 node system? How often do we see crippled HW configurations like that? Really if autonuma should be made more clever for one case it should recognize the other as well. > > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] > > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] > > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] > > > > This begs a question whether ppc can do the same thing? > > Certainly ppc can be made to adapt to this situation but that would be a > workaround. Do we have a reason why we think node 0 is unique and special? It is not. As replied in other email in this thread. I would hope for having less hacks in the numa initialization. 
Cleaning up the mess would be a lot of work and testing on all NUMA capable architectures. This is a heritage from the past I am afraid. All that I am arguing here is that touching the generic code with a very simple looking patch might have side effects which are pretty much impossible to review. Moreover it seems that nothing but ppc really needs this treatment. So fixing it in ppc specific code sounds much safer. Normally I would really push for a generic solution but after getting burned several times in this area I do not dare anymore. The problem is not in the code complexity but in how spread it is in places where you do not expect side effects.
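To make the scenario Michal keeps returning to concrete (offlining all of a node's memory blocks from user space leaves the node online but memoryless), here is a hedged sketch; the memory-block sysfs paths follow the standard layout and the block number is only an example:

```c
#include <stdio.h>

/* Equivalent of: echo offline > /sys/devices/system/memory/memoryN/state.
 * Doing this for every memoryN block linked under
 * /sys/devices/system/node/nodeX/ empties node X without hot-removing
 * anything, reproducing the "online but memoryless" state. */
static int offline_memory_block(int block)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/memory%d/state", block);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs("offline\n", f);
	fclose(f);
	return 0;
}

int main(void)
{
	return offline_memory_block(32);	/* block number is just an example */
}
```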
* Michal Hocko <mhocko@kernel.org> [2020-07-02 10:41:23]: > On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote: > > * Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]: > > > > > > >>>>> The autonuma problem sounds interesting but again this patch doesn't > > > > >>>>> really solve the underlying problem because I strongly suspect that the > > > > >>>>> problem is still there when a numa node gets all its memory offline as > > > > >>>>> mentioned above. > > > > > > I would really appreciate a feedback to these two as well. > > > > 1. Its not just numactl that's to be fixed but all tools/utilities that > > depend on /sys/devices/system/node/online. Are we saying to not rely/believe > > in the output given by the kernel but do further verification? > > No, what we are saying is that even an online node might have zero > number of online pages/cpus. So the online status is not really > something that matters. If people are confused by that output then user > space tools can make their confusion go away. I really do not understand > why the kernel should do any logic there. The user facing teams are saying they are getting queries from the users who are unable to understand from the tools/sysfs files why a node is online and but has no attached resources. Its the amount of time that is being spent on these issues that triggered the patch. Initially even I was skeptical that this was a non-issue. > > > Also how would the user space differentiate between the case where the > > Kernel missed marking a node as offline to the case where the memory was > > offlined on a cpuless node but node wasn't offline?. > > What I am arguing is that those two shouldn't be any different. Really! > > > 2. Regarding the autonuma, the case of offline memory is user/admin driven, > > so if there is a performance hit, its something that's driven by his > > user/admin actions. Also how often do we see users offline complete memory > > of cpuless node on a 2 node system? > > How often do we see crippled HW configurations like that? Really if > autonuma should be made more clever for one case it should recognize the > other as well. > Lets take a 16 socket PowerVM system and assume that 32 lpars are created on that socket, i.e 2 lpars for each socket. (PowerVM has the final say on how the lpars are created.) In such a case, we can expect 30 out of the 32 lpars to face this problem, with the only 2 lpars that actually run on socket 0 having the correct configuration. > > > > > > This begs a question whether ppc can do the same thing? > > > > Certainly ppc can be made to adapt to this situation but that would be a > > workaround. Do we have a reason why we think node 0 is unique and special? > > It is not. As replied in other email in this thread. I would hope for > having less hacks in the numa initialization. Cleaning up the mess is > would be a lot of work and testing on all NUMA capable architectures. > This is a heritage from the past I am afraid. All that I am arguing here > is that your touch to the generic code with a very simple looking patch > might have side effects which are pretty much impossible to review. > Moreover it seems that nothing but ppc really needs this treatment. > So fixing it in ppc specific code sounds much more safe. > > Normally I would really push for a generic solution but after getting > burned several times in this area I do not dare anymore. The problem is > not in the code complexity but in how spread it is in places where you > do not expect side effects. 
> I do understand and respect your viewpoint. > -- > Michal Hocko > SUSE Labs
On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > > On 01.07.20 13:06, David Hildenbrand wrote: > > > On 01.07.20 13:01, Srikar Dronamraju wrote: > > >> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]: > > >> > > >>> On 01.07.20 12:04, Srikar Dronamraju wrote: > > >>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]: > > >>>> > > >>>>> > > >>>>>> > > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The > > >>>>>> number of online nodes is inconsistent with the information in the > > >>>>>> device-tree and resource-dump > > >>>>>> > > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take > > >>>>>> the hit from the unnecessary numa hinting faults. > > >>>>> > > >>>>> I have to say that I dislike the node online/offline state and directly > > >>>>> exporting that to the userspace. Users should only care whether the node > > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just > > >>>>> offline all the present memory blocks but do not physically hot remove > > >>>>> them and you are in the same situation. If users are confused by an > > >>>>> output of tools like numactl -H then those could be updated and hide > > >>>>> nodes without any memory&cpus. > > >>>>> > > >>>>> The autonuma problem sounds interesting but again this patch doesn't > > >>>>> really solve the underlying problem because I strongly suspect that the > > >>>>> problem is still there when a numa node gets all its memory offline as > > >>>>> mentioned above. > > I would really appreciate a feedback to these two as well. > > > >>>>> While I completely agree that making node 0 special is wrong, I have > > >>>>> still hard time to review this very simply looking patch because all the > > >>>>> numa initialization is so spread around that this might just blow up > > >>>>> at unexpected places. IIRC we have discussed testing in the previous > > >>>>> version and David has provided a way to emulate these configurations > > >>>>> on x86. Did you manage to use those instruction for additional testing > > >>>>> on other than ppc architectures? > > >>>>> > > >>>> > > >>>> I have tried all the steps that David mentioned and reported back at > > >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u > > >>>> > > >>>> As a summary, David's steps are still not creating a memoryless/cpuless on > > >>>> x86 VM. > > >>> > > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not > > >>> online*. Once you hotplug some memory, it will switch online. Once you > > >>> remove memory, it will switch back offline. > > >>> > > >> > > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at > > >> boot. The code in question tries to handle a cpuless/memoryless node 0 at > > >> boot. > > > > > > I was just correcting your statement, because it was wrong. > > > > > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither > > > have CPUs nor memory. That would imply that we can, in fact, never have > > > node 0 offline during boot. > > > > > > > Yep, looks like it. 
> > > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] > > This begs a question whether ppc can do the same thing? Or x86 stop doing it so that you can see on what node you are running? What's the point of this indirection other than another way of avoiding empty node 0? Thanks Michal
[Cc Andi] On Fri 03-07-20 11:10:01, Michal Suchanek wrote: > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote: [...] > > > Yep, looks like it. > > > > > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] > > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] > > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] > > > > This begs a question whether ppc can do the same thing? > Or x86 stop doing it so that you can see on what node you are running? > > What's the point of this indirection other than another way of avoiding > empty node 0? Honestly, I do not have any idea. I've traced it down to Author: Andi Kleen <ak@suse.de> Date: Tue Jan 11 15:35:48 2005 -0800 [PATCH] x86_64: Fix ACPI SRAT NUMA parsing Fix fallout from the recent nodemask_t changes. The node ids assigned in the SRAT parser were off by one. I added a new first_unset_node() function to nodemask.h to allocate IDs sanely. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org> which doesn't really tell all that much. The historical baggage and a long term behavior which is not really trivial to fix I suspect.
On Fri 03-07-20 11:24:17, Michal Hocko wrote: > [Cc Andi] > > On Fri 03-07-20 11:10:01, Michal Suchanek wrote: > > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > [...] > > > > Yep, looks like it. > > > > > > > > [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > > > > [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > > > > [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > > > > [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] > > > > [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] > > > > [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] > > > > > > This begs a question whether ppc can do the same thing? > > Or x86 stop doing it so that you can see on what node you are running? > > > > What's the point of this indirection other than another way of avoiding > > empty node 0? > > Honestly, I do not have any idea. I've traced it down to > Author: Andi Kleen <ak@suse.de> > Date: Tue Jan 11 15:35:48 2005 -0800 > > [PATCH] x86_64: Fix ACPI SRAT NUMA parsing > > Fix fallout from the recent nodemask_t changes. The node ids assigned > in the SRAT parser were off by one. > > I added a new first_unset_node() function to nodemask.h to allocate > IDs sanely. > > Signed-off-by: Andi Kleen <ak@suse.de> > Signed-off-by: Linus Torvalds <torvalds@osdl.org> > > which doesn't really tell all that much. The historical baggage and a > long term behavior which is not really trivial to fix I suspect. Thinking about this some more, this logic makes some sense afterall. Especially in the world without memory hotplug which was very likely the case back then. It is much better to have compact node mask rather than sparse one. After all node numbers shouldn't really matter as long as you have a clear mapping to the HW. I am not sure we export that information (except for the kernel ring buffer) though. The memory hotplug changes that somehow because you can hotremove numa nodes and therefore make the nodemask sparse but that is not a common case. I am not sure what would happen if a completely new node was added and its corresponding node was already used by the renumbered one though. It would likely conflate the two I am afraid. But I am not sure this is really possible with x86 and a lack of a bug report would suggest that nobody is doing that at least.
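The compact-numbering behaviour under discussion can be summarized with a paraphrased sketch of the SRAT PXM-to-node mapping (written from memory, not copied from the tree; the array and mask names are illustrative): each newly seen proximity domain is handed the lowest unused node id via first_unset_node(), which is why a populated PXM 1 ends up as Linux node 0 when PXM 0 is empty.

```c
#include <linux/nodemask.h>
#include <linux/numa.h>

#define EXAMPLE_MAX_PXM_DOMAINS	MAX_NUMNODES	/* illustrative bound */

static nodemask_t nodes_found_map = NODE_MASK_NONE;
static int pxm_to_node_map[EXAMPLE_MAX_PXM_DOMAINS] = {
	[0 ... EXAMPLE_MAX_PXM_DOMAINS - 1] = NUMA_NO_NODE
};

static int example_map_pxm_to_node(int pxm)
{
	int node;

	if (pxm < 0 || pxm >= EXAMPLE_MAX_PXM_DOMAINS)
		return NUMA_NO_NODE;

	node = pxm_to_node_map[pxm];
	if (node == NUMA_NO_NODE) {
		/* Compact numbering: the first PXM seen becomes node 0,
		 * the next one node 1, and so on. */
		node = first_unset_node(nodes_found_map);
		if (node >= MAX_NUMNODES)
			return NUMA_NO_NODE;
		pxm_to_node_map[pxm] = node;
		node_set(node, nodes_found_map);
	}
	return node;
}
```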
On 03.07.20 12:59, Michal Hocko wrote: > On Fri 03-07-20 11:24:17, Michal Hocko wrote: >> [Cc Andi] >> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote: >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote: >> [...] >>>>> Yep, looks like it. >>>>> >>>>> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 >>>>> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 >>>>> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] >>>>> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] >>>>> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] >>>> >>>> This begs a question whether ppc can do the same thing? >>> Or x86 stop doing it so that you can see on what node you are running? >>> >>> What's the point of this indirection other than another way of avoiding >>> empty node 0? >> >> Honestly, I do not have any idea. I've traced it down to >> Author: Andi Kleen <ak@suse.de> >> Date: Tue Jan 11 15:35:48 2005 -0800 >> >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing >> >> Fix fallout from the recent nodemask_t changes. The node ids assigned >> in the SRAT parser were off by one. >> >> I added a new first_unset_node() function to nodemask.h to allocate >> IDs sanely. >> >> Signed-off-by: Andi Kleen <ak@suse.de> >> Signed-off-by: Linus Torvalds <torvalds@osdl.org> >> >> which doesn't really tell all that much. The historical baggage and a >> long term behavior which is not really trivial to fix I suspect. > > Thinking about this some more, this logic makes some sense afterall. > Especially in the world without memory hotplug which was very likely the > case back then. It is much better to have compact node mask rather than > sparse one. After all node numbers shouldn't really matter as long as > you have a clear mapping to the HW. I am not sure we export that > information (except for the kernel ring buffer) though. > > The memory hotplug changes that somehow because you can hotremove numa > nodes and therefore make the nodemask sparse but that is not a common > case. I am not sure what would happen if a completely new node was added > and its corresponding node was already used by the renumbered one > though. It would likely conflate the two I am afraid. But I am not sure > this is really possible with x86 and a lack of a bug report would > suggest that nobody is doing that at least. > I think the ACPI code takes care of properly mapping PXM to nodes. So if I start with PXM 0 empty and PXM 1 populated, I will get PXM 1 == node 0 as described. 
Once I hotplug something to PXM 0 in QEMU

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[ 50.174435] ------------[ cut here ]------------
[ 50.175436] node 1 was absent from the node_possible_map
[ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290
[ 50.176844] Modules linked in:
[ 50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[ 50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[ 50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[ 50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[ 50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[ 50.176849] RSP: 0018:ffffa7a1c0043d48 EFLAGS: 00010296
[ 50.176850] RAX: 000000000000002c RBX: ffff8bc633e63b80 RCX: 0000000000000000
[ 50.176851] RDX: ffff8bc63bc27060 RSI: ffff8bc63bc18d00 RDI: ffff8bc63bc18d00
[ 50.176851] RBP: 0000000000000001 R08: 00000000000001e1 R09: ffffa7a1c0043bd8
[ 50.176852] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000140000000
[ 50.176852] R13: 000000017fffffff R14: 0000000040000000 R15: 0000000180000000
[ 50.176853] FS: 0000000000000000(0000) GS:ffff8bc63bc00000(0000) knlGS:0000000000000000
[ 50.176853] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 50.176855] CR2: 000055dfcbfc5ee8 CR3: 00000000aca0a000 CR4: 00000000000006f0
[ 50.176855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 50.176856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 50.176856] Call Trace:
[ 50.176856] __add_memory+0x33/0x70
[ 50.176857] acpi_memory_device_add+0x132/0x2f2
[ 50.176857] acpi_bus_attach+0xd2/0x200
[ 50.176858] acpi_bus_scan+0x33/0x70
[ 50.176858] acpi_device_hotplug+0x298/0x390
[ 50.176858] acpi_hotplug_work_fn+0x3d/0x50
[ 50.176859] process_one_work+0x1b4/0x370
[ 50.176859] worker_thread+0x53/0x3e0
[ 50.176860] ? process_one_work+0x370/0x370
[ 50.176860] kthread+0x119/0x140
[ 50.176860] ? __kthread_bind_mask+0x60/0x60
[ 50.176861] ret_from_fork+0x22/0x30
[ 50.176861] ---[ end trace 9a2a837c1e0164f1 ]---
[ 50.209816] acpi PNP0C80:00: add_memory failed
[ 50.210510] acpi PNP0C80:00: acpi_memory_enable_device() error
[ 50.211445] acpi PNP0C80:00: Enumeration failure

I remember that we added that check just recently (due to powerpc if I am not wrong).
Not sure why that triggers here.

But it properly maps PXM 0 to node 1.
On Fri 03-07-20 13:32:21, David Hildenbrand wrote: > On 03.07.20 12:59, Michal Hocko wrote: > > On Fri 03-07-20 11:24:17, Michal Hocko wrote: > >> [Cc Andi] > >> > >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote: > >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote: > >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote: > >> [...] > >>>>> Yep, looks like it. > >>>>> > >>>>> [ 0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0 > >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0 > >>>>> [ 0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0 > >>>>> [ 0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0 > >>>>> [ 0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff] > >>>>> [ 0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff] > >>>>> [ 0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff] > >>>> > >>>> This begs a question whether ppc can do the same thing? > >>> Or x86 stop doing it so that you can see on what node you are running? > >>> > >>> What's the point of this indirection other than another way of avoiding > >>> empty node 0? > >> > >> Honestly, I do not have any idea. I've traced it down to > >> Author: Andi Kleen <ak@suse.de> > >> Date: Tue Jan 11 15:35:48 2005 -0800 > >> > >> [PATCH] x86_64: Fix ACPI SRAT NUMA parsing > >> > >> Fix fallout from the recent nodemask_t changes. The node ids assigned > >> in the SRAT parser were off by one. > >> > >> I added a new first_unset_node() function to nodemask.h to allocate > >> IDs sanely. > >> > >> Signed-off-by: Andi Kleen <ak@suse.de> > >> Signed-off-by: Linus Torvalds <torvalds@osdl.org> > >> > >> which doesn't really tell all that much. The historical baggage and a > >> long term behavior which is not really trivial to fix I suspect. > > > > Thinking about this some more, this logic makes some sense afterall. > > Especially in the world without memory hotplug which was very likely the > > case back then. It is much better to have compact node mask rather than > > sparse one. After all node numbers shouldn't really matter as long as > > you have a clear mapping to the HW. I am not sure we export that > > information (except for the kernel ring buffer) though. > > > > The memory hotplug changes that somehow because you can hotremove numa > > nodes and therefore make the nodemask sparse but that is not a common > > case. I am not sure what would happen if a completely new node was added > > and its corresponding node was already used by the renumbered one > > though. It would likely conflate the two I am afraid. But I am not sure > > this is really possible with x86 and a lack of a bug report would > > suggest that nobody is doing that at least. > > > > I think the ACPI code takes care of properly mapping PXM to nodes. > > So if I start with PXM 0 empty and PXM 1 populated, I will get > PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU > > $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor > $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor > > $ echo "info numa" | sudo nc -U /var/tmp/monitor > QEMU 5.0.50 monitor - type 'help' for more information > (qemu) info numa > 2 nodes > node 0 cpus: > node 0 size: 1024 MB > node 0 plugged: 1024 MB > node 1 cpus: 0 1 2 3 > node 1 size: 4096 MB > node 1 plugged: 0 MB Thanks for double checking. 
> I get in the guest: > > [ 50.174435] ------------[ cut here ]------------ > [ 50.175436] node 1 was absent from the node_possible_map > [ 50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290 This would mean that the ACPI code or whoever does the remapping is not adding the new node into possible nodes. [...] > I remember that we added that check just recently (due to powerpc if I am not wrong). > Not sure why that triggers here. This was a misbehaving Qemu IIRC providing a garbage map.
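The splat quoted above comes from a sanity check on the hot-add path; roughly (a paraphrase, not the exact mm/memory_hotplug.c source), memory hot-add is refused for a node id that was never marked possible:

```c
#include <linux/nodemask.h>
#include <linux/bug.h>
#include <linux/errno.h>

/* Paraphrased sketch of the check behind the warning quoted above. */
static int example_check_hotplug_nid(int nid)
{
	if (WARN(!node_possible(nid),
		 "node %d was absent from the node_possible_map\n", nid))
		return -EINVAL;
	return 0;
}
```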
* Michal Hocko <mhocko@kernel.org> [2020-07-03 12:59:44]: > > Honestly, I do not have any idea. I've traced it down to > > Author: Andi Kleen <ak@suse.de> > > Date: Tue Jan 11 15:35:48 2005 -0800 > > > > [PATCH] x86_64: Fix ACPI SRAT NUMA parsing > > > > Fix fallout from the recent nodemask_t changes. The node ids assigned > > in the SRAT parser were off by one. > > > > I added a new first_unset_node() function to nodemask.h to allocate > > IDs sanely. > > > > Signed-off-by: Andi Kleen <ak@suse.de> > > Signed-off-by: Linus Torvalds <torvalds@osdl.org> > > > > which doesn't really tell all that much. The historical baggage and a > > long term behavior which is not really trivial to fix I suspect. > > Thinking about this some more, this logic makes some sense afterall. > Especially in the world without memory hotplug which was very likely the > case back then. It is much better to have compact node mask rather than > sparse one. After all node numbers shouldn't really matter as long as > you have a clear mapping to the HW. I am not sure we export that > information (except for the kernel ring buffer) though. > > The memory hotplug changes that somehow because you can hotremove numa > nodes and therefore make the nodemask sparse but that is not a common > case. I am not sure what would happen if a completely new node was added > and its corresponding node was already used by the renumbered one > though. It would likely conflate the two I am afraid. But I am not sure > this is really possible with x86 and a lack of a bug report would > suggest that nobody is doing that at least. > JFYI, Satheesh copied in this mailchain had opened a bug a year on crash with vcpu hotplug on memoryless node. https://bugzilla.kernel.org/show_bug.cgi?id=202187
> > What's the point of this indirection other than another way of avoiding > > empty node 0? > > Honestly, I do not have any idea. I've traced it down to > Author: Andi Kleen <ak@suse.de> > Date: Tue Jan 11 15:35:48 2005 -0800 I don't remember all the details, and I can't even find the commit (is it in linux-historic?). But AFAIK there's no guarantee PXMs are small and continuous, so it seemed better to have a clean zero based space. Back then we had a lot of problems with buggy SRAT tables in BIOS, so we really tried to avoid trusting the BIOS as much as possible. -Andi
On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > The memory hotplug changes that somehow because you can hotremove numa > > nodes and therefore make the nodemask sparse but that is not a common > > case. I am not sure what would happen if a completely new node was added > > and its corresponding node was already used by the renumbered one > > though. It would likely conflate the two I am afraid. But I am not sure > > this is really possible with x86 and a lack of a bug report would > > suggest that nobody is doing that at least. > > > > JFYI, > Satheesh copied in this mailchain had opened a bug a year on crash with vcpu > hotplug on memoryless node. > > https://bugzilla.kernel.org/show_bug.cgi?id=202187 So... do we merge this patch or not? Seems that the overall view is "risky but nobody is likely to do anything better any time soon"?
On 07.08.20 06:32, Andrew Morton wrote: > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > >>> The memory hotplug changes that somehow because you can hotremove numa >>> nodes and therefore make the nodemask sparse but that is not a common >>> case. I am not sure what would happen if a completely new node was added >>> and its corresponding node was already used by the renumbered one >>> though. It would likely conflate the two I am afraid. But I am not sure >>> this is really possible with x86 and a lack of a bug report would >>> suggest that nobody is doing that at least. >>> >> >> JFYI, >> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu >> hotplug on memoryless node. >> >> https://bugzilla.kernel.org/show_bug.cgi?id=202187 > > So... do we merge this patch or not? Seems that the overall view is > "risky but nobody is likely to do anything better any time soon"? I recall the issue Michal saw was "fix powerpc" vs. "break other architectures". @Michal how should we proceed? At least x86-64 won't be affected IIUC.
On Fri, Aug 07, 2020 at 08:58:09AM +0200, David Hildenbrand wrote: > On 07.08.20 06:32, Andrew Morton wrote: > > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > > >>> The memory hotplug changes that somehow because you can hotremove numa > >>> nodes and therefore make the nodemask sparse but that is not a common > >>> case. I am not sure what would happen if a completely new node was added > >>> and its corresponding node was already used by the renumbered one > >>> though. It would likely conflate the two I am afraid. But I am not sure > >>> this is really possible with x86 and a lack of a bug report would > >>> suggest that nobody is doing that at least. > >>> > >> > >> JFYI, > >> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu > >> hotplug on memoryless node. > >> > >> https://bugzilla.kernel.org/show_bug.cgi?id=202187 > > > > So... do we merge this patch or not? Seems that the overall view is > > "risky but nobody is likely to do anything better any time soon"? > > I recall the issue Michal saw was "fix powerpc" vs. "break other > architectures". @Michal how should we proceed? At least x86-64 won't be > affected IIUC. There is a patch to introduce the node remapping on ppc as well which should eliminate the empty node 0. https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200731111916.243569-1-aneesh.kumar@linux.ibm.com/ Thanks Michal
Hi Andrew, Michal, David * Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]: > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > > > The memory hotplug changes that somehow because you can hotremove numa > > > nodes and therefore make the nodemask sparse but that is not a common > > > case. I am not sure what would happen if a completely new node was added > > > and its corresponding node was already used by the renumbered one > > > though. It would likely conflate the two I am afraid. But I am not sure > > > this is really possible with x86 and a lack of a bug report would > > > suggest that nobody is doing that at least. > > > > > > > JFYI, > > Satheesh copied in this mailchain had opened a bug a year on crash with vcpu > > hotplug on memoryless node. > > > > https://bugzilla.kernel.org/show_bug.cgi?id=202187 > > So... do we merge this patch or not? Seems that the overall view is > "risky but nobody is likely to do anything better any time soon"? Can we decide on this one way or the other?
On 12.08.20 08:01, Srikar Dronamraju wrote: > Hi Andrew, Michal, David > > * Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]: > >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: >> >>>> The memory hotplug changes that somehow because you can hotremove numa >>>> nodes and therefore make the nodemask sparse but that is not a common >>>> case. I am not sure what would happen if a completely new node was added >>>> and its corresponding node was already used by the renumbered one >>>> though. It would likely conflate the two I am afraid. But I am not sure >>>> this is really possible with x86 and a lack of a bug report would >>>> suggest that nobody is doing that at least. >>>> >>> >>> JFYI, >>> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu >>> hotplug on memoryless node. >>> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202187 >> >> So... do we merge this patch or not? Seems that the overall view is >> "risky but nobody is likely to do anything better any time soon"? > > Can we decide on this one way or the other? Hmm, not sure who's the person to decide. I tend to prefer doing the node renaming, handling this in ppc code; looking at the review of v2 there are still some concerns regarding numa distances. https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200817103238.158133-1-aneesh.kumar@linux.ibm.com/
On Tue 18-08-20 09:32:52, David Hildenbrand wrote: > On 12.08.20 08:01, Srikar Dronamraju wrote: > > Hi Andrew, Michal, David > > > > * Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]: > > > >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > >> > >>>> The memory hotplug changes that somehow because you can hotremove numa > >>>> nodes and therefore make the nodemask sparse but that is not a common > >>>> case. I am not sure what would happen if a completely new node was added > >>>> and its corresponding node was already used by the renumbered one > >>>> though. It would likely conflate the two I am afraid. But I am not sure > >>>> this is really possible with x86 and a lack of a bug report would > >>>> suggest that nobody is doing that at least. > >>>> > >>> > >>> JFYI, > >>> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu > >>> hotplug on memoryless node. > >>> > >>> https://bugzilla.kernel.org/show_bug.cgi?id=202187 > >> > >> So... do we merge this patch or not? Seems that the overall view is > >> "risky but nobody is likely to do anything better any time soon"? > > > > Can we decide on this one way or the other? > > Hmm, not sure who's the person to decide. I tend to prefer doing the > node renaming, handling this in ppc code; Agreed. That would be a safer option.
* Michal Hocko <mhocko@suse.com> [2020-08-18 09:37:12]: > On Tue 18-08-20 09:32:52, David Hildenbrand wrote: > > On 12.08.20 08:01, Srikar Dronamraju wrote: > > > Hi Andrew, Michal, David > > > > > > * Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]: > > > > > >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote: > > >> > > >>>> The memory hotplug changes that somehow because you can hotremove numa > > >>>> nodes and therefore make the nodemask sparse but that is not a common > > >>>> case. I am not sure what would happen if a completely new node was added > > >>>> and its corresponding node was already used by the renumbered one > > >>>> though. It would likely conflate the two I am afraid. But I am not sure > > >>>> this is really possible with x86 and a lack of a bug report would > > >>>> suggest that nobody is doing that at least. > > >>>> > > >> So... do we merge this patch or not? Seems that the overall view is > > >> "risky but nobody is likely to do anything better any time soon"? > > > > > > Can we decide on this one way or the other? > > > > Hmm, not sure who's the person to decide. I tend to prefer doing the > > node renaming, handling this in ppc code; > > Agreed. That would be a safer option. Okay, will send arch specific v6 version. > -- > Michal Hocko > SUSE Labs
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..5187664558e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },
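The effect of the hunk is that, with CONFIG_NUMA, node_states[N_ONLINE] now starts out empty, so no node, not even node 0, is assumed online at boot; it becomes the job of the arch NUMA setup to online whatever nodes it actually discovers (which is what the dependent powerpc commits arrange). A hedged sketch of that responsibility follows; node_set_online() and for_each_node_mask() are the existing nodemask helpers, but the function and the node mask it walks are illustrative, not taken from any architecture's setup code.

/*
 * Illustrative only: once nothing is online by default, arch NUMA init is
 * expected to online exactly the nodes the firmware reported and no others.
 */
#include <linux/init.h>
#include <linux/nodemask.h>

static void __init online_reported_nodes(const nodemask_t *found)
{
	int nid;

	/* Walk the (hypothetical) mask of firmware-reported nodes and online
	 * each one explicitly; node 0 stays offline unless it is in the mask. */
	for_each_node_mask(nid, *found)
		node_set_online(nid);
}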
Currently a Linux kernel with CONFIG_NUMA, on a system with multiple
possible nodes, marks node 0 as online at boot. However, in practice
there are systems which have node 0 as memoryless and cpuless.

This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node, which is cpuless and
memoryless, can confuse users/scripts looking at the output of lscpu /
numactl.

By marking N_ONLINE as NODE_MASK_NONE, let's stop assuming that node 0 is
always online.

v5.8-rc2
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node   0   2
  0:  10  20
  2:  20  10

proc and sys files
------------------
/sys/devices/system/node/online:            0,2
/proc/sys/kernel/numa_balancing:            1
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

v5.8-rc2 + patch
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node   2
  2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without those two commits, a Powerpc system might crash.

1. User space applications like numactl and lscpu, which parse sysfs, tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that the system
was not able to use all the resources (i.e. missing resources) or that the
system was not set up correctly.

2. The existence of the dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource dump.

3. When the dummy node is present, single-node non-NUMA systems end up showing
up as NUMA systems and numa_balancing gets enabled. This means we take the
hit from unnecessary numa hinting faults.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4 -> v5:
- Rebased to v5.8-rc2
Link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1 -> v2:
- Rebased to v5.7-rc3
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
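For reference, everything numactl and lscpu have to go on is the handful of sysfs masks listed above. Below is a trimmed user-space sketch that simply dumps them; it reads only the files already named in this changelog, omits error handling, and is not code from either tool.

/*
 * Print the node masks that user space tooling parses. With the dummy
 * node present, "online" reads "0,2" even though node 0 has neither
 * CPUs nor memory; with the patch it reads "2".
 */
#include <stdio.h>

static void show(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("%-45s %s", path, buf);	/* sysfs values already end with '\n' */
	fclose(f);
}

int main(void)
{
	show("/sys/devices/system/node/online");
	show("/sys/devices/system/node/possible");
	show("/sys/devices/system/node/has_cpu");
	show("/sys/devices/system/node/has_memory");
	show("/proc/sys/kernel/numa_balancing");
	return 0;
}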