[v5,3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

Message ID: 20200624092846.9194-4-srikar@linux.vnet.ibm.com
State: New, archived
Series: Offline memoryless cpuless node 0

Commit Message

Srikar Dronamraju June 24, 2020, 9:28 a.m. UTC
Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot.  In practice, however,
there are systems on which node 0 is both memoryless and cpuless.

This causes numa_balancing to be enabled on systems that effectively have
only one node with memory and CPUs. The existence of this cpuless, memoryless
dummy node can also confuse users and scripts looking at the output of lscpu /
numactl.

By initializing N_ONLINE to NODE_MASK_NONE, stop assuming that node 0 is
always online.

v5.8-rc2
 available: 2 nodes (0,2)
 node 0 cpus:
 node 0 size: 0 MB
 node 0 free: 0 MB
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31490 MB
 node distances:
 node   0   2
   0:  10  20
   2:  20  10

proc and sys files
------------------
 /sys/devices/system/node/online:            0,2
 /proc/sys/kernel/numa_balancing:            1
 /sys/devices/system/node/has_cpu:           2
 /sys/devices/system/node/has_memory:        2
 /sys/devices/system/node/has_normal_memory: 2
 /sys/devices/system/node/possible:          0-31

v5.8-rc2 + patch
------------------
 available: 1 nodes (2)
 node 2 cpus: 0 1 2 3 4 5 6 7
 node 2 size: 32625 MB
 node 2 free: 31487 MB
 node distances:
 node   2
   2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

Note: On powerpc, cpu_to_node() of possible-but-not-present CPUs would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without those two commits, a powerpc system might crash.
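
To illustrate the dependency (a sketch only: the wrapper function below is
hypothetical, while cpu_to_node() and NODE_DATA() are the standard accessors):

	/* Hypothetical call site, only to illustrate the dependency above. */
	static struct pglist_data *pgdat_of_possible_cpu(int cpu)
	{
		int nid = cpu_to_node(cpu);	/* previously 0 for possible-but-not-present CPUs */

		/*
		 * With this patch node 0 may stay offline and NODE_DATA(0) is
		 * then never initialised, so dereferencing it here would crash.
		 * The two prerequisite powerpc commits make cpu_to_node() return
		 * the node the CPU will really come up in instead of a
		 * hard-coded 0.
		 */
		return NODE_DATA(nid);
	}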

1. User space applications such as numactl and lscpu that parse sysfs conclude
that there is an extra online node. This confuses users and applications;
some start believing that the system was not able to use all of its resources
(i.e. resources are missing) or was not set up correctly.

2. The existence of the dummy node also leads to inconsistent information: the
number of online nodes disagrees with the information in the device tree and
the resource dump.

3. When the dummy node is present, single-node non-NUMA systems show up as
NUMA systems and numa_balancing gets enabled, so we take the hit from
unnecessary NUMA hinting faults.
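
The auto-enable decision keys solely off the number of online nodes, so the
dummy online node is enough to flip it on. Roughly (a from-memory sketch of
mm/mempolicy.c's check_numabalancing_enable(), not a verbatim copy):

	static int __initdata numabalancing_override;	/* set by the numa_balancing= option */

	static void __init check_numabalancing_enable(void)
	{
		bool numabalancing_default = false;

		if (IS_ENABLED(CONFIG_NUMA_BALANCING_DEFAULT_ENABLED))
			numabalancing_default = true;

		/*
		 * A cpuless, memoryless but online node 0 is enough to make
		 * num_online_nodes() report 2, so balancing gets switched on
		 * even on an effectively single-node system.
		 */
		if (num_online_nodes() > 1 && !numabalancing_override)
			set_numabalancing_state(numabalancing_default);
	}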

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Christopher Lameter <cl@linux.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v4:->v5:
- rebased to v5.8-rc2
link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1:->v2:
- Rebased to v5.7-rc3
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

 mm/page_alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
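
The change itself (quoted in full in the replies below) is a one-hunk update
to the static node_states[] initializer in mm/page_alloc.c:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..5187664558e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },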

Comments

Christoph Lameter (Ampere) June 29, 2020, 2:58 p.m. UTC | #1
On Wed, 24 Jun 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Maybe add something to explain why you are not simply mapping the
existing memory to NUMA node 0 which is after all just a numbering scheme
used by the kernel and can be used arbitrarily?

This could be seen more as a bug in the arch code during the setup of NUMA
nodes. The two nodes are created by the firmware / bootstrap code after
all. Just do not do it?
Srikar Dronamraju June 30, 2020, 4:01 a.m. UTC | #2
* Christopher Lameter <cl@linux.com> [2020-06-29 14:58:40]:

> On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> 
> > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > possible nodes, marks node 0 as online at boot.  However in practice,
> > there are systems which have node 0 as memoryless and cpuless.
> 
> Maybe add something to explain why you are not simply mapping the
> existing memory to NUMA node 0 which is after all just a numbering scheme
> used by the kernel and can be used arbitrarily?
> 

I thought Michal Hocko already gave a clear picture on why mapping is a bad
idea. https://lore.kernel.org/lkml/20200316085425.GB11482@dhcp22.suse.cz/t/#u
Are you suggesting that we add that as part of the changelog?

> This could be seen more as a bug in the arch code during the setup of NUMA
> nodes. The two nodes are created by the firmwware / bootstrap code after
> all. Just do not do it?
> 

- The powerpc arch/setup code is not what onlines these nodes.
- Later on, CPUs/memory in node 0 can be onlined.
- The firmware (PHYP in this case) is independent code in its own right.
Michal Hocko July 1, 2020, 8:42 a.m. UTC | #3
On Wed 24-06-20 14:58:46, Srikar Dronamraju wrote:
> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.
> 
> This can cause numa_balancing to be enabled on systems with only one node
> with memory and CPUs. The existence of this dummy node which is cpuless and
> memoryless node can confuse users/scripts looking at output of lscpu /
> numactl.
> 
> By marking, N_ONLINE as NODE_MASK_NONE, lets stop assuming that Node 0 is
> always online.
> 
> v5.8-rc2
>  available: 2 nodes (0,2)
>  node 0 cpus:
>  node 0 size: 0 MB
>  node 0 free: 0 MB
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31490 MB
>  node distances:
>  node   0   2
>    0:  10  20
>    2:  20  10
> 
> proc and sys files
> ------------------
>  /sys/devices/system/node/online:            0,2
>  /proc/sys/kernel/numa_balancing:            1
>  /sys/devices/system/node/has_cpu:           2
>  /sys/devices/system/node/has_memory:        2
>  /sys/devices/system/node/has_normal_memory: 2
>  /sys/devices/system/node/possible:          0-31
> 
> v5.8-rc2 + patch
> ------------------
>  available: 1 nodes (2)
>  node 2 cpus: 0 1 2 3 4 5 6 7
>  node 2 size: 32625 MB
>  node 2 free: 31487 MB
>  node distances:
>  node   2
>    2:  10
> 
> proc and sys files
> ------------------
> /sys/devices/system/node/online:            2
> /proc/sys/kernel/numa_balancing:            0
> /sys/devices/system/node/has_cpu:           2
> /sys/devices/system/node/has_memory:        2
> /sys/devices/system/node/has_normal_memory: 2
> /sys/devices/system/node/possible:          0-31
> 
> Note: On Powerpc, cpu_to_node of possible but not present cpus would
> previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
> numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
> queried from vphn"). Without the 2 commits, Powerpc system might crash.
> 
> 1. User space applications like Numactl, lscpu, that parse the sysfs tend to
> believe there is an extra online node. This tends to confuse users and
> applications. Other user space applications start believing that system was
> not able to use all the resources (i.e missing resources) or the system was
> not setup correctly.
> 
> 2. Also existence of dummy node also leads to inconsistent information. The
> number of online nodes is inconsistent with the information in the
> device-tree and resource-dump
> 
> 3. When the dummy node is present, single node non-Numa systems end up showing
> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> the hit from the unnecessary numa hinting faults.

I have to say that I dislike the node online/offline state and directly
exporting that to the userspace. Users should only care whether the node
has memory/cpus. Numa nodes can be online without any memory. Just
offline all the present memory blocks but do not physically hot remove
them and you are in the same situation. If users are confused by an
output of tools like numactl -H then those could be updated and hide
nodes without any memory&cpus.
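
Offlining all of a node's memory without removing it is just (a sketch; node
and memory block numbering are system-specific):

	# offline every memory block belonging to node 1; the hardware stays in place
	for blk in /sys/devices/system/node/node1/memory*; do
		echo offline > "$blk/state"
	done
	cat /sys/devices/system/node/online	# node 1 is still reported as online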

The autonuma problem sounds interesting but again this patch doesn't
really solve the underlying problem because I strongly suspect that the
problem is still there when a numa node gets all its memory offline as
mentioned above.

While I completely agree that making node 0 special is wrong, I still
have a hard time reviewing this very simple looking patch because all the
numa initialization is so spread around that this might just blow up
at unexpected places. IIRC we have discussed testing in the previous
version and David has provided a way to emulate these configurations
on x86. Did you manage to use those instructions for additional testing
on architectures other than ppc?

> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v4:->v5:
> - rebased to v5.8-rc2
> link v4: http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u
> 
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u
> 
>  mm/page_alloc.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 48eb0f1410d4..5187664558e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -117,8 +117,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>  	[N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,
> +#else
>  	[N_ONLINE] = { { [0] = 1UL } },
> -#ifndef CONFIG_NUMA
>  	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
>  #ifdef CONFIG_HIGHMEM
>  	[N_HIGH_MEMORY] = { { [0] = 1UL } },
> -- 
> 2.18.1
>
Srikar Dronamraju July 1, 2020, 10:04 a.m. UTC | #4
* Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:

> 
> > 
> > 2. Also existence of dummy node also leads to inconsistent information. The
> > number of online nodes is inconsistent with the information in the
> > device-tree and resource-dump
> > 
> > 3. When the dummy node is present, single node non-Numa systems end up showing
> > up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > the hit from the unnecessary numa hinting faults.
> 
> I have to say that I dislike the node online/offline state and directly
> exporting that to the userspace. Users should only care whether the node
> has memory/cpus. Numa nodes can be online without any memory. Just
> offline all the present memory blocks but do not physically hot remove
> them and you are in the same situation. If users are confused by an
> output of tools like numactl -H then those could be updated and hide
> nodes without any memory&cpus.
> 
> The autonuma problem sounds interesting but again this patch doesn't
> really solve the underlying problem because I strongly suspect that the
> problem is still there when a numa node gets all its memory offline as
> mentioned above.
> 
> While I completely agree that making node 0 special is wrong, I have
> still hard time to review this very simply looking patch because all the
> numa initialization is so spread around that this might just blow up
> at unexpected places. IIRC we have discussed testing in the previous
> version and David has provided a way to emulate these configurations
> on x86. Did you manage to use those instruction for additional testing
> on other than ppc architectures?
> 

I have tried all the steps that David mentioned and reported back at
https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u

As a summary, David's steps are still not creating a memoryless/cpuless node
on an x86 VM. I have tried booting with NUMA/non-NUMA on all the x86 machines
that I could get to.
David Hildenbrand July 1, 2020, 10:15 a.m. UTC | #5
On 01.07.20 12:04, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> 
>>
>>>
>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>> number of online nodes is inconsistent with the information in the
>>> device-tree and resource-dump
>>>
>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>> the hit from the unnecessary numa hinting faults.
>>
>> I have to say that I dislike the node online/offline state and directly
>> exporting that to the userspace. Users should only care whether the node
>> has memory/cpus. Numa nodes can be online without any memory. Just
>> offline all the present memory blocks but do not physically hot remove
>> them and you are in the same situation. If users are confused by an
>> output of tools like numactl -H then those could be updated and hide
>> nodes without any memory&cpus.
>>
>> The autonuma problem sounds interesting but again this patch doesn't
>> really solve the underlying problem because I strongly suspect that the
>> problem is still there when a numa node gets all its memory offline as
>> mentioned above.
>>
>> While I completely agree that making node 0 special is wrong, I have
>> still hard time to review this very simply looking patch because all the
>> numa initialization is so spread around that this might just blow up
>> at unexpected places. IIRC we have discussed testing in the previous
>> version and David has provided a way to emulate these configurations
>> on x86. Did you manage to use those instruction for additional testing
>> on other than ppc architectures?
>>
> 
> I have tried all the steps that David mentioned and reported back at
> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> 
> As a summary, David's steps are still not creating a memoryless/cpuless on
> x86 VM.

Now, that is wrong. You get a memoryless/cpuless node, which is *not
online*. Once you hotplug some memory, it will switch online. Once you
remove memory, it will switch back offline.
Srikar Dronamraju July 1, 2020, 11:01 a.m. UTC | #6
* David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:

> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> > 
> >>
> >>>
> >>> 2. Also existence of dummy node also leads to inconsistent information. The
> >>> number of online nodes is inconsistent with the information in the
> >>> device-tree and resource-dump
> >>>
> >>> 3. When the dummy node is present, single node non-Numa systems end up showing
> >>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>> the hit from the unnecessary numa hinting faults.
> >>
> >> I have to say that I dislike the node online/offline state and directly
> >> exporting that to the userspace. Users should only care whether the node
> >> has memory/cpus. Numa nodes can be online without any memory. Just
> >> offline all the present memory blocks but do not physically hot remove
> >> them and you are in the same situation. If users are confused by an
> >> output of tools like numactl -H then those could be updated and hide
> >> nodes without any memory&cpus.
> >>
> >> The autonuma problem sounds interesting but again this patch doesn't
> >> really solve the underlying problem because I strongly suspect that the
> >> problem is still there when a numa node gets all its memory offline as
> >> mentioned above.
> >>
> >> While I completely agree that making node 0 special is wrong, I have
> >> still hard time to review this very simply looking patch because all the
> >> numa initialization is so spread around that this might just blow up
> >> at unexpected places. IIRC we have discussed testing in the previous
> >> version and David has provided a way to emulate these configurations
> >> on x86. Did you manage to use those instruction for additional testing
> >> on other than ppc architectures?
> >>
> > 
> > I have tried all the steps that David mentioned and reported back at
> > https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> > 
> > As a summary, David's steps are still not creating a memoryless/cpuless on
> > x86 VM.
> 
> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> online*. Once you hotplug some memory, it will switch online. Once you
> remove memory, it will switch back offline.
> 

Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
boot.  The code in question tries to handle a cpuless/memoryless node 0 at
boot.

With the steps that you gave, node 0 was always populated; node 1 or
some other node would be memoryless/cpuless and offline. But that case
should not be impacted by this patch.

I don't see how adding/hotplugging/removing memory to a node after boot is
going to affect the changes that I have made. Please do correct me if I have
misunderstood.
David Hildenbrand July 1, 2020, 11:06 a.m. UTC | #7
On 01.07.20 13:01, Srikar Dronamraju wrote:
> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
> 
>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
>>>
>>>>
>>>>>
>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>>>> number of online nodes is inconsistent with the information in the
>>>>> device-tree and resource-dump
>>>>>
>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>>>> the hit from the unnecessary numa hinting faults.
>>>>
>>>> I have to say that I dislike the node online/offline state and directly
>>>> exporting that to the userspace. Users should only care whether the node
>>>> has memory/cpus. Numa nodes can be online without any memory. Just
>>>> offline all the present memory blocks but do not physically hot remove
>>>> them and you are in the same situation. If users are confused by an
>>>> output of tools like numactl -H then those could be updated and hide
>>>> nodes without any memory&cpus.
>>>>
>>>> The autonuma problem sounds interesting but again this patch doesn't
>>>> really solve the underlying problem because I strongly suspect that the
>>>> problem is still there when a numa node gets all its memory offline as
>>>> mentioned above.
>>>>
>>>> While I completely agree that making node 0 special is wrong, I have
>>>> still hard time to review this very simply looking patch because all the
>>>> numa initialization is so spread around that this might just blow up
>>>> at unexpected places. IIRC we have discussed testing in the previous
>>>> version and David has provided a way to emulate these configurations
>>>> on x86. Did you manage to use those instruction for additional testing
>>>> on other than ppc architectures?
>>>>
>>>
>>> I have tried all the steps that David mentioned and reported back at
>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
>>>
>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>> x86 VM.
>>
>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>> online*. Once you hotplug some memory, it will switch online. Once you
>> remove memory, it will switch back offline.
>>
> 
> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> boot.

I was just correcting your statement, because it was wrong.

Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
have CPUs nor memory. That would imply that we can, in fact, never have
node 0 offline during boot.
David Hildenbrand July 1, 2020, 11:30 a.m. UTC | #8
On 01.07.20 13:06, David Hildenbrand wrote:
> On 01.07.20 13:01, Srikar Dronamraju wrote:
>> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
>>
>>> On 01.07.20 12:04, Srikar Dronamraju wrote:
>>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
>>>>
>>>>>
>>>>>>
>>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
>>>>>> number of online nodes is inconsistent with the information in the
>>>>>> device-tree and resource-dump
>>>>>>
>>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
>>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
>>>>>> the hit from the unnecessary numa hinting faults.
>>>>>
>>>>> I have to say that I dislike the node online/offline state and directly
>>>>> exporting that to the userspace. Users should only care whether the node
>>>>> has memory/cpus. Numa nodes can be online without any memory. Just
>>>>> offline all the present memory blocks but do not physically hot remove
>>>>> them and you are in the same situation. If users are confused by an
>>>>> output of tools like numactl -H then those could be updated and hide
>>>>> nodes without any memory&cpus.
>>>>>
>>>>> The autonuma problem sounds interesting but again this patch doesn't
>>>>> really solve the underlying problem because I strongly suspect that the
>>>>> problem is still there when a numa node gets all its memory offline as
>>>>> mentioned above.
>>>>>
>>>>> While I completely agree that making node 0 special is wrong, I have
>>>>> still hard time to review this very simply looking patch because all the
>>>>> numa initialization is so spread around that this might just blow up
>>>>> at unexpected places. IIRC we have discussed testing in the previous
>>>>> version and David has provided a way to emulate these configurations
>>>>> on x86. Did you manage to use those instruction for additional testing
>>>>> on other than ppc architectures?
>>>>>
>>>>
>>>> I have tried all the steps that David mentioned and reported back at
>>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
>>>>
>>>> As a summary, David's steps are still not creating a memoryless/cpuless on
>>>> x86 VM.
>>>
>>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
>>> online*. Once you hotplug some memory, it will switch online. Once you
>>> remove memory, it will switch back offline.
>>>
>>
>> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
>> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
>> boot.
> 
> I was just correcting your statement, because it was wrong.
> 
> Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> have CPUs nor memory. That would imply that we can, in fact, never have
> node 0 offline during boot.
> 

Yep, looks like it.

[    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
[    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
[    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
[    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
[    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
[    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
[    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
Michal Hocko July 1, 2020, 12:21 p.m. UTC | #9
On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> On 01.07.20 13:06, David Hildenbrand wrote:
> > On 01.07.20 13:01, Srikar Dronamraju wrote:
> >> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
> >>
> >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> >>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> >>>>
> >>>>>
> >>>>>>
> >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> >>>>>> number of online nodes is inconsistent with the information in the
> >>>>>> device-tree and resource-dump
> >>>>>>
> >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> >>>>>> the hit from the unnecessary numa hinting faults.
> >>>>>
> >>>>> I have to say that I dislike the node online/offline state and directly
> >>>>> exporting that to the userspace. Users should only care whether the node
> >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> >>>>> offline all the present memory blocks but do not physically hot remove
> >>>>> them and you are in the same situation. If users are confused by an
> >>>>> output of tools like numactl -H then those could be updated and hide
> >>>>> nodes without any memory&cpus.
> >>>>>
> >>>>> The autonuma problem sounds interesting but again this patch doesn't
> >>>>> really solve the underlying problem because I strongly suspect that the
> >>>>> problem is still there when a numa node gets all its memory offline as
> >>>>> mentioned above.

I would really appreciate feedback on these two points as well.

> >>>>> While I completely agree that making node 0 special is wrong, I have
> >>>>> still hard time to review this very simply looking patch because all the
> >>>>> numa initialization is so spread around that this might just blow up
> >>>>> at unexpected places. IIRC we have discussed testing in the previous
> >>>>> version and David has provided a way to emulate these configurations
> >>>>> on x86. Did you manage to use those instruction for additional testing
> >>>>> on other than ppc architectures?
> >>>>>
> >>>>
> >>>> I have tried all the steps that David mentioned and reported back at
> >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> >>>>
> >>>> As a summary, David's steps are still not creating a memoryless/cpuless on
> >>>> x86 VM.
> >>>
> >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> >>> online*. Once you hotplug some memory, it will switch online. Once you
> >>> remove memory, it will switch back offline.
> >>>
> >>
> >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> >> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> >> boot.
> > 
> > I was just correcting your statement, because it was wrong.
> > 
> > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > have CPUs nor memory. That would imply that we can, in fact, never have
> > node 0 offline during boot.
> > 
> 
> Yep, looks like it.
> 
> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]

This begs a question whether ppc can do the same thing?

I would swear that we've had x86 system with node 0 but I cannot really
find it and it is possible that it was not x86 after all...
Michal Hocko July 1, 2020, 12:23 p.m. UTC | #10
On Tue 30-06-20 09:31:25, Srikar Dronamraju wrote:
> * Christopher Lameter <cl@linux.com> [2020-06-29 14:58:40]:
> 
> > On Wed, 24 Jun 2020, Srikar Dronamraju wrote:
> > 
> > > Currently Linux kernel with CONFIG_NUMA on a system with multiple
> > > possible nodes, marks node 0 as online at boot.  However in practice,
> > > there are systems which have node 0 as memoryless and cpuless.
> > 
> > Maybe add something to explain why you are not simply mapping the
> > existing memory to NUMA node 0 which is after all just a numbering scheme
> > used by the kernel and can be used arbitrarily?
> > 
> 
> I thought Michal Hocko already gave a clear picture on why mapping is a bad
> idea. https://lore.kernel.org/lkml/20200316085425.GB11482@dhcp22.suse.cz/t/#u
> Are you suggesting that we add that as part of the changelog?

Well, I was not aware that x86 already renumbers nodes, so there is a certain
precedent. As I've said, I do not really like that, but this is what
already happens. If renumbering is not an option then just handle
that in the ppc code explicitly. A generic solution would be preferable of
course, but as I've said it is really hard to check for correctness and
potential subtle issues.
Srikar Dronamraju July 2, 2020, 6:44 a.m. UTC | #11
* Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]:

> > >>>>>>
> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > >>>>>> number of online nodes is inconsistent with the information in the
> > >>>>>> device-tree and resource-dump
> > >>>>>>
> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > >>>>>> the hit from the unnecessary numa hinting faults.
> > >>>>>
> > >>>>> I have to say that I dislike the node online/offline state and directly
> > >>>>> exporting that to the userspace. Users should only care whether the node
> > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > >>>>> offline all the present memory blocks but do not physically hot remove
> > >>>>> them and you are in the same situation. If users are confused by an
> > >>>>> output of tools like numactl -H then those could be updated and hide
> > >>>>> nodes without any memory&cpus.
> > >>>>>
> > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > >>>>> really solve the underlying problem because I strongly suspect that the
> > >>>>> problem is still there when a numa node gets all its memory offline as
> > >>>>> mentioned above.
> 
> I would really appreciate a feedback to these two as well.

1. It's not just numactl that needs to be fixed but all tools/utilities that
depend on /sys/devices/system/node/online. Are we saying not to rely on or
believe the output given by the kernel, but to do further verification?

Also, how would user space differentiate between the case where the
kernel missed marking a node as offline and the case where the memory was
offlined on a cpuless node but the node wasn't marked offline?

2. Regarding autonuma, the case of offline memory is user/admin driven,
so if there is a performance hit, it is something driven by the
user's/admin's own actions. Also, how often do we see users offline the
complete memory of a cpuless node on a 2-node system?

> 
> > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> 
> This begs a question whether ppc can do the same thing?

Certainly ppc can be made to adapt to this situation, but that would be a
workaround. Do we have a reason to think node 0 is unique and special?
If yes, can we document it so that in future people also know why we consider
node 0 to be special? I do understand the *fear of the unknown*, but when we
are unable to come up with a case either theoretically or practically, it may
be better to actually hit the situation to understand what that unknown is.

> I would swear that we've had x86 system with node 0 but I cannot really
> find it and it is possible that it was not x86 after all...
Michal Hocko July 2, 2020, 8:41 a.m. UTC | #12
On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> * Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]:
> 
> > > >>>>>>
> > > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > > >>>>>> number of online nodes is inconsistent with the information in the
> > > >>>>>> device-tree and resource-dump
> > > >>>>>>
> > > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > > >>>>>> the hit from the unnecessary numa hinting faults.
> > > >>>>>
> > > >>>>> I have to say that I dislike the node online/offline state and directly
> > > >>>>> exporting that to the userspace. Users should only care whether the node
> > > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > > >>>>> offline all the present memory blocks but do not physically hot remove
> > > >>>>> them and you are in the same situation. If users are confused by an
> > > >>>>> output of tools like numactl -H then those could be updated and hide
> > > >>>>> nodes without any memory&cpus.
> > > >>>>>
> > > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > > >>>>> really solve the underlying problem because I strongly suspect that the
> > > >>>>> problem is still there when a numa node gets all its memory offline as
> > > >>>>> mentioned above.
> > 
> > I would really appreciate a feedback to these two as well.
> 
> 1. Its not just numactl that's to be fixed but all tools/utilities that
> depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> in the output given by the kernel but do further verification?  

No, what we are saying is that even an online node might have zero
number of online pages/cpus. So the online status is not really
something that matters. If people are confused by that output then user
space tools can make their confusion go away. I really do not understand
why the kernel should do any logic there.

> Also how would the user space differentiate between the case where the
> Kernel missed marking a node as offline to the case where the memory was
> offlined on a cpuless node but node wasn't offline?.

What I am arguing is that those two shouldn't be any different. Really!

> 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> so if there is a performance hit, its something that's driven by his
> user/admin actions. Also how often do we see users offline complete memory
> of cpuless node on a 2 node system?

How often do we see crippled HW configurations like that? Really if
autonuma should be made more clever for one case it should recognize the
other as well.

> > > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> > 
> > This begs a question whether ppc can do the same thing?
> 
> Certainly ppc can be made to adapt to this situation but that would be a
> workaround. Do we have a reason why we think node 0 is unique and special?

It is not, as replied in another email in this thread. I would hope for
fewer hacks in the numa initialization. Cleaning up the mess would be a lot
of work and testing on all NUMA-capable architectures. This is a heritage
from the past, I am afraid. All that I am arguing here is that touching the
generic code with a very simple looking patch might have side effects which
are pretty much impossible to review. Moreover it seems that nothing but ppc
really needs this treatment, so fixing it in ppc-specific code sounds much
safer.

Normally I would really push for a generic solution, but after getting
burned several times in this area I do not dare anymore. The problem is
not the code complexity but how spread out it is, in places where you
do not expect side effects.
Srikar Dronamraju July 2, 2020, 2:32 p.m. UTC | #13
* Michal Hocko <mhocko@kernel.org> [2020-07-02 10:41:23]:

> On Thu 02-07-20 12:14:08, Srikar Dronamraju wrote:
> > * Michal Hocko <mhocko@kernel.org> [2020-07-01 14:21:10]:
> > 
> > > > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > > > >>>>> really solve the underlying problem because I strongly suspect that the
> > > > >>>>> problem is still there when a numa node gets all its memory offline as
> > > > >>>>> mentioned above.
> > > 
> > > I would really appreciate a feedback to these two as well.
> > 
> > 1. Its not just numactl that's to be fixed but all tools/utilities that
> > depend on /sys/devices/system/node/online. Are we saying to not rely/believe
> > in the output given by the kernel but do further verification?  
> 
> No, what we are saying is that even an online node might have zero
> number of online pages/cpus. So the online status is not really
> something that matters. If people are confused by that output then user
> space tools can make their confusion go away. I really do not understand
> why the kernel should do any logic there.

The user-facing teams are saying they are getting queries from users who
are unable to understand, from the tools/sysfs files, why a node is online
but has no attached resources. It is the amount of time being spent on
these issues that triggered this patch. Initially even I was skeptical that
this was a non-issue.

> 
> > Also how would the user space differentiate between the case where the
> > Kernel missed marking a node as offline to the case where the memory was
> > offlined on a cpuless node but node wasn't offline?.
> 
> What I am arguing is that those two shouldn't be any different. Really!
> 
> > 2. Regarding the autonuma, the case of offline memory is user/admin driven,
> > so if there is a performance hit, its something that's driven by his
> > user/admin actions. Also how often do we see users offline complete memory
> > of cpuless node on a 2 node system?
> 
> How often do we see crippled HW configurations like that? Really if
> autonuma should be made more clever for one case it should recognize the
> other as well.
> 

Let's take a 16-socket PowerVM system and assume that 32 LPARs are created
on that system, i.e. 2 LPARs for each socket. (PowerVM has the final say on
how the LPARs are created.) In such a case, we can expect 30 out of the 32
LPARs to face this problem, with only the 2 LPARs that actually run on
socket 0 having the correct configuration.

> > > 
> > > This begs a question whether ppc can do the same thing?
> > 
> > Certainly ppc can be made to adapt to this situation but that would be a
> > workaround. Do we have a reason why we think node 0 is unique and special?
> 
> It is not. As replied in other email in this thread. I would hope for
> having less hacks in the numa initialization. Cleaning up the mess is
> would be a lot of work and testing on all NUMA capable architectures.
> This is a heritage from the past I am afraid. All that I am arguing here
> is that your touch to the generic code with a very simple looking patch
> might have side effects which are pretty much impossible to review.
> Moreover it seems that nothing but ppc really needs this treatment.
> So fixing it in ppc specific code sounds much more safe.
> 
> Normally I would really push for a generic solution but after getting
> burned several times in this area I do not dare anymore. The problem is
> not in the code complexity but in how spread it is in places where you
> do not expect side effects.
> 

I do understand and respect your viewpoint.

> -- 
> Michal Hocko
> SUSE Labs
Michal Suchanek July 3, 2020, 9:10 a.m. UTC | #14
On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> > On 01.07.20 13:06, David Hildenbrand wrote:
> > > On 01.07.20 13:01, Srikar Dronamraju wrote:
> > >> * David Hildenbrand <david@redhat.com> [2020-07-01 12:15:54]:
> > >>
> > >>> On 01.07.20 12:04, Srikar Dronamraju wrote:
> > >>>> * Michal Hocko <mhocko@kernel.org> [2020-07-01 10:42:00]:
> > >>>>
> > >>>>>
> > >>>>>>
> > >>>>>> 2. Also existence of dummy node also leads to inconsistent information. The
> > >>>>>> number of online nodes is inconsistent with the information in the
> > >>>>>> device-tree and resource-dump
> > >>>>>>
> > >>>>>> 3. When the dummy node is present, single node non-Numa systems end up showing
> > >>>>>> up as NUMA systems and numa_balancing gets enabled. This will mean we take
> > >>>>>> the hit from the unnecessary numa hinting faults.
> > >>>>>
> > >>>>> I have to say that I dislike the node online/offline state and directly
> > >>>>> exporting that to the userspace. Users should only care whether the node
> > >>>>> has memory/cpus. Numa nodes can be online without any memory. Just
> > >>>>> offline all the present memory blocks but do not physically hot remove
> > >>>>> them and you are in the same situation. If users are confused by an
> > >>>>> output of tools like numactl -H then those could be updated and hide
> > >>>>> nodes without any memory&cpus.
> > >>>>>
> > >>>>> The autonuma problem sounds interesting but again this patch doesn't
> > >>>>> really solve the underlying problem because I strongly suspect that the
> > >>>>> problem is still there when a numa node gets all its memory offline as
> > >>>>> mentioned above.
> 
> I would really appreciate a feedback to these two as well.
> 
> > >>>>> While I completely agree that making node 0 special is wrong, I have
> > >>>>> still hard time to review this very simply looking patch because all the
> > >>>>> numa initialization is so spread around that this might just blow up
> > >>>>> at unexpected places. IIRC we have discussed testing in the previous
> > >>>>> version and David has provided a way to emulate these configurations
> > >>>>> on x86. Did you manage to use those instruction for additional testing
> > >>>>> on other than ppc architectures?
> > >>>>>
> > >>>>
> > >>>> I have tried all the steps that David mentioned and reported back at
> > >>>> https://lore.kernel.org/lkml/20200511174731.GD1961@linux.vnet.ibm.com/t/#u
> > >>>>
> > >>>> As a summary, David's steps are still not creating a memoryless/cpuless on
> > >>>> x86 VM.
> > >>>
> > >>> Now, that is wrong. You get a memoryless/cpuless node, which is *not
> > >>> online*. Once you hotplug some memory, it will switch online. Once you
> > >>> remove memory, it will switch back offline.
> > >>>
> > >>
> > >> Let me clarify, we are looking for a node 0 which is cpuless/memoryless at
> > >> boot.  The code in question tries to handle a cpuless/memoryless node 0 at
> > >> boot.
> > > 
> > > I was just correcting your statement, because it was wrong.
> > > 
> > > Could be that x86 code maps PXM 1 to node 0 because PXM 1 does neither
> > > have CPUs nor memory. That would imply that we can, in fact, never have
> > > node 0 offline during boot.
> > > 
> > 
> > Yep, looks like it.
> > 
> > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> 
> This begs a question whether ppc can do the same thing?
Or x86 could stop doing it so that you can see what node you are running on?

What's the point of this indirection other than another way of avoiding
empty node 0?

Thanks

Michal
Michal Hocko July 3, 2020, 9:24 a.m. UTC | #15
[Cc Andi]

On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
[...]
> > > Yep, looks like it.
> > > 
> > > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> > 
> > This begs a question whether ppc can do the same thing?
> Or x86 stop doing it so that you can see on what node you are running?
> 
> What's the point of this indirection other than another way of avoiding
> empty node 0?

Honestly, I do not have any idea. I've traced it down to
Author: Andi Kleen <ak@suse.de>
Date:   Tue Jan 11 15:35:48 2005 -0800

    [PATCH] x86_64: Fix ACPI SRAT NUMA parsing

    Fix fallout from the recent nodemask_t changes. The node ids assigned
    in the SRAT parser were off by one.

    I added a new first_unset_node() function to nodemask.h to allocate
    IDs sanely.

    Signed-off-by: Andi Kleen <ak@suse.de>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

which doesn't really tell all that much. The historical baggage and a
long term behavior which is not really trivial to fix I suspect.
Michal Hocko July 3, 2020, 10:59 a.m. UTC | #16
On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> [Cc Andi]
> 
> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> > On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> > > On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> [...]
> > > > Yep, looks like it.
> > > > 
> > > > [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> > > > [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> > > > [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> > > > [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> > > > [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> > > > [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> > > > [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> > > 
> > > This begs a question whether ppc can do the same thing?
> > Or x86 stop doing it so that you can see on what node you are running?
> > 
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen <ak@suse.de>
> Date:   Tue Jan 11 15:35:48 2005 -0800
> 
>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> 
>     Fix fallout from the recent nodemask_t changes. The node ids assigned
>     in the SRAT parser were off by one.
> 
>     I added a new first_unset_node() function to nodemask.h to allocate
>     IDs sanely.
> 
>     Signed-off-by: Andi Kleen <ak@suse.de>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> which doesn't really tell all that much. The historical baggage and a
> long term behavior which is not really trivial to fix I suspect.

Thinking about this some more, this logic makes some sense after all,
especially in a world without memory hotplug, which was very likely the
case back then. It is much better to have a compact node mask rather than
a sparse one. After all, node numbers shouldn't really matter as long as
you have a clear mapping to the HW. I am not sure we export that
information (except in the kernel ring buffer) though.
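
In other words the SRAT parser hands out node ids compactly, roughly like
this (a simplified sketch, not the actual acpi_map_pxm_to_node() code, which
uses first_unset_node() on a nodemask):

	/*
	 * Each proximity domain that is actually referenced grabs the lowest
	 * unused node id, which is how PXM 1 can end up as node 0 when PXM 0
	 * is empty.
	 */
	#define MAX_PXM		256
	#define NUMA_NO_NODE	(-1)

	static int pxm_to_node_map[MAX_PXM] = { [0 ... MAX_PXM - 1] = NUMA_NO_NODE };
	static unsigned long nodes_found;	/* bitmap of node ids handed out so far */

	static int map_pxm_to_node(int pxm)
	{
		if (pxm_to_node_map[pxm] == NUMA_NO_NODE) {
			int node = __builtin_ctzl(~nodes_found);	/* lowest unused id */

			nodes_found |= 1UL << node;
			pxm_to_node_map[pxm] = node;
		}
		return pxm_to_node_map[pxm];
	}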

Memory hotplug changes that somewhat, because you can hot-remove numa
nodes and therefore make the nodemask sparse, but that is not a common
case. I am not sure what would happen if a completely new node was added
and its corresponding node id was already used by the renumbered one,
though. It would likely conflate the two, I am afraid. But I am not sure
this is really possible with x86, and the lack of a bug report would
suggest that nobody is doing that, at least.
David Hildenbrand July 3, 2020, 11:32 a.m. UTC | #17
On 03.07.20 12:59, Michal Hocko wrote:
> On Fri 03-07-20 11:24:17, Michal Hocko wrote:
>> [Cc Andi]
>>
>> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
>>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
>>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
>> [...]
>>>>> Yep, looks like it.
>>>>>
>>>>> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
>>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
>>>>> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
>>>>> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
>>>>> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
>>>>> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
>>>>
>>>> This begs a question whether ppc can do the same thing?
>>> Or x86 stop doing it so that you can see on what node you are running?
>>>
>>> What's the point of this indirection other than another way of avoiding
>>> empty node 0?
>>
>> Honestly, I do not have any idea. I've traced it down to
>> Author: Andi Kleen <ak@suse.de>
>> Date:   Tue Jan 11 15:35:48 2005 -0800
>>
>>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
>>
>>     Fix fallout from the recent nodemask_t changes. The node ids assigned
>>     in the SRAT parser were off by one.
>>
>>     I added a new first_unset_node() function to nodemask.h to allocate
>>     IDs sanely.
>>
>>     Signed-off-by: Andi Kleen <ak@suse.de>
>>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>>
>> which doesn't really tell all that much. The historical baggage and a
>> long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

I think the ACPI code takes care of properly mapping PXM to nodes.

So if I start with PXM 0 empty and PXM 1 populated, I will get
PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU

$ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
$ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor

$ echo "info numa" | sudo nc -U /var/tmp/monitor
QEMU 5.0.50 monitor - type 'help' for more information
(qemu) info numa
2 nodes
node 0 cpus:
node 0 size: 1024 MB
node 0 plugged: 1024 MB
node 1 cpus: 0 1 2 3
node 1 size: 4096 MB
node 1 plugged: 0 MB

I get in the guest:

[   50.174435] ------------[ cut here ]------------
[   50.175436] node 1 was absent from the node_possible_map
[   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290
[   50.176844] Modules linked in:
[   50.176845] CPU: 0 PID: 7 Comm: kworker/u8:0 Not tainted 5.8.0-rc2+ #4
[   50.176846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   50.176846] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   50.176847] RIP: 0010:add_memory_resource+0x8c/0x290
[   50.176849] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 63 c5 48 89 04 24 48 0f a3 05 94 6c 1c 01 72 17 89 ee 48 c78
[   50.176849] RSP: 0018:ffffa7a1c0043d48 EFLAGS: 00010296
[   50.176850] RAX: 000000000000002c RBX: ffff8bc633e63b80 RCX: 0000000000000000
[   50.176851] RDX: ffff8bc63bc27060 RSI: ffff8bc63bc18d00 RDI: ffff8bc63bc18d00
[   50.176851] RBP: 0000000000000001 R08: 00000000000001e1 R09: ffffa7a1c0043bd8
[   50.176852] R10: 0000000000000005 R11: 0000000000000000 R12: 0000000140000000
[   50.176852] R13: 000000017fffffff R14: 0000000040000000 R15: 0000000180000000
[   50.176853] FS:  0000000000000000(0000) GS:ffff8bc63bc00000(0000) knlGS:0000000000000000
[   50.176853] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.176855] CR2: 000055dfcbfc5ee8 CR3: 00000000aca0a000 CR4: 00000000000006f0
[   50.176855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   50.176856] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   50.176856] Call Trace:
[   50.176856]  __add_memory+0x33/0x70
[   50.176857]  acpi_memory_device_add+0x132/0x2f2
[   50.176857]  acpi_bus_attach+0xd2/0x200
[   50.176858]  acpi_bus_scan+0x33/0x70
[   50.176858]  acpi_device_hotplug+0x298/0x390
[   50.176858]  acpi_hotplug_work_fn+0x3d/0x50
[   50.176859]  process_one_work+0x1b4/0x370
[   50.176859]  worker_thread+0x53/0x3e0
[   50.176860]  ? process_one_work+0x370/0x370
[   50.176860]  kthread+0x119/0x140
[   50.176860]  ? __kthread_bind_mask+0x60/0x60
[   50.176861]  ret_from_fork+0x22/0x30
[   50.176861] ---[ end trace 9a2a837c1e0164f1 ]---
[   50.209816] acpi PNP0C80:00: add_memory failed
[   50.210510] acpi PNP0C80:00: acpi_memory_enable_device() error
[   50.211445] acpi PNP0C80:00: Enumeration failure


I remember that we added that check just recently (due to powerpc if I am not wrong).
Not sure why that triggers here.

But it properly maps PXM 0 to node 1.
Michal Hocko July 3, 2020, 11:46 a.m. UTC | #18
On Fri 03-07-20 13:32:21, David Hildenbrand wrote:
> On 03.07.20 12:59, Michal Hocko wrote:
> > On Fri 03-07-20 11:24:17, Michal Hocko wrote:
> >> [Cc Andi]
> >>
> >> On Fri 03-07-20 11:10:01, Michal Suchanek wrote:
> >>> On Wed, Jul 01, 2020 at 02:21:10PM +0200, Michal Hocko wrote:
> >>>> On Wed 01-07-20 13:30:57, David Hildenbrand wrote:
> >> [...]
> >>>>> Yep, looks like it.
> >>>>>
> >>>>> [    0.009726] SRAT: PXM 1 -> APIC 0x00 -> Node 0
> >>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x01 -> Node 0
> >>>>> [    0.009727] SRAT: PXM 1 -> APIC 0x02 -> Node 0
> >>>>> [    0.009728] SRAT: PXM 1 -> APIC 0x03 -> Node 0
> >>>>> [    0.009731] ACPI: SRAT: Node 0 PXM 1 [mem 0x00000000-0x0009ffff]
> >>>>> [    0.009732] ACPI: SRAT: Node 0 PXM 1 [mem 0x00100000-0xbfffffff]
> >>>>> [    0.009733] ACPI: SRAT: Node 0 PXM 1 [mem 0x100000000-0x13fffffff]
> >>>>
> >>>> This begs a question whether ppc can do the same thing?
> >>> Or x86 stop doing it so that you can see on what node you are running?
> >>>
> >>> What's the point of this indirection other than another way of avoiding
> >>> empty node 0?
> >>
> >> Honestly, I do not have any idea. I've traced it down to
> >> Author: Andi Kleen <ak@suse.de>
> >> Date:   Tue Jan 11 15:35:48 2005 -0800
> >>
> >>     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> >>
> >>     Fix fallout from the recent nodemask_t changes. The node ids assigned
> >>     in the SRAT parser were off by one.
> >>
> >>     I added a new first_unset_node() function to nodemask.h to allocate
> >>     IDs sanely.
> >>
> >>     Signed-off-by: Andi Kleen <ak@suse.de>
> >>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> >>
> >> which doesn't really tell all that much. The historical baggage and a
> >> long term behavior which is not really trivial to fix I suspect.
> > 
> > Thinking about this some more, this logic makes some sense afterall.
> > Especially in the world without memory hotplug which was very likely the
> > case back then. It is much better to have compact node mask rather than
> > sparse one. After all node numbers shouldn't really matter as long as
> > you have a clear mapping to the HW. I am not sure we export that
> > information (except for the kernel ring buffer) though.
> > 
> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> > 
> 
> I think the ACPI code takes care of properly mapping PXM to nodes.
> 
> So if I start with PXM 0 empty and PXM 1 populated, I will get
> PXM 1 == node 0 as described. Once I hotplug something to PXM 0 in QEMU
> 
> $ echo "object_add memory-backend-ram,id=mem0,size=1G" | sudo nc -U /var/tmp/monitor
> $ echo "device_add pc-dimm,id=dimm0,memdev=mem0,node=0" | sudo nc -U /var/tmp/monitor
> 
> $ echo "info numa" | sudo nc -U /var/tmp/monitor
> QEMU 5.0.50 monitor - type 'help' for more information
> (qemu) info numa
> 2 nodes
> node 0 cpus:
> node 0 size: 1024 MB
> node 0 plugged: 1024 MB
> node 1 cpus: 0 1 2 3
> node 1 size: 4096 MB
> node 1 plugged: 0 MB

Thanks for double checking.

> I get in the guest:
> 
> [   50.174435] ------------[ cut here ]------------
> [   50.175436] node 1 was absent from the node_possible_map
> [   50.176844] WARNING: CPU: 0 PID: 7 at mm/memory_hotplug.c:1021 add_memory_resource+0x8c/0x290

This would mean that the ACPI code or whoever does the remapping is not
adding the new node into possible nodes.
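
Roughly, the check that fires in the trace above insists that the node id
handed in by the ACPI layer is already part of the possible-node mask. Below
is a small user-space sketch of that idea, for illustration only -- it uses a
plain bitmap instead of the kernel's nodemask API and is not the actual
mm/memory_hotplug.c code:

#include <stdbool.h>
#include <stdio.h>

#define MAX_NUMNODES 32

/* Bit n set means node n was declared possible at boot. */
static unsigned long node_possible_map;

static bool node_possible(int nid)
{
	return nid >= 0 && nid < MAX_NUMNODES &&
	       (node_possible_map & (1UL << nid));
}

/* Refuse to hot-add memory to a node the platform never declared possible. */
static int add_memory_to_node(int nid)
{
	if (!node_possible(nid)) {
		fprintf(stderr,
			"node %d was absent from the node_possible_map\n", nid);
		return -1;
	}
	/* resource setup and onlining would follow here */
	return 0;
}

int main(void)
{
	node_possible_map = 1UL << 0;	/* only node 0 declared possible */
	add_memory_to_node(1);		/* mirrors the failing hot-add above */
	return 0;
}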

[...]
> I remember that we added that check just recently (due to powerpc if I am not wrong).
> Not sure why that triggers here.

This was a misbehaving Qemu IIRC providing a garbage map.
Srikar Dronamraju July 3, 2020, 12:58 p.m. UTC | #19
* Michal Hocko <mhocko@kernel.org> [2020-07-03 12:59:44]:

> > Honestly, I do not have any idea. I've traced it down to
> > Author: Andi Kleen <ak@suse.de>
> > Date:   Tue Jan 11 15:35:48 2005 -0800
> > 
> >     [PATCH] x86_64: Fix ACPI SRAT NUMA parsing
> > 
> >     Fix fallout from the recent nodemask_t changes. The node ids assigned
> >     in the SRAT parser were off by one.
> > 
> >     I added a new first_unset_node() function to nodemask.h to allocate
> >     IDs sanely.
> > 
> >     Signed-off-by: Andi Kleen <ak@suse.de>
> >     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > 
> > which doesn't really tell all that much. The historical baggage and a
> > long term behavior which is not really trivial to fix I suspect.
> 
> Thinking about this some more, this logic makes some sense afterall.
> Especially in the world without memory hotplug which was very likely the
> case back then. It is much better to have compact node mask rather than
> sparse one. After all node numbers shouldn't really matter as long as
> you have a clear mapping to the HW. I am not sure we export that
> information (except for the kernel ring buffer) though.
> 
> The memory hotplug changes that somehow because you can hotremove numa
> nodes and therefore make the nodemask sparse but that is not a common
> case. I am not sure what would happen if a completely new node was added
> and its corresponding node was already used by the renumbered one
> though. It would likely conflate the two I am afraid. But I am not sure
> this is really possible with x86 and a lack of a bug report would
> suggest that nobody is doing that at least.
> 

JFYI,
Satheesh, copied in this mail chain, had opened a bug a year ago about a crash
with vcpu hotplug on a memoryless node.

https://bugzilla.kernel.org/show_bug.cgi?id=202187
Andi Kleen July 6, 2020, 4:08 p.m. UTC | #20
> > What's the point of this indirection other than another way of avoiding
> > empty node 0?
> 
> Honestly, I do not have any idea. I've traced it down to
> Author: Andi Kleen <ak@suse.de>
> Date:   Tue Jan 11 15:35:48 2005 -0800

I don't remember all the details, and I can't even find the commit
(is it in linux-historic?).

But AFAIK there's no guarantee PXMs are small and contiguous, so it
seemed better to have a clean zero-based space.

Back then we had a lot of problems with buggy SRAT tables in BIOS,
so we really tried to avoid trusting the BIOS as much as possible.

-Andi
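
To make that renumbering concrete: conceptually, each proximity domain seen in
the SRAT gets the lowest node id not handed out yet, so the node id space stays
dense even if firmware reports sparse or large PXM values. A small user-space
model of that first_unset_node() idea follows; it is illustrative only, with
made-up limits and array names, and is not the actual ACPI/SRAT parsing code:

#include <stdio.h>

#define MAX_PXM_DOMAINS	256
#define MAX_NUMNODES	32
#define NUMA_NO_NODE	(-1)

static int pxm_to_node_map[MAX_PXM_DOMAINS];
static unsigned long nodes_found;	/* bit n set => node id n already assigned */

static int first_unset_node(void)
{
	for (int nid = 0; nid < MAX_NUMNODES; nid++)
		if (!(nodes_found & (1UL << nid)))
			return nid;
	return NUMA_NO_NODE;
}

static int map_pxm_to_node(int pxm)
{
	int nid = pxm_to_node_map[pxm];

	if (nid == NUMA_NO_NODE) {	/* first time this PXM is seen */
		nid = first_unset_node();
		pxm_to_node_map[pxm] = nid;
		nodes_found |= 1UL << nid;
	}
	return nid;
}

int main(void)
{
	for (int i = 0; i < MAX_PXM_DOMAINS; i++)
		pxm_to_node_map[i] = NUMA_NO_NODE;

	/* A table that only mentions PXM 1 yields node 0, matching the
	 * "SRAT: PXM 1 -> ... -> Node 0" lines quoted earlier. */
	printf("PXM 1 -> node %d\n", map_pxm_to_node(1));
	printf("PXM 7 -> node %d\n", map_pxm_to_node(7));
	printf("PXM 1 -> node %d\n", map_pxm_to_node(1));	/* stable */
	return 0;
}

Run against a table that only contains PXM 1, this assigns it node 0, which is
exactly the situation in the guest log quoted above.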
Andrew Morton Aug. 7, 2020, 4:32 a.m. UTC | #21
On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> > The memory hotplug changes that somehow because you can hotremove numa
> > nodes and therefore make the nodemask sparse but that is not a common
> > case. I am not sure what would happen if a completely new node was added
> > and its corresponding node was already used by the renumbered one
> > though. It would likely conflate the two I am afraid. But I am not sure
> > this is really possible with x86 and a lack of a bug report would
> > suggest that nobody is doing that at least.
> > 
> 
> JFYI,
> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
> hotplug on memoryless node. 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=202187

So...  do we merge this patch or not?  Seems that the overall view is
"risky but nobody is likely to do anything better any time soon"?
David Hildenbrand Aug. 7, 2020, 6:58 a.m. UTC | #22
On 07.08.20 06:32, Andrew Morton wrote:
> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
>>> The memory hotplug changes that somehow because you can hotremove numa
>>> nodes and therefore make the nodemask sparse but that is not a common
>>> case. I am not sure what would happen if a completely new node was added
>>> and its corresponding node was already used by the renumbered one
>>> though. It would likely conflate the two I am afraid. But I am not sure
>>> this is really possible with x86 and a lack of a bug report would
>>> suggest that nobody is doing that at least.
>>>
>>
>> JFYI,
>> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
>> hotplug on memoryless node. 
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202187
> 
> So...  do we merge this patch or not?  Seems that the overall view is
> "risky but nobody is likely to do anything better any time soon"?

I recall the issue Michal saw was "fix powerpc" vs. "break other
architectures". @Michal how should we proceed? At least x86-64 won't be
affected IIUC.
Michal Suchanek Aug. 7, 2020, 10:04 a.m. UTC | #23
On Fri, Aug 07, 2020 at 08:58:09AM +0200, David Hildenbrand wrote:
> On 07.08.20 06:32, Andrew Morton wrote:
> > On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> > 
> >>> The memory hotplug changes that somehow because you can hotremove numa
> >>> nodes and therefore make the nodemask sparse but that is not a common
> >>> case. I am not sure what would happen if a completely new node was added
> >>> and its corresponding node was already used by the renumbered one
> >>> though. It would likely conflate the two I am afraid. But I am not sure
> >>> this is really possible with x86 and a lack of a bug report would
> >>> suggest that nobody is doing that at least.
> >>>
> >>
> >> JFYI,
> >> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
> >> hotplug on memoryless node. 
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=202187
> > 
> > So...  do we merge this patch or not?  Seems that the overall view is
> > "risky but nobody is likely to do anything better any time soon"?
> 
> I recall the issue Michal saw was "fix powerpc" vs. "break other
> architectures". @Michal how should we proceed? At least x86-64 won't be
> affected IIUC.
There is a patch to introduce the node remapping on ppc as well which
should eliminate the empty node 0.

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200731111916.243569-1-aneesh.kumar@linux.ibm.com/

Thanks

Michal
Srikar Dronamraju Aug. 12, 2020, 6:01 a.m. UTC | #24
Hi Andrew, Michal, David

* Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]:

> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
> > > The memory hotplug changes that somehow because you can hotremove numa
> > > nodes and therefore make the nodemask sparse but that is not a common
> > > case. I am not sure what would happen if a completely new node was added
> > > and its corresponding node was already used by the renumbered one
> > > though. It would likely conflate the two I am afraid. But I am not sure
> > > this is really possible with x86 and a lack of a bug report would
> > > suggest that nobody is doing that at least.
> > > 
> > 
> > JFYI,
> > Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
> > hotplug on memoryless node. 
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=202187
> 
> So...  do we merge this patch or not?  Seems that the overall view is
> "risky but nobody is likely to do anything better any time soon"?

Can we decide on this one way or the other?
David Hildenbrand Aug. 18, 2020, 7:32 a.m. UTC | #25
On 12.08.20 08:01, Srikar Dronamraju wrote:
> Hi Andrew, Michal, David
> 
> * Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]:
> 
>> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>>
>>>> The memory hotplug changes that somehow because you can hotremove numa
>>>> nodes and therefore make the nodemask sparse but that is not a common
>>>> case. I am not sure what would happen if a completely new node was added
>>>> and its corresponding node was already used by the renumbered one
>>>> though. It would likely conflate the two I am afraid. But I am not sure
>>>> this is really possible with x86 and a lack of a bug report would
>>>> suggest that nobody is doing that at least.
>>>>
>>>
>>> JFYI,
>>> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
>>> hotplug on memoryless node. 
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202187
>>
>> So...  do we merge this patch or not?  Seems that the overall view is
>> "risky but nobody is likely to do anything better any time soon"?
> 
> Can we decide on this one way or the other?

Hmm, not sure who's the person to decide. I tend to prefer doing the
node renaming, handling this in ppc code; looking at the review of v2
there are still some concerns regarding numa distances.

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200817103238.158133-1-aneesh.kumar@linux.ibm.com/
Michal Hocko Aug. 18, 2020, 7:37 a.m. UTC | #26
On Tue 18-08-20 09:32:52, David Hildenbrand wrote:
> On 12.08.20 08:01, Srikar Dronamraju wrote:
> > Hi Andrew, Michal, David
> > 
> > * Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]:
> > 
> >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> >>
> >>>> The memory hotplug changes that somehow because you can hotremove numa
> >>>> nodes and therefore make the nodemask sparse but that is not a common
> >>>> case. I am not sure what would happen if a completely new node was added
> >>>> and its corresponding node was already used by the renumbered one
> >>>> though. It would likely conflate the two I am afraid. But I am not sure
> >>>> this is really possible with x86 and a lack of a bug report would
> >>>> suggest that nobody is doing that at least.
> >>>>
> >>>
> >>> JFYI,
> >>> Satheesh copied in this mailchain had opened a bug a year on crash with vcpu
> >>> hotplug on memoryless node. 
> >>>
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=202187
> >>
> >> So...  do we merge this patch or not?  Seems that the overall view is
> >> "risky but nobody is likely to do anything better any time soon"?
> > 
> > Can we decide on this one way or the other?
> 
> Hmm, not sure who's the person to decide. I tend to prefer doing the
> node renaming, handling this in ppc code;

Agreed. That would be a safer option.
Srikar Dronamraju Aug. 18, 2020, 7:49 a.m. UTC | #27
* Michal Hocko <mhocko@suse.com> [2020-08-18 09:37:12]:

> On Tue 18-08-20 09:32:52, David Hildenbrand wrote:
> > On 12.08.20 08:01, Srikar Dronamraju wrote:
> > > Hi Andrew, Michal, David
> > > 
> > > * Andrew Morton <akpm@linux-foundation.org> [2020-08-06 21:32:11]:
> > > 
> > >> On Fri, 3 Jul 2020 18:28:23 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> > >>
> > >>>> The memory hotplug changes that somehow because you can hotremove numa
> > >>>> nodes and therefore make the nodemask sparse but that is not a common
> > >>>> case. I am not sure what would happen if a completely new node was added
> > >>>> and its corresponding node was already used by the renumbered one
> > >>>> though. It would likely conflate the two I am afraid. But I am not sure
> > >>>> this is really possible with x86 and a lack of a bug report would
> > >>>> suggest that nobody is doing that at least.
> > >>>>
> > >> So...  do we merge this patch or not?  Seems that the overall view is
> > >> "risky but nobody is likely to do anything better any time soon"?
> > > 
> > > Can we decide on this one way or the other?
> > 
> > Hmm, not sure who's the person to decide. I tend to prefer doing the
> > node renaming, handling this in ppc code;
> 
> Agreed. That would be a safer option.

Okay, will send arch specific v6 version.

> -- 
> Michal Hocko
> SUSE Labs
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..5187664558e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -117,8 +117,10 @@  EXPORT_SYMBOL(latent_entropy);
  */
 nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 	[N_POSSIBLE] = NODE_MASK_ALL,
+#ifdef CONFIG_NUMA
+	[N_ONLINE] = NODE_MASK_NONE,
+#else
 	[N_ONLINE] = { { [0] = 1UL } },
-#ifndef CONFIG_NUMA
 	[N_NORMAL_MEMORY] = { { [0] = 1UL } },
 #ifdef CONFIG_HIGHMEM
 	[N_HIGH_MEMORY] = { { [0] = 1UL } },