diff mbox

[1/2] arm64: avoid alloc memory on offline node

Message ID 874lhz3pmn.fsf@e105922-lin.cambridge.arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Punit Agrawal June 19, 2018, 12:52 p.m. UTC
Michal Hocko <mhocko@kernel.org> writes:

> On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> [...]
>> I tested on a arm board with 128 cores 4 numa nodes, but I set CONFIG_NR_CPUS=72.
>> Then node 3 is not be created, because node 3 has no memory, and no cpu.
>> But some pci device may related to node 3, which be set in ACPI table.
>
> Could you double check that zonelists for node 3 are generated
> correctly?

The cpus in node 3 aren't onlined and there's no memory attached - I
suspect that no zonelists are built for this node.

We skip creating a node, if the number of SRAT entries parsed exceeds
NR_CPUS[0]. This in turn prevents onlining the numa node and so no
zonelists will be created for it.

I think the problem will go away if the cpus are restricted via the
kernel command line by setting nr_cpus.

Xie, can you try the below patch on top of the one enabling memoryless
nodes? I'm not sure this is the right solution but at least it'll
confirm the problem.

Thanks,
Punit

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73

-- >8 --

Comments

Lorenzo Pieralisi June 19, 2018, 2:08 p.m. UTC | #1
On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
> Michal Hocko <mhocko@kernel.org> writes:
> 
> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
> > [...]
> >> I tested on a arm board with 128 cores 4 numa nodes, but I set CONFIG_NR_CPUS=72.
> >> Then node 3 is not be created, because node 3 has no memory, and no cpu.
> >> But some pci device may related to node 3, which be set in ACPI table.
> >
> > Could you double check that zonelists for node 3 are generated
> > correctly?
> 
> The cpus in node 3 aren't onlined and there's no memory attached - I
> suspect that no zonelists are built for this node.
> 
> We skip creating a node, if the number of SRAT entries parsed exceeds
> NR_CPUS[0]. This in turn prevents onlining the numa node and so no
> zonelists will be created for it.
> 
> I think the problem will go away if the cpus are restricted via the
> kernel command line by setting nr_cpus.
> 
> Xie, can you try the below patch on top of the one enabling memoryless
> nodes? I'm not sure this is the right solution but at least it'll
> confirm the problem.

This issue looks familiar (or at least related):

git log d3bd058826aa

The reason why the NR_CPUS guard is there is to avoid overflowing
the early_node_cpu_hwid array. IA64 does something different in
that respect compared to x86, we have to have a look into this.

Regardless, AFAICS the proximity domains to nodes mappings should not
depend on CONFIG_NR_CPUS, it seems that there is something wrong in that
in ARM64 ACPI SRAT parsing.

Lorenzo

> 
> Thanks,
> Punit
> 
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/acpi_numa.c?h=v4.18-rc1#n73
> 
> -- >8 --
> diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
> index d190a7b231bf..fea0f7164f1a 100644
> --- a/arch/arm64/kernel/acpi_numa.c
> +++ b/arch/arm64/kernel/acpi_numa.c
> @@ -70,11 +70,9 @@ void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
>         if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
>                 return;
> 
> -   if (cpus_in_srat >= NR_CPUS) {
> + if (cpus_in_srat >= NR_CPUS)
>                 pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
>                              NR_CPUS);
> -           return;
> -   }
> 
>         pxm = pa->proximity_domain;
>         node = acpi_map_pxm_to_node(pxm);
Punit Agrawal June 19, 2018, 2:54 p.m. UTC | #2
Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> writes:

> On Tue, Jun 19, 2018 at 01:52:16PM +0100, Punit Agrawal wrote:
>> Michal Hocko <mhocko@kernel.org> writes:
>> 
>> > On Tue 19-06-18 20:03:07, Xie XiuQi wrote:
>> > [...]
>> >> I tested on a arm board with 128 cores 4 numa nodes, but I set CONFIG_NR_CPUS=72.
>> >> Then node 3 is not be created, because node 3 has no memory, and no cpu.
>> >> But some pci device may related to node 3, which be set in ACPI table.
>> >
>> > Could you double check that zonelists for node 3 are generated
>> > correctly?
>> 
>> The cpus in node 3 aren't onlined and there's no memory attached - I
>> suspect that no zonelists are built for this node.
>> 
>> We skip creating a node, if the number of SRAT entries parsed exceeds
>> NR_CPUS[0]. This in turn prevents onlining the numa node and so no
>> zonelists will be created for it.
>> 
>> I think the problem will go away if the cpus are restricted via the
>> kernel command line by setting nr_cpus.
>> 
>> Xie, can you try the below patch on top of the one enabling memoryless
>> nodes? I'm not sure this is the right solution but at least it'll
>> confirm the problem.
>
> This issue looks familiar (or at least related):
>
> git log d3bd058826aa

Indeed. Thanks for digging into this.

>
> The reason why the NR_CPUS guard is there is to avoid overflowing
> the early_node_cpu_hwid array.

Ah right... I missed that. The below patch is definitely not what we
want.

> IA64 does something different in
> that respect compared to x86, we have to have a look into this.
>
> Regardless, AFAICS the proximity domains to nodes mappings should not
> depend on CONFIG_NR_CPUS, it seems that there is something wrong in that
> in ARM64 ACPI SRAT parsing.

Not only SRAT parsing but it looks like there is a similar restriction
while parsing the ACPI MADT in acpi_map_gic_cpu_interface().

The incomplete parsing introduces a dependency on the ordering of
entries being aligned between SRAT and MADT when NR_CPUS is
restricted. We want to parse the entire table in both cases so that the
code is robust to reordering of entries.

In terms of $SUBJECT, I wonder if it's worth taking the original patch
as a temporary fix (it'll also be easier to backport) while we work on
fixing these other issues and enabling memoryless nodes.
Michal Hocko June 19, 2018, 3:14 p.m. UTC | #3
On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
[...]
> In terms of $SUBJECT, I wonder if it's worth taking the original patch
> as a temporary fix (it'll also be easier to backport) while we work on
> fixing these other issues and enabling memoryless nodes.

Well, x86 already does that but copying this antipatern is not really
nice. So it is good as a quick fix but it would be definitely much
better to have a robust fix. Who knows how many other places might hit
this. You certainly do not want to add a hack like this all over...
Punit Agrawal June 19, 2018, 3:35 p.m. UTC | #4
Michal Hocko <mhocko@kernel.org> writes:

> On Tue 19-06-18 15:54:26, Punit Agrawal wrote:
> [...]
>> In terms of $SUBJECT, I wonder if it's worth taking the original patch
>> as a temporary fix (it'll also be easier to backport) while we work on
>> fixing these other issues and enabling memoryless nodes.
>
> Well, x86 already does that but copying this antipatern is not really
> nice. So it is good as a quick fix but it would be definitely much
> better to have a robust fix. Who knows how many other places might hit
> this. You certainly do not want to add a hack like this all over...

Completely agree! I was only suggesting it as a temporary measure,
especially as it looked like a proper fix might be invasive.

Another fix might be to change the node specific allocation to node
agnostic allocations. It isn't clear why the allocation is being
requested from a specific node. I think Lorenzo suggested this in one of
the threads.

I've started putting together a set fixing the issues identified in this
thread. It should give a better idea on the best course of action.
diff mbox

Patch

diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..fea0f7164f1a 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -70,11 +70,9 @@  void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
        if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
                return;

-   if (cpus_in_srat >= NR_CPUS) {
+ if (cpus_in_srat >= NR_CPUS)
                pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
                             NR_CPUS);
-           return;
-   }

        pxm = pa->proximity_domain;
        node = acpi_map_pxm_to_node(pxm);