arm64: numa: rework ACPI NUMA initialization

Message ID 20180625130552.5636-1-lorenzo.pieralisi@arm.com (mailing list archive)
State New, archived

Commit Message

Lorenzo Pieralisi June 25, 2018, 1:05 p.m. UTC
Current ACPI ARM64 NUMA initialization code in

acpi_numa_gicc_affinity_init()

carries out NUMA node creation and cpu<->node mappings at the same time
in the arch backend so that a single SRAT walk is needed to parse both
pieces of information.  This implies that the cpu<->node mappings must
be stashed in an array (sized NR_CPUS) so that SMP code can later use
the stashed values to avoid another SRAT table walk to set-up the early
cpu<->node mappings.

If the kernel is configured with a NR_CPUS value smaller than the
number of processor entries in the SRAT (and MADT), the logic in
acpi_numa_gicc_affinity_init() is broken in that the cpu<->node mapping
is carried out (and stashed for future use) only for SRAT entries up to
NR_CPUS, which do not necessarily correspond to the possible cpus
detected at SMP initialization in acpi_map_gic_cpu_interface() (ie the
order of MADT and SRAT processor entries is not enforced), which leaves
the kernel with broken cpu<->node mappings.

Furthermore, given the current ACPI NUMA code parsing logic in
acpi_numa_gicc_affinity_init(), PXM domains for CPUs that are not
parsed because they exceed the NR_CPUS limit are not mapped to NUMA
nodes (ie the node corresponding to the PXM is not created in the
kernel), leaving the system with a broken NUMA topology.

Rework the ACPI ARM64 NUMA initialization process so that NUMA node
creation and cpu<->node mappings are decoupled. cpu<->node mappings
are moved to SMP initialization code (where they are needed), at the
cost of an extra SRAT walk, so that ACPI NUMA mappings can be batched
before being applied, fixing the current parsing pitfalls.
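
The reworked SMP init path boils down to the following (condensed from
the smp.c hunk below, using the helpers added in acpi_numa.c):

  static void __init acpi_parse_and_init_cpus(void)
  {
          int i;

          /* Walk the MADT to build the cpu logical map. */
          acpi_table_parse_madt(ACPI_MADT_TYPE_GENERIC_INTERRUPT,
                                acpi_parse_gic_cpu_interface, 0);

          /* Walk the SRAT once, batching PXM <-> logical cpu mappings. */
          acpi_map_cpus_to_nodes();

          /* Apply the batched mappings to the possible cpus. */
          for (i = 0; i < nr_cpu_ids; i++)
                  early_map_cpu_to_node(i, acpi_numa_get_nid(i));
  }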

Fixes: d8b47fca8c23 ("arm64, ACPI, NUMA: NUMA support based on SRAT and SLIT")
Link: http://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Punit Agrawal <punit.agrawal@arm.com>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
---
 arch/arm64/include/asm/acpi.h |  6 ++-
 arch/arm64/kernel/acpi_numa.c | 88 ++++++++++++++++++++++++++-----------------
 arch/arm64/kernel/smp.c       | 39 +++++++++++++------
 3 files changed, 85 insertions(+), 48 deletions(-)

Comments

Hanjun Guo June 28, 2018, 12:55 p.m. UTC | #1
On 2018/6/25 21:05, Lorenzo Pieralisi wrote:
> Current ACPI ARM64 NUMA initialization code in
> 
> acpi_numa_gicc_affinity_init()
> 
> carries out NUMA nodes creation and cpu<->node mappings at the same time
> in the arch backend so that a single SRAT walk is needed to parse both
> pieces of information.  This implies that the cpu<->node mappings must
> be stashed in an array (sized NR_CPUS) so that SMP code can later use
> the stashed values to avoid another SRAT table walk to set-up the early
> cpu<->node mappings.
> 
> If the kernel is configured with a NR_CPUS value less than the actual
> processor entries in the SRAT (and MADT), the logic in
> acpi_numa_gicc_affinity_init() is broken in that the cpu<->node mapping
> is only carried out (and stashed for future use) only for a number of
> SRAT entries up to NR_CPUS, which do not necessarily correspond to the
> possible cpus detected at SMP initialization in
> acpi_map_gic_cpu_interface() (ie MADT and SRAT processor entries order
> is not enforced), which leaves the kernel with broken cpu<->node
> mappings.
> 
> Furthermore, given the current ACPI NUMA code parsing logic in
> acpi_numa_gicc_affinity_init(), PXM domains for CPUs that are not parsed
> because they exceed NR_CPUS entries are not mapped to NUMA nodes (ie the
> PXM corresponding node is not created in the kernel) leaving the system
> with a broken NUMA topology.
> 
> Rework the ACPI ARM64 NUMA initialization process so that the NUMA
> nodes creation and cpu<->node mappings are decoupled. cpu<->node
> mappings are moved to SMP initialization code (where they are needed),
> at the cost of an extra SRAT walk so that ACPI NUMA mappings can be
> batched before being applied, fixing current parsing pitfalls.
> 
> Fixes: d8b47fca8c23 ("arm64, ACPI, NUMA: NUMA support based on SRAT and
> SLIT")
> Link: http://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> Cc: Punit Agrawal <punit.agrawal@arm.com>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Hanjun Guo <guohanjun@huawei.com>
> Cc: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> Cc: Jeremy Linton <jeremy.linton@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Xie XiuQi <xiexiuqi@huawei.com>
> ---
>  arch/arm64/include/asm/acpi.h |  6 ++-
>  arch/arm64/kernel/acpi_numa.c | 88 ++++++++++++++++++++++++++-----------------
>  arch/arm64/kernel/smp.c       | 39 +++++++++++++------
>  3 files changed, 85 insertions(+), 48 deletions(-)

Looks good to me,

Acked-by: Hanjun Guo <hanjun.guo@linaro.org>

Tested on D05 with NR_CPUS=48 (with the last NUMA node booting
without CPUs); the system works fine. If Xiuqi can test this patch
on D06 with a memory-less node, that would be more helpful.

Thanks
Hanjun
John Garry July 4, 2018, 11:23 a.m. UTC | #2
On 28/06/2018 13:55, Hanjun Guo wrote:
> On 2018/6/25 21:05, Lorenzo Pieralisi wrote:
>> Current ACPI ARM64 NUMA initialization code in
>>
>> acpi_numa_gicc_affinity_init()
>>
>> carries out NUMA nodes creation and cpu<->node mappings at the same time
>> in the arch backend so that a single SRAT walk is needed to parse both
>> pieces of information.  This implies that the cpu<->node mappings must
>> be stashed in an array (sized NR_CPUS) so that SMP code can later use
>> the stashed values to avoid another SRAT table walk to set-up the early
>> cpu<->node mappings.
>>
>> If the kernel is configured with a NR_CPUS value less than the actual
>> processor entries in the SRAT (and MADT), the logic in
>> acpi_numa_gicc_affinity_init() is broken in that the cpu<->node mapping
>> is only carried out (and stashed for future use) only for a number of
>> SRAT entries up to NR_CPUS, which do not necessarily correspond to the
>> possible cpus detected at SMP initialization in
>> acpi_map_gic_cpu_interface() (ie MADT and SRAT processor entries order
>> is not enforced), which leaves the kernel with broken cpu<->node
>> mappings.
>>
>> Furthermore, given the current ACPI NUMA code parsing logic in
>> acpi_numa_gicc_affinity_init(), PXM domains for CPUs that are not parsed
>> because they exceed NR_CPUS entries are not mapped to NUMA nodes (ie the
>> PXM corresponding node is not created in the kernel) leaving the system
>> with a broken NUMA topology.
>>
>> Rework the ACPI ARM64 NUMA initialization process so that the NUMA
>> nodes creation and cpu<->node mappings are decoupled. cpu<->node
>> mappings are moved to SMP initialization code (where they are needed),
>> at the cost of an extra SRAT walk so that ACPI NUMA mappings can be
>> batched before being applied, fixing current parsing pitfalls.
>>
>> Fixes: d8b47fca8c23 ("arm64, ACPI, NUMA: NUMA support based on SRAT and
>> SLIT")
>> Link: http://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
>> Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
>> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
>> Cc: Punit Agrawal <punit.agrawal@arm.com>
>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Hanjun Guo <guohanjun@huawei.com>
>> Cc: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>> Cc: Jeremy Linton <jeremy.linton@arm.com>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Xie XiuQi <xiexiuqi@huawei.com>
>> ---
>>  arch/arm64/include/asm/acpi.h |  6 ++-
>>  arch/arm64/kernel/acpi_numa.c | 88 ++++++++++++++++++++++++++-----------------
>>  arch/arm64/kernel/smp.c       | 39 +++++++++++++------
>>  3 files changed, 85 insertions(+), 48 deletions(-)
>
> Looks good to me,
>
> Acked-by: Hanjun Guo <hanjun.guo@linaro.org>
>
> Tested on D05 with NR_CPUS=48 (with last NUMA node boot
> without CPUs), the system works fine. If Xiuqi can test
> this patch on D06 with memory-less node, that would be
> more helpful.
>

Hi Lorenzo,

Thanks for this. I have noticed that we now miss this log, which I
think was somewhat useful:

ACPI: NUMA: SRAT: cpu_to_node_map[5] is too small, may not be able to use all cpus

(I tested with an arbitrary NR_CPUS of 5)

For example, the default ARM64 defconfig sets NR_CPUS to a default of
64, while some boards now have > 64 CPUs, so this info would be missed
with a vanilla kernel, right?

Also, please note that we now have this log:
[    0.390565] smp: Brought up 4 nodes, 5 CPUs

while before we had:
[    0.390561] smp: Brought up 1 node, 5 CPUs

Maybe my understanding is wrong, but I find this misleading as only 1 
node was "Brought up".

But the patch fixes our crash on D06:
Tested-by: John Garry <john.garry@huawei.com>

Thanks very much,
John

> Thanks
> Hanjun
>
Lorenzo Pieralisi July 4, 2018, 1:21 p.m. UTC | #3
[+Michal]

On Wed, Jul 04, 2018 at 12:23:08PM +0100, John Garry wrote:
> On 28/06/2018 13:55, Hanjun Guo wrote:
> >On 2018/6/25 21:05, Lorenzo Pieralisi wrote:
> >>Current ACPI ARM64 NUMA initialization code in
> >>
> >>acpi_numa_gicc_affinity_init()
> >>
> >>carries out NUMA nodes creation and cpu<->node mappings at the same time
> >>in the arch backend so that a single SRAT walk is needed to parse both
> >>pieces of information.  This implies that the cpu<->node mappings must
> >>be stashed in an array (sized NR_CPUS) so that SMP code can later use
> >>the stashed values to avoid another SRAT table walk to set-up the early
> >>cpu<->node mappings.
> >>
> >>If the kernel is configured with a NR_CPUS value less than the actual
> >>processor entries in the SRAT (and MADT), the logic in
> >>acpi_numa_gicc_affinity_init() is broken in that the cpu<->node mapping
> >>is only carried out (and stashed for future use) only for a number of
> >>SRAT entries up to NR_CPUS, which do not necessarily correspond to the
> >>possible cpus detected at SMP initialization in
> >>acpi_map_gic_cpu_interface() (ie MADT and SRAT processor entries order
> >>is not enforced), which leaves the kernel with broken cpu<->node
> >>mappings.
> >>
> >>Furthermore, given the current ACPI NUMA code parsing logic in
> >>acpi_numa_gicc_affinity_init(), PXM domains for CPUs that are not parsed
> >>because they exceed NR_CPUS entries are not mapped to NUMA nodes (ie the
> >>PXM corresponding node is not created in the kernel) leaving the system
> >>with a broken NUMA topology.
> >>
> >>Rework the ACPI ARM64 NUMA initialization process so that the NUMA
> >>nodes creation and cpu<->node mappings are decoupled. cpu<->node
> >>mappings are moved to SMP initialization code (where they are needed),
> >>at the cost of an extra SRAT walk so that ACPI NUMA mappings can be
> >>batched before being applied, fixing current parsing pitfalls.
> >>
> >>Fixes: d8b47fca8c23 ("arm64, ACPI, NUMA: NUMA support based on SRAT and
> >>SLIT")
> >>Link: http://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
> >>Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
> >>Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
> >>Cc: Punit Agrawal <punit.agrawal@arm.com>
> >>Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> >>Cc: Will Deacon <will.deacon@arm.com>
> >>Cc: Hanjun Guo <guohanjun@huawei.com>
> >>Cc: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
> >>Cc: Jeremy Linton <jeremy.linton@arm.com>
> >>Cc: Catalin Marinas <catalin.marinas@arm.com>
> >>Cc: Xie XiuQi <xiexiuqi@huawei.com>
> >>---
> >> arch/arm64/include/asm/acpi.h |  6 ++-
> >> arch/arm64/kernel/acpi_numa.c | 88 ++++++++++++++++++++++++++-----------------
> >> arch/arm64/kernel/smp.c       | 39 +++++++++++++------
> >> 3 files changed, 85 insertions(+), 48 deletions(-)
> >
> >Looks good to me,
> >
> >Acked-by: Hanjun Guo <hanjun.guo@linaro.org>
> >
> >Tested on D05 with NR_CPUS=48 (with last NUMA node boot
> >without CPUs), the system works fine. If Xiuqi can test
> >this patch on D06 with memory-less node, that would be
> >more helpful.
> >
> 
> Hi Lorenzo,
> 
> Thanks for this. I have noticed we now miss this log, which I think
> was somewhat useful:
> ACPI: NUMA: SRAT: cpu_to_node_map[5] is too small, may not be able
> to use all cpus
> 
> (I tested arbitary 5 CPUs)
> 
> For example, the default ARM64 defconfig specifies NR_CPUs default
> at 64, while some boards now have > 64 CPUs, so this info would be
> missed with a vanilla kernel, right?

I did that on purpose, since the aim of this patch is to remove that
restriction: we should not be limited by NR_CPUS when we parse the
SRAT, and that is what this patch does.

> Also, please note that we now have this log:
> [    0.390565] smp: Brought up 4 nodes, 5 CPUs
> 
> while before we had:
> [    0.390561] smp: Brought up 1 node, 5 CPUs
> 
> Maybe my understanding is wrong, but I find this misleading as only
> 1 node was "Brought up".

Well, that's exactly where the problem lies. This patch allows the
kernel to initialize NUMA nodes associated with CPUs that are not
"brought up" with the current kernel owing to the NR_CPUS restriction.

So I think this patch still does the right thing. I reworked the code
mechanically since it looked wrong to me, but I have to confess I do
not understand the NUMA internals in depth either.

AFAICS the original problem was that, by making the NUMA parsing
dependent on NR_CPUS, we were not "bringing online" NUMA nodes that
are associated with CPUs, and this caused memory allocation failures.
If this patch fixes the problem, that means that we actually "bring
up" the required NUMA nodes (and create zonelists for them) correctly.
So the updated smp: log above should be right.
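
For reference, the reworked SRAT GICC callback (condensed from the
patch, with the invalid-node error handling trimmed) now records the
parsed node unconditionally, with no NR_CPUS check:

  void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
  {
          int pxm, node;

          if (srat_disabled())
                  return;

          if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
                  return;

          pxm = pa->proximity_domain;
          node = acpi_map_pxm_to_node(pxm);
          /* ... error handling for an invalid node elided ... */

          /* The NR_CPUS bail-out is gone: the node is always recorded. */
          node_set(node, numa_nodes_parsed);
  }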

I CC'ed Michal since he knows core NUMA internals much better than I do,
thoughts appreciated, thanks.

Lorenzo

> But the patch fixes our crash on D06:
> Tested-by: John Garry <john.garry@huawei.com>
> 
> Thanks very much,
> John
> 
> >Thanks
> >Hanjun
> >
John Garry July 4, 2018, 2:17 p.m. UTC | #4
On 04/07/2018 14:21, Lorenzo Pieralisi wrote:
> [+Michal]
>
> On Wed, Jul 04, 2018 at 12:23:08PM +0100, John Garry wrote:
>> On 28/06/2018 13:55, Hanjun Guo wrote:
>>> On 2018/6/25 21:05, Lorenzo Pieralisi wrote:
>>>> Current ACPI ARM64 NUMA initialization code in
>>>>
>>>> acpi_numa_gicc_affinity_init()
>>>>
>>>> carries out NUMA nodes creation and cpu<->node mappings at the same time
>>>> in the arch backend so that a single SRAT walk is needed to parse both
>>>> pieces of information.  This implies that the cpu<->node mappings must
>>>> be stashed in an array (sized NR_CPUS) so that SMP code can later use
>>>> the stashed values to avoid another SRAT table walk to set-up the early
>>>> cpu<->node mappings.
>>>>
>>>> If the kernel is configured with a NR_CPUS value less than the actual
>>>> processor entries in the SRAT (and MADT), the logic in
>>>> acpi_numa_gicc_affinity_init() is broken in that the cpu<->node mapping
>>>> is only carried out (and stashed for future use) only for a number of
>>>> SRAT entries up to NR_CPUS, which do not necessarily correspond to the
>>>> possible cpus detected at SMP initialization in
>>>> acpi_map_gic_cpu_interface() (ie MADT and SRAT processor entries order
>>>> is not enforced), which leaves the kernel with broken cpu<->node
>>>> mappings.
>>>>
>>>> Furthermore, given the current ACPI NUMA code parsing logic in
>>>> acpi_numa_gicc_affinity_init(), PXM domains for CPUs that are not parsed
>>>> because they exceed NR_CPUS entries are not mapped to NUMA nodes (ie the
>>>> PXM corresponding node is not created in the kernel) leaving the system
>>>> with a broken NUMA topology.
>>>>
>>>> Rework the ACPI ARM64 NUMA initialization process so that the NUMA
>>>> nodes creation and cpu<->node mappings are decoupled. cpu<->node
>>>> mappings are moved to SMP initialization code (where they are needed),
>>>> at the cost of an extra SRAT walk so that ACPI NUMA mappings can be
>>>> batched before being applied, fixing current parsing pitfalls.
>>>>
>>>> Fixes: d8b47fca8c23 ("arm64, ACPI, NUMA: NUMA support based on SRAT and
>>>> SLIT")
>>>> Link: http://lkml.kernel.org/r/1527768879-88161-2-git-send-email-xiexiuqi@huawei.com
>>>> Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
>>>> Signed-off-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
>>>> Cc: Punit Agrawal <punit.agrawal@arm.com>
>>>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>>>> Cc: Will Deacon <will.deacon@arm.com>
>>>> Cc: Hanjun Guo <guohanjun@huawei.com>
>>>> Cc: Ganapatrao Kulkarni <gkulkarni@caviumnetworks.com>
>>>> Cc: Jeremy Linton <jeremy.linton@arm.com>
>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>> Cc: Xie XiuQi <xiexiuqi@huawei.com>
>>>> ---
>>>> arch/arm64/include/asm/acpi.h |  6 ++-
>>>> arch/arm64/kernel/acpi_numa.c | 88 ++++++++++++++++++++++++++-----------------
>>>> arch/arm64/kernel/smp.c       | 39 +++++++++++++------
>>>> 3 files changed, 85 insertions(+), 48 deletions(-)
>>>
>>> Looks good to me,
>>>
>>> Acked-by: Hanjun Guo <hanjun.guo@linaro.org>
>>>
>>> Tested on D05 with NR_CPUS=48 (with last NUMA node boot
>>> without CPUs), the system works fine. If Xiuqi can test
>>> this patch on D06 with memory-less node, that would be
>>> more helpful.
>>>
>>
>> Hi Lorenzo,
>>
>> Thanks for this. I have noticed we now miss this log, which I think
>> was somewhat useful:
>> ACPI: NUMA: SRAT: cpu_to_node_map[5] is too small, may not be able
>> to use all cpus
>>
>> (I tested arbitary 5 CPUs)
>>
>> For example, the default ARM64 defconfig specifies NR_CPUs default
>> at 64, while some boards now have > 64 CPUs, so this info would be
>> missed with a vanilla kernel, right?
>

Hi Lorenzo,

> I did that on purpose since the aim of this patch is to remove
> that restriction, we should not be limited by the NR_CPUS when
> we parse the SRAT, that's what this patch does.

OK, understood. But I still do think that it would be useful for the 
user to know that the kernel does not support the number of CPUs in the 
system, even if this parsing is not the right place to detect/report.

>
>> Also, please note that we now have this log:
>> [    0.390565] smp: Brought up 4 nodes, 5 CPUs
>>
>> while before we had:
>> [    0.390561] smp: Brought up 1 node, 5 CPUs
>>
>> Maybe my understanding is wrong, but I find this misleading as only
>> 1 node was "Brought up".
>
> Well, that's exactly where the problem lies. This patch allows the
> kernel to inizialize NUMA nodes associated with CPUs that are not
> "brought up" with the current kernel owing to the NR_CPUS restrictions.
>
> So I think this patch still does the right thing. I reworked
> the code mechanically since it looked wrong to me but I have
> to confess I do not understand the NUMA internals in-depth either.
>
> AFAICS the original problem was that, by making the NUMA parsing
> dependent on the NR_CPUS we were not "bringing online" NUMA nodes
> that are associated with CPUs and this caused memory allocation
> failures. If this patch fixes the problem that means that we
> actually "bring up" the required NUMA nodes (and create
> zonelist for them) correctly. So the update smp: log above should
> be right.
>
> I CC'ed Michal since he knows core NUMA internals much better than I do,
> thoughts appreciated, thanks.
>

For reference, here's the new log snippet:
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x2080000000-0x23ffffffff]
[    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[    0.000000] NUMA: NODE_DATA [mem 0x23fff6f840-0x23fff70fff]
[    0.000000] NUMA: Initmem setup node 1 [<memory-less node>]
[    0.000000] NUMA: NODE_DATA [mem 0x23fff6e080-0x23fff6f83f]
[    0.000000] NUMA: NODE_DATA(1) on node 0
[    0.000000] NUMA: Initmem setup node 2 [<memory-less node>]
[    0.000000] NUMA: NODE_DATA [mem 0x23fff6c8c0-0x23fff6e07f]
[    0.000000] NUMA: NODE_DATA(2) on node 0
[    0.000000] NUMA: Initmem setup node 3 [<memory-less node>]
[    0.000000] NUMA: NODE_DATA [mem 0x23fff6b100-0x23fff6c8bf]
[    0.000000] NUMA: NODE_DATA(3) on node 0
[    0.000000] Zone ranges:
[    0.000000]   DMA32    [mem 0x0000000000000000-0x00000000ffffffff]
[    0.000000]   Normal   [mem 0x0000000100000000-0x00000023ffffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000003942ffff]
[    0.000000]   node   0: [mem 0x0000000039430000-0x000000003956ffff]
[    0.000000]   node   0: [mem 0x0000000039570000-0x000000003963ffff]
[    0.000000]   node   0: [mem 0x0000000039640000-0x00000000396fffff]
[    0.000000]   node   0: [mem 0x0000000039700000-0x000000003971ffff]
[    0.000000]   node   0: [mem 0x0000000039720000-0x0000000039b6ffff]
[    0.000000]   node   0: [mem 0x0000000039b70000-0x000000003eb5ffff]
[    0.000000]   node   0: [mem 0x000000003eb60000-0x000000003eb8ffff]
[    0.000000]   node   0: [mem 0x000000003eb90000-0x000000003fbfffff]
[    0.000000]   node   0: [mem 0x0000002080000000-0x00000023ffffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x00000023ffffffff]
[    0.000000] Could not find start_pfn for node 1
[    0.000000] Initmem setup node 1 [mem 0x0000000000000000-0x0000000000000000]
[    0.000000] Could not find start_pfn for node 2
[    0.000000] Initmem setup node 2 [mem 0x0000000000000000-0x0000000000000000]
[    0.000000] Could not find start_pfn for node 3
[    0.000000] Initmem setup node 3 [mem 0x0000000000000000-0x0000000000000000]
[    0.000000] psci: probing for conduit method from ACPI.
[    0.000000] psci: PSCIv1.0 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30000 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30001 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30002 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30003 -> Node 0
[    0.000000] ACPI: NUMA: SRAT: PXM 0 -> MPIDR 0x30100 -> Node 0


Thanks,
John

> Lorenzo
>
>> But the patch fixes our crash on D06:
>> Tested-by: John Garry <john.garry@huawei.com>
>>
>> Thanks very much,
>> John
>>
>>> Thanks
>>> Hanjun
>>>

Patch

diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
index 0db62a4cbce2..d1c290f29b6f 100644
--- a/arch/arm64/include/asm/acpi.h
+++ b/arch/arm64/include/asm/acpi.h
@@ -134,10 +134,12 @@  pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr);
 
 #ifdef CONFIG_ACPI_NUMA
 int arm64_acpi_numa_init(void);
-int acpi_numa_get_nid(unsigned int cpu, u64 hwid);
+int acpi_numa_get_nid(unsigned int cpu);
+void acpi_map_cpus_to_nodes(void);
 #else
 static inline int arm64_acpi_numa_init(void) { return -ENOSYS; }
-static inline int acpi_numa_get_nid(unsigned int cpu, u64 hwid) { return NUMA_NO_NODE; }
+static inline int acpi_numa_get_nid(unsigned int cpu) { return NUMA_NO_NODE; }
+static inline void acpi_map_cpus_to_nodes(void) { }
 #endif /* CONFIG_ACPI_NUMA */
 
 #define ACPI_TABLE_UPGRADE_MAX_PHYS MEMBLOCK_ALLOC_ACCESSIBLE
diff --git a/arch/arm64/kernel/acpi_numa.c b/arch/arm64/kernel/acpi_numa.c
index d190a7b231bf..4f4f1815e047 100644
--- a/arch/arm64/kernel/acpi_numa.c
+++ b/arch/arm64/kernel/acpi_numa.c
@@ -26,36 +26,73 @@ 
 #include <linux/module.h>
 #include <linux/topology.h>
 
-#include <acpi/processor.h>
 #include <asm/numa.h>
 
-static int cpus_in_srat;
+static int acpi_early_node_map[NR_CPUS] __initdata = { NUMA_NO_NODE };
 
-struct __node_cpu_hwid {
-	u32 node_id;    /* logical node containing this CPU */
-	u64 cpu_hwid;   /* MPIDR for this CPU */
-};
+int __init acpi_numa_get_nid(unsigned int cpu)
+{
+	return acpi_early_node_map[cpu];
+}
+
+static inline int get_cpu_for_acpi_id(u32 uid)
+{
+	int cpu;
+
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+		if (uid == get_acpi_id_for_cpu(cpu))
+			return cpu;
 
-static struct __node_cpu_hwid early_node_cpu_hwid[NR_CPUS] = {
-[0 ... NR_CPUS - 1] = {NUMA_NO_NODE, PHYS_CPUID_INVALID} };
+	return -EINVAL;
+}
 
-int acpi_numa_get_nid(unsigned int cpu, u64 hwid)
+static int __init acpi_parse_gicc_pxm(struct acpi_subtable_header *header,
+				      const unsigned long end)
 {
-	int i;
+	struct acpi_srat_gicc_affinity *pa;
+	int cpu, pxm, node;
 
-	for (i = 0; i < cpus_in_srat; i++) {
-		if (hwid == early_node_cpu_hwid[i].cpu_hwid)
-			return early_node_cpu_hwid[i].node_id;
-	}
+	if (srat_disabled())
+		return -EINVAL;
+
+	pa = (struct acpi_srat_gicc_affinity *)header;
+	if (!pa)
+		return -EINVAL;
+
+	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
+		return 0;
 
-	return NUMA_NO_NODE;
+	pxm = pa->proximity_domain;
+	node = pxm_to_node(pxm);
+
+	/*
+	 * If we can't map the UID to a logical cpu this
+	 * means that the UID is not part of possible cpus
+	 * so we do not need a NUMA mapping for it, skip
+	 * the SRAT entry and keep parsing.
+	 */
+	cpu = get_cpu_for_acpi_id(pa->acpi_processor_uid);
+	if (cpu < 0)
+		return 0;
+
+	acpi_early_node_map[cpu] = node;
+	pr_info("SRAT: PXM %d -> MPIDR 0x%llx -> Node %d\n", pxm,
+		cpu_logical_map(cpu), node);
+
+	return 0;
+}
+
+void __init acpi_map_cpus_to_nodes(void)
+{
+	acpi_table_parse_entries(ACPI_SIG_SRAT, sizeof(struct acpi_table_srat),
+					    ACPI_SRAT_TYPE_GICC_AFFINITY,
+					    acpi_parse_gicc_pxm, 0);
 }
 
 /* Callback for Proximity Domain -> ACPI processor UID mapping */
 void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 {
 	int pxm, node;
-	phys_cpuid_t mpidr;
 
 	if (srat_disabled())
 		return;
@@ -70,12 +107,6 @@  void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 	if (!(pa->flags & ACPI_SRAT_GICC_ENABLED))
 		return;
 
-	if (cpus_in_srat >= NR_CPUS) {
-		pr_warn_once("SRAT: cpu_to_node_map[%d] is too small, may not be able to use all cpus\n",
-			     NR_CPUS);
-		return;
-	}
-
 	pxm = pa->proximity_domain;
 	node = acpi_map_pxm_to_node(pxm);
 
@@ -85,20 +116,7 @@  void __init acpi_numa_gicc_affinity_init(struct acpi_srat_gicc_affinity *pa)
 		return;
 	}
 
-	mpidr = acpi_map_madt_entry(pa->acpi_processor_uid);
-	if (mpidr == PHYS_CPUID_INVALID) {
-		pr_err("SRAT: PXM %d with ACPI ID %d has no valid MPIDR in MADT\n",
-			pxm, pa->acpi_processor_uid);
-		bad_srat();
-		return;
-	}
-
-	early_node_cpu_hwid[cpus_in_srat].node_id = node;
-	early_node_cpu_hwid[cpus_in_srat].cpu_hwid =  mpidr;
 	node_set(node, numa_nodes_parsed);
-	cpus_in_srat++;
-	pr_info("SRAT: PXM %d -> MPIDR 0x%Lx -> Node %d\n",
-		pxm, mpidr, node);
 }
 
 int __init arm64_acpi_numa_init(void)
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 2faa9863d2e5..64e3490749e5 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -518,7 +518,6 @@  acpi_map_gic_cpu_interface(struct acpi_madt_generic_interrupt *processor)
 		}
 		bootcpu_valid = true;
 		cpu_madt_gicc[0] = *processor;
-		early_map_cpu_to_node(0, acpi_numa_get_nid(0, hwid));
 		return;
 	}
 
@@ -541,8 +540,6 @@  acpi_map_gic_cpu_interface(struct acpi_madt_generic_interrupt *processor)
 	 */
 	acpi_set_mailbox_entry(cpu_count, processor);
 
-	early_map_cpu_to_node(cpu_count, acpi_numa_get_nid(cpu_count, hwid));
-
 	cpu_count++;
 }
 
@@ -562,8 +559,34 @@  acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header,
 
 	return 0;
 }
+
+static void __init acpi_parse_and_init_cpus(void)
+{
+	int i;
+
+	/*
+	 * do a walk of MADT to determine how many CPUs
+	 * we have including disabled CPUs, and get information
+	 * we need for SMP init.
+	 */
+	acpi_table_parse_madt(ACPI_MADT_TYPE_GENERIC_INTERRUPT,
+				      acpi_parse_gic_cpu_interface, 0);
+
+	/*
+	 * In ACPI, SMP and CPU NUMA information is provided in separate
+	 * static tables, namely the MADT and the SRAT.
+	 *
+	 * Thus, it is simpler to first create the cpu logical map through
+	 * an MADT walk and then map the logical cpus to their node ids
+	 * as separate steps.
+	 */
+	acpi_map_cpus_to_nodes();
+
+	for (i = 0; i < nr_cpu_ids; i++)
+		early_map_cpu_to_node(i, acpi_numa_get_nid(i));
+}
 #else
-#define acpi_table_parse_madt(...)	do { } while (0)
+#define acpi_parse_and_init_cpus(...)	do { } while (0)
 #endif
 
 /*
@@ -636,13 +659,7 @@  void __init smp_init_cpus(void)
 	if (acpi_disabled)
 		of_parse_and_init_cpus();
 	else
-		/*
-		 * do a walk of MADT to determine how many CPUs
-		 * we have including disabled CPUs, and get information
-		 * we need for SMP init
-		 */
-		acpi_table_parse_madt(ACPI_MADT_TYPE_GENERIC_INTERRUPT,
-				      acpi_parse_gic_cpu_interface, 0);
+		acpi_parse_and_init_cpus();
 
 	if (cpu_count > nr_cpu_ids)
 		pr_warn("Number of cores (%d) exceeds configured maximum of %u - clipping\n",