hw/arm/virt: Fix CPU's default NUMA node ID

Message ID	20220126052410.36380-1-gshan@redhat.com (mailing list archive)
State	New, archived
From: Gavin Shan <gshan@redhat.com>
To: qemu-arm@nongnu.org
Cc: peter.maydell@linaro.org, drjones@redhat.com, richard.henderson@linaro.org, qemu-devel@nongnu.org, shan.gavin@gmail.com
Subject: [PATCH] hw/arm/virt: Fix CPU's default NUMA node ID
Date: Wed, 26 Jan 2022 13:24:10 +0800

Gavin Shan Jan. 26, 2022, 5:24 a.m. UTC

The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
when it isn't provided explicitly. However, the default association doesn't
take the CPU topology fully into account, which causes "arch topology borken"
warnings when a Linux guest boots.

For example, the following warnings are observed when the Linux guest
is booted with the following command line.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host               \
  -cpu host                                               \
  -smp 6,sockets=2,cores=3,threads=1                      \
  -m 1024M,slots=16,maxmem=64G                            \
  -object memory-backend-ram,id=mem0,size=128M            \
  -object memory-backend-ram,id=mem1,size=128M            \
  -object memory-backend-ram,id=mem2,size=128M            \
  -object memory-backend-ram,id=mem3,size=128M            \
  -object memory-backend-ram,id=mem4,size=128M            \
  -object memory-backend-ram,id=mem5,size=384M            \
  -numa node,nodeid=0,memdev=mem0                         \
  -numa node,nodeid=1,memdev=mem1                         \
  -numa node,nodeid=2,memdev=mem2                         \
  -numa node,nodeid=3,memdev=mem3                         \
  -numa node,nodeid=4,memdev=mem4                         \
  -numa node,nodeid=5,memdev=mem5
         :
  alternatives: patching kernel code
  BUG: arch topology borken
  the CLS domain not a subset of the MC domain
  <the above error log repeats>
  BUG: arch topology borken
  the DIE domain not a subset of the NODE domain

With the current implementation of mc->get_default_cpu_node_id(), CPU#0 to
CPU#5 are associated with NODE#0 to NODE#5 respectively. That's incorrect
because CPU#0/1/2 should be associated with the same NUMA node, as they sit
in the same socket.

This fixes the issue by taking the socket into account when the default
CPU-to-NUMA association is given. With this applied, the Linux guest no
longer reports broken CPU topology. The 6 CPUs are associated with NODE#0/1,
and no CPUs are associated with NODE#2/3/4/5.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 hw/arm/virt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Andrew Jones Jan. 26, 2022, 7:41 a.m. UTC | #1

CCing Igor.

Thanks,
drew

On Wed, Jan 26, 2022 at 01:24:10PM +0800, Gavin Shan wrote:
> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
> when it isn't provided explicitly. However, the CPU topology isn't fully
> considered in the default association and it causes CPU topology broken
> warnings on booting Linux guest.
> 
> For example, the following warning messages are observed when the Linux guest
> is booted with the following command lines.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host               \
>   -cpu host                                               \
>   -smp 6,sockets=2,cores=3,threads=1                      \
>   -m 1024M,slots=16,maxmem=64G                            \
>   -object memory-backend-ram,id=mem0,size=128M            \
>   -object memory-backend-ram,id=mem1,size=128M            \
>   -object memory-backend-ram,id=mem2,size=128M            \
>   -object memory-backend-ram,id=mem3,size=128M            \
>   -object memory-backend-ram,id=mem4,size=128M            \
>   -object memory-backend-ram,id=mem5,size=384M            \
>   -numa node,nodeid=0,memdev=mem0                         \
>   -numa node,nodeid=1,memdev=mem1                         \
>   -numa node,nodeid=2,memdev=mem2                         \
>   -numa node,nodeid=3,memdev=mem3                         \
>   -numa node,nodeid=4,memdev=mem4                         \
>   -numa node,nodeid=5,memdev=mem5
>          :
>   alternatives: patching kernel code
>   BUG: arch topology borken
>   the CLS domain not a subset of the MC domain
>   <the above error log repeats>
>   BUG: arch topology borken
>   the DIE domain not a subset of the NODE domain
> 
> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
> are associated with NODE#0 to NODE#5 separately. That's incorrect because
> CPU#0/1/2 should be associated with same NUMA node because they're seated
> in same socket.
> 
> This fixes the issue by considering the socket when default CPU-to-NUMA
> is given. With this applied, no more CPU topology broken warnings are seen
> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
> no CPUs associated with NODE#2/3/4/5.
> 
> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  hw/arm/virt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 141350bf21..b4a95522d3 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>  
>  static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>  {
> -    return idx % ms->numa_state->num_nodes;
> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>  }
>  
>  static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
> -- 
> 2.23.0
>

Igor Mammedov Jan. 26, 2022, 9:14 a.m. UTC | #2

On Wed, 26 Jan 2022 13:24:10 +0800
Gavin Shan <gshan@redhat.com> wrote:

> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
> when it isn't provided explicitly. However, the CPU topology isn't fully
> considered in the default association and it causes CPU topology broken
> warnings on booting Linux guest.
> 
> For example, the following warning messages are observed when the Linux guest
> is booted with the following command lines.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host               \
>   -cpu host                                               \
>   -smp 6,sockets=2,cores=3,threads=1                      \
>   -m 1024M,slots=16,maxmem=64G                            \
>   -object memory-backend-ram,id=mem0,size=128M            \
>   -object memory-backend-ram,id=mem1,size=128M            \
>   -object memory-backend-ram,id=mem2,size=128M            \
>   -object memory-backend-ram,id=mem3,size=128M            \
>   -object memory-backend-ram,id=mem4,size=128M            \
>   -object memory-backend-ram,id=mem5,size=384M            \
>   -numa node,nodeid=0,memdev=mem0                         \
>   -numa node,nodeid=1,memdev=mem1                         \
>   -numa node,nodeid=2,memdev=mem2                         \
>   -numa node,nodeid=3,memdev=mem3                         \
>   -numa node,nodeid=4,memdev=mem4                         \
>   -numa node,nodeid=5,memdev=mem5
>          :
>   alternatives: patching kernel code
>   BUG: arch topology borken
>   the CLS domain not a subset of the MC domain
>   <the above error log repeats>
>   BUG: arch topology borken
>   the DIE domain not a subset of the NODE domain
> 
> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
> are associated with NODE#0 to NODE#5 separately. That's incorrect because
> CPU#0/1/2 should be associated with same NUMA node because they're seated
> in same socket.
> 
> This fixes the issue by considering the socket when default CPU-to-NUMA
> is given. With this applied, no more CPU topology broken warnings are seen
> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
> no CPUs associated with NODE#2/3/4/5.

From a migration point of view it looks fine to me, and it doesn't need a
compat knob since NUMA data (on virt-arm) is only used to construct ACPI
tables (and we don't version those unless something is broken by it).


> Signed-off-by: Gavin Shan <gshan@redhat.com>
> ---
>  hw/arm/virt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 141350bf21..b4a95522d3 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>  
>  static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>  {
> -    return idx % ms->numa_state->num_nodes;
> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);

I'd like the ARM folks to confirm whether the above is correct
(i.e. that a socket is a NUMA node boundary, and also whether the above
topo vars could have odd values. Don't take horribly complicated x86
as an example, but it showed that vendors could stash pretty much
anything there, so we should consider it here as well and maybe
forbid that in the smp virt-arm parser)

>  }
>  
>  static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)

wangyanan (Y) Jan. 28, 2022, 7:05 a.m. UTC | #3

Hi,

On 2022/1/26 17:14, Igor Mammedov wrote:
> On Wed, 26 Jan 2022 13:24:10 +0800
> Gavin Shan <gshan@redhat.com> wrote:
>
>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
>> when it isn't provided explicitly. However, the CPU topology isn't fully
>> considered in the default association and it causes CPU topology broken
>> warnings on booting Linux guest.
>>
>> For example, the following warning messages are observed when the Linux guest
>> is booted with the following command lines.
>>
>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>    -accel kvm -machine virt,gic-version=host               \
>>    -cpu host                                               \
>>    -smp 6,sockets=2,cores=3,threads=1                      \
>>    -m 1024M,slots=16,maxmem=64G                            \
>>    -object memory-backend-ram,id=mem0,size=128M            \
>>    -object memory-backend-ram,id=mem1,size=128M            \
>>    -object memory-backend-ram,id=mem2,size=128M            \
>>    -object memory-backend-ram,id=mem3,size=128M            \
>>    -object memory-backend-ram,id=mem4,size=128M            \
>>    -object memory-backend-ram,id=mem5,size=384M            \
>>    -numa node,nodeid=0,memdev=mem0                         \
>>    -numa node,nodeid=1,memdev=mem1                         \
>>    -numa node,nodeid=2,memdev=mem2                         \
>>    -numa node,nodeid=3,memdev=mem3                         \
>>    -numa node,nodeid=4,memdev=mem4                         \
>>    -numa node,nodeid=5,memdev=mem5
>>           :
>>    alternatives: patching kernel code
>>    BUG: arch topology borken
>>    the CLS domain not a subset of the MC domain
>>    <the above error log repeats>
>>    BUG: arch topology borken
>>    the DIE domain not a subset of the NODE domain
>>
>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
>> CPU#0/1/2 should be associated with same NUMA node because they're seated
>> in same socket.
>>
>> This fixes the issue by considering the socket when default CPU-to-NUMA
>> is given. With this applied, no more CPU topology broken warnings are seen
>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
>> no CPUs associated with NODE#2/3/4/5.
> From migration point of view it looks fine to me, and doesn't need a compat knob
> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
> version those unless something is broken by it).
>
>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/arm/virt.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>> index 141350bf21..b4a95522d3 100644
>> --- a/hw/arm/virt.c
>> +++ b/hw/arm/virt.c
>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>>   
>>   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>   {
>> -    return idx % ms->numa_state->num_nodes;
>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> I'd like for ARM folks to confirm whether above is correct
> (i.e. socket is NUMA node boundary and also if above topo vars
> could have odd values. Don't look at horribly complicated x86
> as example, but it showed that vendors could stash pretty much
> anything there, so we should consider it here as well and maybe
> forbid that in smp virt-arm parser)

We now have a generic smp parser in machine-smp.c, and it guarantees every
machine board a consistent set of topo vars: the supported ones are valid
and the unsupported ones are 1. I think it's safe to use them here. Or am I
missing something else?

Also, we may not need to include "dies" here because it's not supported
on the ARM virt machine. I believe we will always have "ms->smp.dies==1"
for this machine.
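
To make the point about "dies" concrete, here is a minimal sketch. The struct smp_topology and both helper names are hypothetical stand-ins for the ms->smp fields, not QEMU code. It only shows that a factor guaranteed to be 1 leaves the per-socket CPU count unchanged.

```c
#include <assert.h>

/* Hypothetical mirror of the ms->smp fields used in the patch. */
struct smp_topology {
    unsigned dies;
    unsigned clusters;
    unsigned cores;
    unsigned threads;
};

/* Number of CPUs in one socket, as computed in the patch. */
static unsigned cpus_per_socket(const struct smp_topology *smp)
{
    return smp->dies * smp->clusters * smp->cores * smp->threads;
}

/* Same computation with the "dies" factor dropped. */
static unsigned cpus_per_socket_no_dies(const struct smp_topology *smp)
{
    return smp->clusters * smp->cores * smp->threads;
}

/* Both helpers agree whenever dies == 1 (illustrative check only). */
static void check_dies_is_neutral(void)
{
    const struct smp_topology smp = { 1, 1, 3, 1 };

    assert(cpus_per_socket(&smp) == cpus_per_socket_no_dies(&smp));
}
```

Under that assumption, keeping or dropping the factor is a question of future-proofing rather than correctness.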

Thanks,
Yanan
>>   }
>>   
>>   static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
>
> .

Peter Maydell Feb. 8, 2022, 2:49 p.m. UTC | #4

On Wed, 26 Jan 2022 at 09:14, Igor Mammedov <imammedo@redhat.com> wrote:
>
> On Wed, 26 Jan 2022 13:24:10 +0800
> Gavin Shan <gshan@redhat.com> wrote:
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 141350bf21..b4a95522d3 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
> >
> >  static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> >  {
> > -    return idx % ms->numa_state->num_nodes;
> > +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>
> I'd like for ARM folks to confirm whether above is correct
> (i.e. socket is NUMA node boundary and also if above topo vars
> could have odd values. Don't look at horribly complicated x86
> as example, but it showed that vendors could stash pretty much
> anything there, so we should consider it here as well and maybe
> forbid that in smp virt-arm parser)

Is there anybody on the CC list who can answer this definitively?
Certainly I have no idea about this virtual topology stuff --
from my point of view I just want VMs to be able to have
multiple CPUs and I don't know anything about how real hardware
might choose to do NUMA topology either now or in future...

Put another way: this patch isn't on my list to do anything with;
please ping me when a decision has been made about whether it should
be applied or not.

thanks
-- PMM

Gavin Shan Feb. 15, 2022, 8:19 a.m. UTC | #5

On 1/28/22 3:05 PM, wangyanan (Y) wrote:
> On 2022/1/26 17:14, Igor Mammedov wrote:
>> On Wed, 26 Jan 2022 13:24:10 +0800
>> Gavin Shan <gshan@redhat.com> wrote:
>>
>>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
>>> when it isn't provided explicitly. However, the CPU topology isn't fully
>>> considered in the default association and it causes CPU topology broken
>>> warnings on booting Linux guest.
>>>
>>> For example, the following warning messages are observed when the Linux guest
>>> is booted with the following command lines.
>>>
>>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>    -accel kvm -machine virt,gic-version=host               \
>>>    -cpu host                                               \
>>>    -smp 6,sockets=2,cores=3,threads=1                      \
>>>    -m 1024M,slots=16,maxmem=64G                            \
>>>    -object memory-backend-ram,id=mem0,size=128M            \
>>>    -object memory-backend-ram,id=mem1,size=128M            \
>>>    -object memory-backend-ram,id=mem2,size=128M            \
>>>    -object memory-backend-ram,id=mem3,size=128M            \
>>>    -object memory-backend-ram,id=mem4,size=128M            \
>>>    -object memory-backend-ram,id=mem5,size=384M            \
>>>    -numa node,nodeid=0,memdev=mem0                         \
>>>    -numa node,nodeid=1,memdev=mem1                         \
>>>    -numa node,nodeid=2,memdev=mem2                         \
>>>    -numa node,nodeid=3,memdev=mem3                         \
>>>    -numa node,nodeid=4,memdev=mem4                         \
>>>    -numa node,nodeid=5,memdev=mem5
>>>           :
>>>    alternatives: patching kernel code
>>>    BUG: arch topology borken
>>>    the CLS domain not a subset of the MC domain
>>>    <the above error log repeats>
>>>    BUG: arch topology borken
>>>    the DIE domain not a subset of the NODE domain
>>>
>>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
>>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
>>> CPU#0/1/2 should be associated with same NUMA node because they're seated
>>> in same socket.
>>>
>>> This fixes the issue by considering the socket when default CPU-to-NUMA
>>> is given. With this applied, no more CPU topology broken warnings are seen
>>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
>>> no CPUs associated with NODE#2/3/4/5.
>> From migration point of view it looks fine to me, and doesn't need a compat knob
>> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
>> version those unless something is broken by it).
>>
>>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>> ---
>>>   hw/arm/virt.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>> index 141350bf21..b4a95522d3 100644
>>> --- a/hw/arm/virt.c
>>> +++ b/hw/arm/virt.c
>>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>>>   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>   {
>>> -    return idx % ms->numa_state->num_nodes;
>>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>> I'd like for ARM folks to confirm whether above is correct
>> (i.e. socket is NUMA node boundary and also if above topo vars
>> could have odd values. Don't look at horribly complicated x86
>> as example, but it showed that vendors could stash pretty much
>> anything there, so we should consider it here as well and maybe
>> forbid that in smp virt-arm parser)
> We now have a generic smp parser in machine-smp.c and it guarantees
> different machine boards a correct group of topo vars: supported topo
> vars being valid and value of unsupported ones being 1. I think it's safe
> to use them here. Or am I missing something else?
> 
> Also, we may not need to include "dies" here because it's not supported
> on ARM virt machine. I believe we will always have "ms->smp.dies==1"
> for this machine.
> 

Sorry for the delayed response; I'm just back from a two-week holiday.

The issue isn't directly related to the CPU topology. It's related to the
fact that a default NUMA node ID is picked for a CPU when the user doesn't
provide its NUMA node ID explicitly. So it's about the CPU-to-NUMA
association.

For example, without the code change in this patch, the CPU-to-NUMA
association breaks the socket boundary when the guest is booted with a
command line like the one below. With this patch applied, the CPU-to-NUMA
association follows the socket boundary, which keeps the Linux guest happy.

     -smp 6,sockets=2,cores=3,threads=1                      \
     -m 1024M,slots=16,maxmem=64G                            \
     -object memory-backend-ram,id=mem0,size=128M            \
     -object memory-backend-ram,id=mem1,size=128M            \
     -object memory-backend-ram,id=mem2,size=128M            \
     -object memory-backend-ram,id=mem3,size=128M            \
     -object memory-backend-ram,id=mem4,size=128M            \
     -object memory-backend-ram,id=mem5,size=384M            \
     -numa node,nodeid=0,memdev=mem0                         \
     -numa node,nodeid=1,memdev=mem1                         \
     -numa node,nodeid=2,memdev=mem2                         \
     -numa node,nodeid=3,memdev=mem3                         \
     -numa node,nodeid=4,memdev=mem4                         \
     -numa node,nodeid=5,memdev=mem5

     CPU     Core      Socket        NUMA-Node     NUMA-Node-with-patch
     ------------------------------------------------------------------
      0       0          0             0           0
      1       1          0             1           0
      2       2          0             2           0
      3       0          1             3           1
      4       1          1             4           1
      5       2          1             5           1
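
The two NUMA-node columns above can be reproduced with a small sketch. The struct smp_topology and the function names are hypothetical stand-ins for ms->smp and the virt machine callback. The two formulas are the ones removed and added by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the SMP fields of MachineState. */
struct smp_topology {
    unsigned sockets;
    unsigned dies;
    unsigned clusters;
    unsigned cores;
    unsigned threads;
};

/* Old default: round-robin over all NUMA nodes. */
static int64_t node_id_round_robin(int idx, int num_nodes)
{
    return idx % num_nodes;
}

/* Patched default: all CPUs of one socket share one node. */
static int64_t node_id_per_socket(const struct smp_topology *smp, int idx)
{
    return idx / (smp->dies * smp->clusters * smp->cores * smp->threads);
}

/* Check both columns for -smp 6,sockets=2,cores=3,threads=1 and 6 nodes. */
static void check_table(void)
{
    const struct smp_topology smp = { 2, 1, 1, 3, 1 };

    for (int cpu = 0; cpu < 6; cpu++) {
        assert(node_id_round_robin(cpu, 6) == cpu);       /* NUMA-Node */
        assert(node_id_per_socket(&smp, cpu) == cpu / 3); /* with patch */
    }
}
```

Calling check_table() from a main() exercises every row of the table.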

I think it's fine to include "ms->smp.dies" here since it's always 1 on the
virt machine. That way the code needn't change if dies are supported some day.

>>>   }
>>>   static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)

Thanks,
Gavin

Andrew Jones Feb. 15, 2022, 8:32 a.m. UTC | #6

On Tue, Feb 15, 2022 at 04:19:01PM +0800, Gavin Shan wrote:
> The issue isn't related to CPU topology directly. It's actually related
> to the fact: the default NUMA node ID will be picked for one particular
> CPU if the associated NUMA node ID isn't provided by users explicitly.
> So it's related to the CPU-to-NUMA association.
> 
> For example, the CPU-to-NUMA association is breaking socket boundary
> without the code change included in this patch when the guest is booted
> with the command lines like below. With this patch applied, the CPU-to-NUMA
> association is following socket boundary, to make Linux guest happy.

Gavin,

Please look at Igor's request for more information. Are we sure that a
socket is a NUMA node boundary? Are we sure we can assume an even
distribution for sockets to nodes or nodes to sockets? If so, where is
that documented?

Thanks,
drew

Gavin Shan Feb. 16, 2022, 10:58 a.m. UTC | #7

On 2/15/22 4:32 PM, Andrew Jones wrote:
> On Tue, Feb 15, 2022 at 04:19:01PM +0800, Gavin Shan wrote:
>> The issue isn't related to CPU topology directly. It's actually related
>> to the fact: the default NUMA node ID will be picked for one particular
>> CPU if the associated NUMA node ID isn't provided by users explicitly.
>> So it's related to the CPU-to-NUMA association.
>>
>> For example, the CPU-to-NUMA association is breaking socket boundary
>> without the code change included in this patch when the guest is booted
>> with the command lines like below. With this patch applied, the CPU-to-NUMA
>> association is following socket boundary, to make Linux guest happy.
> 
> Gavin,
> 
> Please look at Igor's request for more information. Are we sure that a
> socket is a NUMA node boundary? Are we sure we can assume an even
> distribution for sockets to nodes or nodes to sockets? If so, where is
> that documented?
> 

Yes, I was looking into the code for Igor's questions, but I hadn't
reached a conclusion when I replied to Yanan. I will reply on Igor's
thread and let's discuss it there.

Thanks,
Gavin

Gavin Shan Feb. 17, 2022, 2:14 a.m. UTC | #8

On 1/26/22 5:14 PM, Igor Mammedov wrote:
> On Wed, 26 Jan 2022 13:24:10 +0800
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
>> when it isn't provided explicitly. However, the CPU topology isn't fully
>> considered in the default association and it causes CPU topology broken
>> warnings on booting Linux guest.
>>
>> For example, the following warning messages are observed when the Linux guest
>> is booted with the following command lines.
>>
>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>    -accel kvm -machine virt,gic-version=host               \
>>    -cpu host                                               \
>>    -smp 6,sockets=2,cores=3,threads=1                      \
>>    -m 1024M,slots=16,maxmem=64G                            \
>>    -object memory-backend-ram,id=mem0,size=128M            \
>>    -object memory-backend-ram,id=mem1,size=128M            \
>>    -object memory-backend-ram,id=mem2,size=128M            \
>>    -object memory-backend-ram,id=mem3,size=128M            \
>>    -object memory-backend-ram,id=mem4,size=128M            \
>>    -object memory-backend-ram,id=mem5,size=384M            \
>>    -numa node,nodeid=0,memdev=mem0                         \
>>    -numa node,nodeid=1,memdev=mem1                         \
>>    -numa node,nodeid=2,memdev=mem2                         \
>>    -numa node,nodeid=3,memdev=mem3                         \
>>    -numa node,nodeid=4,memdev=mem4                         \
>>    -numa node,nodeid=5,memdev=mem5
>>           :
>>    alternatives: patching kernel code
>>    BUG: arch topology borken
>>    the CLS domain not a subset of the MC domain
>>    <the above error log repeats>
>>    BUG: arch topology borken
>>    the DIE domain not a subset of the NODE domain
>>
>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
>> CPU#0/1/2 should be associated with same NUMA node because they're seated
>> in same socket.
>>
>> This fixes the issue by considering the socket when default CPU-to-NUMA
>> is given. With this applied, no more CPU topology broken warnings are seen
>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
>> no CPUs associated with NODE#2/3/4/5.
> 
> From migration point of view it looks fine to me, and doesn't need a compat knob
> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
> version those unless something is broken by it).
> 
> 
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   hw/arm/virt.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>> index 141350bf21..b4a95522d3 100644
>> --- a/hw/arm/virt.c
>> +++ b/hw/arm/virt.c
>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>>   
>>   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>   {
>> -    return idx % ms->numa_state->num_nodes;
>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> 
> I'd like for ARM folks to confirm whether above is correct
> (i.e. socket is NUMA node boundary and also if above topo vars
> could have odd values. Don't look at horribly complicated x86
> as example, but it showed that vendors could stash pretty much
> anything there, so we should consider it here as well and maybe
> forbid that in smp virt-arm parser)
> 

After some investigation, I don't think the socket is a NUMA node boundary.
Unfortunately, I couldn't find this documented anywhere after checking the
device-tree specification and the Linux CPU topology and NUMA binding documents.

However, according to the Linux (guest) kernel code there are two options here:
(A) the socket is the NUMA node boundary, or (B) the CPU die is the NUMA node
boundary. They are equivalent because CPU dies aren't supported on the arm/virt
machine. Besides, a one-to-one association between socket and NUMA node is
natural and simple. So I think (A) is the best way to go.

Another thing I want to explain here is how the change affects memory
allocation in the Linux guest. Taking the command line in the commit log as
an example, the first two NUMA nodes are bound to CPUs while the other 4 NUMA
nodes are regarded as remote to the CPUs. A remote NUMA node won't serve
memory allocations until the memory in the near (local) NUMA node is
exhausted. However, it's uncertain where the memory is hosted if memory
binding isn't applied.

Besides, I think the code should be improved as below to avoid overflowing
ms->numa_state->num_nodes, i.e. returning an ID for a node that doesn't exist.

  static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
  {
-    return idx % ms->numa_state->num_nodes;
+    int node_idx;
+
+    node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
+    return node_idx % ms->numa_state->num_nodes;
  }
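
As an illustration of what the extra modulo does, here is a minimal sketch. The struct smp_topology and the function name are hypothetical. The configuration of 3 sockets with 2 cores each and only 2 NUMA nodes is an assumed example, not one taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the ms->smp fields used above. */
struct smp_topology {
    unsigned dies;
    unsigned clusters;
    unsigned cores;
    unsigned threads;
};

/* Socket-based node ID, wrapped so that it stays below num_nodes. */
static int64_t node_id_wrapped(const struct smp_topology *smp,
                               int num_nodes, int idx)
{
    int node_idx;

    node_idx = idx / (smp->dies * smp->clusters * smp->cores * smp->threads);
    return node_idx % num_nodes;
}

/* Assumed example: 3 sockets x 2 cores, but only 2 NUMA nodes. */
static void check_wraparound(void)
{
    const struct smp_topology smp = { 1, 1, 2, 1 };

    assert(node_id_wrapped(&smp, 2, 0) == 0); /* socket 0 -> node 0 */
    assert(node_id_wrapped(&smp, 2, 2) == 1); /* socket 1 -> node 1 */
    assert(node_id_wrapped(&smp, 2, 4) == 0); /* socket 2 wraps to node 0 */
}
```

Without the modulo, CPUs 4 and 5 in this example would be given node 2, which doesn't exist when only two nodes are configured.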


>>   }
>>   
>>   static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
> 

Thanks,
Gavin

Gavin Shan Feb. 25, 2022, 8:41 a.m. UTC | #9

Hi Igor,

On 2/17/22 10:14 AM, Gavin Shan wrote:
> On 1/26/22 5:14 PM, Igor Mammedov wrote:
>> On Wed, 26 Jan 2022 13:24:10 +0800
>> Gavin Shan <gshan@redhat.com> wrote:
>>
>>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
>>> when it isn't provided explicitly. However, the CPU topology isn't fully
>>> considered in the default association and it causes CPU topology broken
>>> warnings on booting Linux guest.
>>>
>>> For example, the following warning messages are observed when the Linux guest
>>> is booted with the following command lines.
>>>
>>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>    -accel kvm -machine virt,gic-version=host               \
>>>    -cpu host                                               \
>>>    -smp 6,sockets=2,cores=3,threads=1                      \
>>>    -m 1024M,slots=16,maxmem=64G                            \
>>>    -object memory-backend-ram,id=mem0,size=128M            \
>>>    -object memory-backend-ram,id=mem1,size=128M            \
>>>    -object memory-backend-ram,id=mem2,size=128M            \
>>>    -object memory-backend-ram,id=mem3,size=128M            \
>>>    -object memory-backend-ram,id=mem4,size=128M            \
>>>    -object memory-backend-ram,id=mem5,size=384M            \
>>>    -numa node,nodeid=0,memdev=mem0                         \
>>>    -numa node,nodeid=1,memdev=mem1                         \
>>>    -numa node,nodeid=2,memdev=mem2                         \
>>>    -numa node,nodeid=3,memdev=mem3                         \
>>>    -numa node,nodeid=4,memdev=mem4                         \
>>>    -numa node,nodeid=5,memdev=mem5
>>>           :
>>>    alternatives: patching kernel code
>>>    BUG: arch topology borken
>>>    the CLS domain not a subset of the MC domain
>>>    <the above error log repeats>
>>>    BUG: arch topology borken
>>>    the DIE domain not a subset of the NODE domain
>>>
>>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
>>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
>>> CPU#0/1/2 should be associated with same NUMA node because they're seated
>>> in same socket.
>>>
>>> This fixes the issue by considering the socket when default CPU-to-NUMA
>>> is given. With this applied, no more CPU topology broken warnings are seen
>>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
>>> no CPUs associated with NODE#2/3/4/5.
>>
>>> From migration point of view it looks fine to me, and doesn't need a compat knob
>> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
>> version those unless something is broken by it).
>>
>>
>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>> ---
>>>   hw/arm/virt.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>> index 141350bf21..b4a95522d3 100644
>>> --- a/hw/arm/virt.c
>>> +++ b/hw/arm/virt.c
>>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>>>   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>   {
>>> -    return idx % ms->numa_state->num_nodes;
>>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>>
>> I'd like for ARM folks to confirm whether above is correct
>> (i.e. socket is NUMA node boundary and also if above topo vars
>> could have odd values. Don't look at horribly complicated x86
>> as example, but it showed that vendors could stash pretty much
>> anything there, so we should consider it here as well and maybe
>> forbid that in smp virt-arm parser)
>>
> 
> After doing some investigation, I don't think the socket is NUMA node boundary.
> Unfortunately, I didn't find it's documented like this in any documents after
> checking device-tree specification, Linux CPU topology and NUMA binding documents.
> 
> However, there are two options here according to Linux (guest) kernel code:
> (A) socket is NUMA node boundary  (B) CPU die is NUMA node boundary. They are
> equivalent as CPU die isn't supported on arm/virt machine. Besides, the topology
> of one-to-one association between socket and NUMA node sounds natural and simplified.
> So I think (A) is the best way to go.
> 
> Another thing I want to explain here is how the changes affect the memory
> allocation in Linux guest. Taking the command lines included in the commit
> log as an example, the first two NUMA nodes are bound to CPUs while the other
> 4 NUMA nodes are regarded as remote NUMA nodes to CPUs. The remote NUMA node
> won't accommodate the memory allocation until the memory in the near (local)
> NUMA node becomes exhausted. However, it's uncertain how the memory is hosted
> if memory binding isn't applied.
> 
> Besides, I think the code should be improved like below to avoid overflow on
> ms->numa_state->num_nodes.
> 
>   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>   {
> -    return idx % ms->numa_state->num_nodes;
> +    int node_idx;
> +
> +    node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> +    return node_idx % ms->numa_state->num_nodes;
>   }
> 
> 

Kindly ping...

>>>   }
>>>   static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
>>

Thanks,
Gavin

Igor Mammedov Feb. 25, 2022, 10:03 a.m. UTC | #10

On Fri, 25 Feb 2022 16:41:43 +0800
Gavin Shan <gshan@redhat.com> wrote:

> Hi Igor,
> 
> On 2/17/22 10:14 AM, Gavin Shan wrote:
> > On 1/26/22 5:14 PM, Igor Mammedov wrote:  
> >> On Wed, 26 Jan 2022 13:24:10 +0800
> >> Gavin Shan <gshan@redhat.com> wrote:
> >>  
> >>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
> >>> when it isn't provided explicitly. However, the CPU topology isn't fully
> >>> considered in the default association and it causes CPU topology broken
> >>> warnings on booting Linux guest.
> >>>
> >>> For example, the following warning messages are observed when the Linux guest
> >>> is booted with the following command lines.
> >>>
> >>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >>>    -accel kvm -machine virt,gic-version=host               \
> >>>    -cpu host                                               \
> >>>    -smp 6,sockets=2,cores=3,threads=1                      \
> >>>    -m 1024M,slots=16,maxmem=64G                            \
> >>>    -object memory-backend-ram,id=mem0,size=128M            \
> >>>    -object memory-backend-ram,id=mem1,size=128M            \
> >>>    -object memory-backend-ram,id=mem2,size=128M            \
> >>>    -object memory-backend-ram,id=mem3,size=128M            \
> >>>    -object memory-backend-ram,id=mem4,size=128M            \
> >>>    -object memory-backend-ram,id=mem5,size=384M            \
> >>>    -numa node,nodeid=0,memdev=mem0                         \
> >>>    -numa node,nodeid=1,memdev=mem1                         \
> >>>    -numa node,nodeid=2,memdev=mem2                         \
> >>>    -numa node,nodeid=3,memdev=mem3                         \
> >>>    -numa node,nodeid=4,memdev=mem4                         \
> >>>    -numa node,nodeid=5,memdev=mem5
> >>>           :
> >>>    alternatives: patching kernel code
> >>>    BUG: arch topology borken
> >>>    the CLS domain not a subset of the MC domain
> >>>    <the above error log repeats>
> >>>    BUG: arch topology borken
> >>>    the DIE domain not a subset of the NODE domain
> >>>
> >>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
> >>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
> >>> CPU#0/1/2 should be associated with same NUMA node because they're seated
> >>> in same socket.
> >>>
> >>> This fixes the issue by considering the socket when default CPU-to-NUMA
> >>> is given. With this applied, no more CPU topology broken warnings are seen
> >>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
> >>> no CPUs associated with NODE#2/3/4/5.  
> >>  
> >>> From migration point of view it looks fine to me, and doesn't need a compat knob  
> >> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
> >> version those unless something is broken by it).
> >>
> >>  
> >>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >>> ---
> >>>   hw/arm/virt.c | 2 +-
> >>>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> >>> index 141350bf21..b4a95522d3 100644
> >>> --- a/hw/arm/virt.c
> >>> +++ b/hw/arm/virt.c
> >>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
> >>>   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> >>>   {
> >>> -    return idx % ms->numa_state->num_nodes;
> >>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);  
> >>
> >> I'd like for ARM folks to confirm whether above is correct
> >> (i.e. socket is NUMA node boundary and also if above topo vars
> >> could have odd values. Don't look at horribly complicated x86
> >> as example, but it showed that vendors could stash pretty much
> >> anything there, so we should consider it here as well and maybe
> >> forbid that in smp virt-arm parser)
> >>  
> > 
> > After doing some investigation, I don't think the socket is NUMA node boundary.
> > Unfortunately, I didn't find it's documented like this in any documents after
> > checking device-tree specification, Linux CPU topology and NUMA binding documents.
> > 
> > However, there are two options here according to Linux (guest) kernel code:
> > (A) socket is NUMA node boundary  (B) CPU die is NUMA node boundary. They are
> > equivalent as CPU die isn't supported on arm/virt machine. Besides, the topology
> > of one-to-one association between socket and NUMA node sounds natural and simplified.
> > So I think (A) is the best way to go.
> > 
> > Another thing I want to explain here is how the changes affect the memory
> > allocation in Linux guest. Taking the command lines included in the commit
> > log as an example, the first two NUMA nodes are bound to CPUs while the other
> > 4 NUMA nodes are regarded as remote NUMA nodes to CPUs. The remote NUMA node
> > won't accommodate the memory allocation until the memory in the near (local)
> > NUMA node becomes exhausted. However, it's uncertain how the memory is hosted
> > if memory binding isn't applied.
> > 
> > Besides, I think the code should be improved like below to avoid overflow on
> > ms->numa_state->num_nodes.
> > 
> >   static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> >   {
> > -    return idx % ms->numa_state->num_nodes;
> > +    int node_idx;
> > +
> > +    node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> > +    return node_idx % ms->numa_state->num_nodes;

Using idx directly to deduce the node looks a bit iffy.
Take x86_get_default_cpu_node_id() as an example:
it uses idx to pick the arch_id (APIC ID),
which has the topology encoded into it, and then translates
that to the node boundary (pkg_id -> socket).

Probably the same should happen here.

PS:
This may be a little tangential to the topic, but the chunk above
mentions dies/clusters/cores/threads as possible attributes
for CPUs, while virt_possible_cpu_arch_ids() says that only
has_thread_id = true
is supported, which looks broken to me.
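
A simplified illustration of that idea, loosely modeled on how an x86
APIC ID packs topology into bit fields. It is not the actual QEMU x86
code: the names topo, bits_for(), make_arch_id(), pkg_from_arch_id() and
default_node_id() are hypothetical, and only the socket/core/thread
levels are modeled.

```c
#include <stdint.h>

/* Hypothetical, reduced topology description (per-socket counts). */
struct topo {
    unsigned cores;
    unsigned threads;
};

/* Number of bits needed to hold the values 0..count-1. */
static unsigned bits_for(unsigned count)
{
    unsigned bits = 0;

    while ((1u << bits) < count) {
        bits++;
    }
    return bits;
}

/* Pack (socket, core, thread) into one ID, each level in its own bit field. */
static uint32_t make_arch_id(const struct topo *t, unsigned idx)
{
    unsigned thread = idx % t->threads;
    unsigned core = (idx / t->threads) % t->cores;
    unsigned socket = idx / (t->threads * t->cores);
    unsigned tbits = bits_for(t->threads);
    unsigned cbits = bits_for(t->cores);

    return (socket << (tbits + cbits)) | (core << tbits) | thread;
}

/* Recover the package (socket) index from the packed ID. */
static unsigned pkg_from_arch_id(const struct topo *t, uint32_t arch_id)
{
    return arch_id >> (bits_for(t->threads) + bits_for(t->cores));
}

/* idx -> arch_id -> package -> NUMA node, wrapped on the node count. */
static int64_t default_node_id(const struct topo *t, unsigned idx,
                               unsigned num_nodes)
{
    return pkg_from_arch_id(t, make_arch_id(t, idx)) % num_nodes;
}
```

With cores=3 the core field takes 2 bits, so the six IDs in this sketch
are 0,1,2,4,5,6. The ID is not the same as idx, which is why the socket
is extracted from the ID instead of being computed from idx.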

> >   }
> > 
> >   
> 
> Kindly ping...
> 
> >>>   }
> >>>   static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)  
> >>  
> 
> Thanks,
> Gavin
>

Gavin Shan Feb. 28, 2022, 4:26 a.m. UTC | #11

Hi Igor,

On 2/25/22 6:03 PM, Igor Mammedov wrote:
> On Fri, 25 Feb 2022 16:41:43 +0800
> Gavin Shan <gshan@redhat.com> wrote:
>> On 2/17/22 10:14 AM, Gavin Shan wrote:
>>> On 1/26/22 5:14 PM, Igor Mammedov wrote:
>>>> On Wed, 26 Jan 2022 13:24:10 +0800
>>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>   
>>>>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
>>>>> when it isn't provided explicitly. However, the CPU topology isn't fully
>>>>> considered in the default association and it causes CPU topology broken
>>>>> warnings on booting Linux guest.
>>>>>
>>>>> For example, the following warning messages are observed when the Linux guest
>>>>> is booted with the following command lines.
>>>>>
>>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>>>     -accel kvm -machine virt,gic-version=host               \
>>>>>     -cpu host                                               \
>>>>>     -smp 6,sockets=2,cores=3,threads=1                      \
>>>>>     -m 1024M,slots=16,maxmem=64G                            \
>>>>>     -object memory-backend-ram,id=mem0,size=128M            \
>>>>>     -object memory-backend-ram,id=mem1,size=128M            \
>>>>>     -object memory-backend-ram,id=mem2,size=128M            \
>>>>>     -object memory-backend-ram,id=mem3,size=128M            \
>>>>>     -object memory-backend-ram,id=mem4,size=128M            \
>>>>>     -object memory-backend-ram,id=mem5,size=384M            \
>>>>>     -numa node,nodeid=0,memdev=mem0                         \
>>>>>     -numa node,nodeid=1,memdev=mem1                         \
>>>>>     -numa node,nodeid=2,memdev=mem2                         \
>>>>>     -numa node,nodeid=3,memdev=mem3                         \
>>>>>     -numa node,nodeid=4,memdev=mem4                         \
>>>>>     -numa node,nodeid=5,memdev=mem5
>>>>>            :
>>>>>     alternatives: patching kernel code
>>>>>     BUG: arch topology borken
>>>>>     the CLS domain not a subset of the MC domain
>>>>>     <the above error log repeats>
>>>>>     BUG: arch topology borken
>>>>>     the DIE domain not a subset of the NODE domain
>>>>>
>>>>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
>>>>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
>>>>> CPU#0/1/2 should be associated with same NUMA node because they're seated
>>>>> in same socket.
>>>>>
>>>>> This fixes the issue by considering the socket when default CPU-to-NUMA
>>>>> is given. With this applied, no more CPU topology broken warnings are seen
>>>>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
>>>>> no CPUs associated with NODE#2/3/4/5.
>>>>   
>>>>>  From migration point of view it looks fine to me, and doesn't need a compat knob
>>>> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
>>>> version those unless something is broken by it).
>>>>
>>>>   
>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>> ---
>>>>>    hw/arm/virt.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>>>> index 141350bf21..b4a95522d3 100644
>>>>> --- a/hw/arm/virt.c
>>>>> +++ b/hw/arm/virt.c
>>>>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>>>>>    static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>>>    {
>>>>> -    return idx % ms->numa_state->num_nodes;
>>>>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>>>>
>>>> I'd like for ARM folks to confirm whether above is correct
>>>> (i.e. socket is NUMA node boundary and also if above topo vars
>>>> could have odd values. Don't look at horribly complicated x86
>>>> as example, but it showed that vendors could stash pretty much
>>>> anything there, so we should consider it here as well and maybe
>>>> forbid that in smp virt-arm parser)
>>>>   
>>>
>>> After doing some investigation, I don't think the socket is NUMA node boundary.
>>> Unfortunately, I didn't find it's documented like this in any documents after
>>> checking device-tree specification, Linux CPU topology and NUMA binding documents.
>>>
>>> However, there are two options here according to Linux (guest) kernel code:
>>> (A) socket is NUMA node boundary  (B) CPU die is NUMA node boundary. They are
>>> equivalent as CPU die isn't supported on arm/virt machine. Besides, the topology
>>> of one-to-one association between socket and NUMA node sounds natural and simplified.
>>> So I think (A) is the best way to go.
>>>
>>> Another thing I want to explain here is how the changes affect the memory
>>> allocation in Linux guest. Taking the command lines included in the commit
>>> log as an example, the first two NUMA nodes are bound to CPUs while the other
>>> 4 NUMA nodes are regarded as remote NUMA nodes to CPUs. The remote NUMA node
>>> won't accommodate the memory allocation until the memory in the near (local)
>>> NUMA node becomes exhausted. However, it's uncertain how the memory is hosted
>>> if memory binding isn't applied.
>>>
>>> Besides, I think the code should be improved like below to avoid overflow on
>>> ms->numa_state->num_nodes.
>>>
>>>    static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>    {
>>> -    return idx % ms->numa_state->num_nodes;
>>> +    int node_idx;
>>> +
>>> +    node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>>> +    return node_idx % ms->numa_state->num_nodes;
> 
> using idx directly to deduce node looks a bit iffy
> take x86_get_default_cpu_node_id() as an example,
> it translates it uses idx to pick arch_id (APIC ID)
> which has topology encoded into it and than translates
> that to node boundary (pkg_id -> socket)
> 
> Probably the same should happen here.
> 
> PS:
> may be a little on tangent to the topic but chunk above
> mentions dies/clusters/cores/threads as possible attributes
> for CPUs but virt_possible_cpu_arch_ids() says that only
> has_thread_id = true
> are supported, which looks broken to me.
> 

The x86 APIC ID, where the CPU topology is encoded, would be the ideal
model for arm/virt to follow here. However, with the CPU topology embedded
in struct CPUArchId::arch_id, ACPI tables such as MADT, SRAT and PPTT
would need corresponding updates to expose it through the "ACPI Processor
UID" field. That's much more than what is needed to fix the issue here,
and I don't see additional benefits in doing it.

Besides, the package or socket index is exactly determined by 'idx'
on arm/virt. The CPU topology is exposed through ACPI PPTT, depending on
ms->smp. The threads/cores/clusters/dies in that struct determine
the indexes of the threads ('idx') that belong to the same socket.

Yes, the information maintained in ms->possible_cpus->cpus[i].props
is inconsistent with ms->smp. ms->possible_cpus->cpus[i].props.thread_id
is actually the "CPU index" or "vcpu index", not the "thread ID". Other
fields like the socket/die/cluster/core IDs in
ms->possible_cpus->cpus[i].props are never used on arm/virt. However, the
code changes included in this patch to fix the broken CPU topology issue
are still correct. Alternatively, we can enhance them like below, in case
'has_socket_id' carries the correct information in the future.

static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
{
-    return idx % ms->numa_state->num_nodes;
+    int socket_id;
+
+    if (ms->possible_cpus->cpus[idx].props.has_socket_id) {
+        socket_id = ms->possible_cpus->cpus[idx].props.socket_id;
+    } else {
+        socket_id = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
+    }
+
+    return socket_id % ms->numa_state->num_nodes;
}
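
To make the arithmetic concrete, here is a minimal standalone sketch (not
QEMU code) of the old and the proposed mappings for the
"-smp 6,sockets=2,cores=3,threads=1" example with 6 NUMA nodes. The type
smp_topo and the functions node_id_old() and node_id_new() are made up
for illustration, and dies and clusters are assumed to be 1.

```c
#include <stdint.h>

/* Hypothetical stand-in for the ms->smp fields used by the patch. */
struct smp_topo {
    int dies;
    int clusters;
    int cores;
    int threads;
};

/* Old behaviour: spread the CPUs round-robin over all NUMA nodes. */
static int64_t node_id_old(int idx, int num_nodes)
{
    return idx % num_nodes;
}

/* Proposed behaviour: derive the socket from idx, then wrap on num_nodes. */
static int64_t node_id_new(const struct smp_topo *t, int idx, int num_nodes)
{
    int cpus_per_socket = t->dies * t->clusters * t->cores * t->threads;
    int socket_id = idx / cpus_per_socket;

    return socket_id % num_nodes;
}
```

For idx 0..5 the old mapping yields nodes 0,1,2,3,4,5, while the new one
yields 0,0,0,1,1,1. The modulo only matters when there are fewer NUMA
nodes than sockets.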

>>>    }
>>>
>>>    
>>>>>    }
>>>>>    static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
>>>>   

Thanks,
Gavin

Igor Mammedov Feb. 28, 2022, 10:54 a.m. UTC | #12

On Mon, 28 Feb 2022 12:26:53 +0800
Gavin Shan <gshan@redhat.com> wrote:

> Hi Igor,
> 
> On 2/25/22 6:03 PM, Igor Mammedov wrote:
> > On Fri, 25 Feb 2022 16:41:43 +0800
> > Gavin Shan <gshan@redhat.com> wrote:  
> >> On 2/17/22 10:14 AM, Gavin Shan wrote:  
> >>> On 1/26/22 5:14 PM, Igor Mammedov wrote:  
> >>>> On Wed, 26 Jan 2022 13:24:10 +0800
> >>>> Gavin Shan <gshan@redhat.com> wrote:
> >>>>     
> >>>>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
> >>>>> when it isn't provided explicitly. However, the CPU topology isn't fully
> >>>>> considered in the default association and it causes CPU topology broken
> >>>>> warnings on booting Linux guest.
> >>>>>
> >>>>> For example, the following warning messages are observed when the Linux guest
> >>>>> is booted with the following command lines.
> >>>>>
> >>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >>>>>     -accel kvm -machine virt,gic-version=host               \
> >>>>>     -cpu host                                               \
> >>>>>     -smp 6,sockets=2,cores=3,threads=1                      \
> >>>>>     -m 1024M,slots=16,maxmem=64G                            \
> >>>>>     -object memory-backend-ram,id=mem0,size=128M            \
> >>>>>     -object memory-backend-ram,id=mem1,size=128M            \
> >>>>>     -object memory-backend-ram,id=mem2,size=128M            \
> >>>>>     -object memory-backend-ram,id=mem3,size=128M            \
> >>>>>     -object memory-backend-ram,id=mem4,size=128M            \
> >>>>>     -object memory-backend-ram,id=mem5,size=384M            \
> >>>>>     -numa node,nodeid=0,memdev=mem0                         \
> >>>>>     -numa node,nodeid=1,memdev=mem1                         \
> >>>>>     -numa node,nodeid=2,memdev=mem2                         \
> >>>>>     -numa node,nodeid=3,memdev=mem3                         \
> >>>>>     -numa node,nodeid=4,memdev=mem4                         \
> >>>>>     -numa node,nodeid=5,memdev=mem5
> >>>>>            :
> >>>>>     alternatives: patching kernel code
> >>>>>     BUG: arch topology borken
> >>>>>     the CLS domain not a subset of the MC domain
> >>>>>     <the above error log repeats>
> >>>>>     BUG: arch topology borken
> >>>>>     the DIE domain not a subset of the NODE domain
> >>>>>
> >>>>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
> >>>>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
> >>>>> CPU#0/1/2 should be associated with same NUMA node because they're seated
> >>>>> in same socket.
> >>>>>
> >>>>> This fixes the issue by considering the socket when default CPU-to-NUMA
> >>>>> is given. With this applied, no more CPU topology broken warnings are seen
> >>>>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
> >>>>> no CPUs associated with NODE#2/3/4/5.  
> >>>>     
> >>>>>  From migration point of view it looks fine to me, and doesn't need a compat knob  
> >>>> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
> >>>> version those unless something is broken by it).
> >>>>
> >>>>     
> >>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >>>>> ---
> >>>>>    hw/arm/virt.c | 2 +-
> >>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> >>>>> index 141350bf21..b4a95522d3 100644
> >>>>> --- a/hw/arm/virt.c
> >>>>> +++ b/hw/arm/virt.c
> >>>>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
> >>>>>    static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> >>>>>    {
> >>>>> -    return idx % ms->numa_state->num_nodes;
> >>>>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);  
> >>>>
> >>>> I'd like for ARM folks to confirm whether above is correct
> >>>> (i.e. socket is NUMA node boundary and also if above topo vars
> >>>> could have odd values. Don't look at horribly complicated x86
> >>>> as example, but it showed that vendors could stash pretty much
> >>>> anything there, so we should consider it here as well and maybe
> >>>> forbid that in smp virt-arm parser)
> >>>>     
> >>>
> >>> After doing some investigation, I don't think the socket is NUMA node boundary.
> >>> Unfortunately, I didn't find it's documented like this in any documents after
> >>> checking device-tree specification, Linux CPU topology and NUMA binding documents.
> >>>
> >>> However, there are two options here according to Linux (guest) kernel code:
> >>> (A) socket is NUMA node boundary  (B) CPU die is NUMA node boundary. They are
> >>> equivalent as CPU die isn't supported on arm/virt machine. Besides, the topology
> >>> of one-to-one association between socket and NUMA node sounds natural and simplified.
> >>> So I think (A) is the best way to go.
> >>>
> >>> Another thing I want to explain here is how the changes affect the memory
> >>> allocation in Linux guest. Taking the command lines included in the commit
> >>> log as an example, the first two NUMA nodes are bound to CPUs while the other
> >>> 4 NUMA nodes are regarded as remote NUMA nodes to CPUs. The remote NUMA node
> >>> won't accommodate the memory allocation until the memory in the near (local)
> >>> NUMA node becomes exhausted. However, it's uncertain how the memory is hosted
> >>> if memory binding isn't applied.
> >>>
> >>> Besides, I think the code should be improved like below to avoid overflow on
> >>> ms->numa_state->num_nodes.
> >>>
> >>>    static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> >>>    {
> >>> -    return idx % ms->numa_state->num_nodes;
> >>> +    int node_idx;
> >>> +
> >>> +    node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> >>> +    return node_idx % ms->numa_state->num_nodes;  
> > 
> > using idx directly to deduce node looks a bit iffy
> > take x86_get_default_cpu_node_id() as an example,
> > it translates it uses idx to pick arch_id (APIC ID)
> > which has topology encoded into it and than translates
> > that to node boundary (pkg_id -> socket)
> > 
> > Probably the same should happen here.
> > 
> > PS:
> > may be a little on tangent to the topic but chunk above
> > mentions dies/clusters/cores/threads as possible attributes
> > for CPUs but virt_possible_cpu_arch_ids() says that only
> > has_thread_id = true
> > are supported, which looks broken to me.
> >   
> 
> The x86's APIC ID, where the CPU topology is encoded, is something
> ideal for arm/virt to follow here. With CPU topology embedded to
> struct CPUArchId::arch_id, lots of ACPI tables like MADT, SRAT, PPTT
> needs the corresponding update to expose it through "ACPI Processor UID"
> field in those ACPI tables. It's much more than what we want to fix
> the issue here because I don't see additional benefits to do it.
>
> Besides, the package or socket index is exactly determined by 'idx'
> on arm/virt. The CPU topology is exposed through ACPI PPTT, depending on
> ms->smp. The threads/cores/clusters/dies in the struct determines
> the indexes of threads ('idx') who belongs to the same socket.
> 
> Yes, the information maintained in ms->possible_cpus->cpu[i].props
> is inconsistent to ms->smp. It means ms->possible_cpus->cpu[i].props.thread_id
> is "CPU index" or "vcpu index", not "thread ID". Other fields like
> sockets/dies/clusters/cores in ms->possible_cpus->cpu[i].props are
> never used on arm/virt. However, the code changes included in this
> patch to fix the broken CPU topology issue is still correct, or we
> can enhance it like below in case 'has_socket_id' contains the correct
> information in the future.
> 
> static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
> {
> -    return idx % ms->numa_state->num_nodes;
> +    int socket_id;
> +
> +    if (ms->possible_cpus->cpus[idx].props.has_socket_id) {
> +        socket_id = ms->possible_cpus->cpus[idx].props.socket_id;
> +    } else {
> +        socket_id = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
> +    }
> +
> +    return socket_id % ms->numa_state->num_nodes;
> }

All of the above points out that we already have a bunch of scattered code
that calculates the topology in its own way every time.
Piling more code duplication on top of that, while a generic way to do it
exists, doesn't improve things at all. It just puts the burden of cleaning
up after you on someone else.
If it were a last-day security/regression fix, it might be fine, but
as it is, the issue can be fixed in a systematic way without adding
more duplication.

So if you are reluctant to fix all the already existing code, at a minimum
one could fix virt_possible_cpu_arch_ids() to initialize all currently
supported topology metadata once and then use it in
virt_get_default_cpu_node_id() instead of the old/new mix above.
Cleanup of other places can be done incrementally by follow-up patches.
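
One possible shape of that suggestion, as a standalone sketch rather than
actual QEMU code: fill in the per-CPU topology IDs once when the
possible-CPU list is built, and let the default-node helper read
socket_id from there. The types smp_topo and cpu_props and the functions
init_possible_cpus() and default_node_id() are hypothetical
simplifications; real code would operate on
ms->possible_cpus->cpus[i].props and ms->smp. Dies are left out because
they are not supported on arm/virt.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant ms->smp fields. */
struct smp_topo {
    int sockets;
    int clusters;
    int cores;
    int threads;
};

/* Hypothetical stand-in for the per-CPU properties. */
struct cpu_props {
    bool has_socket_id, has_cluster_id, has_core_id, has_thread_id;
    int64_t socket_id, cluster_id, core_id, thread_id;
};

/* Populate every topology level exactly once, from the CPU index. */
static void init_possible_cpus(struct cpu_props *cpus, int n,
                               const struct smp_topo *t)
{
    for (int i = 0; i < n; i++) {
        cpus[i].has_socket_id = true;
        cpus[i].socket_id = i / (t->clusters * t->cores * t->threads);
        cpus[i].has_cluster_id = true;
        cpus[i].cluster_id = (i / (t->cores * t->threads)) % t->clusters;
        cpus[i].has_core_id = true;
        cpus[i].core_id = (i / t->threads) % t->cores;
        cpus[i].has_thread_id = true;
        cpus[i].thread_id = i % t->threads;
    }
}

/* The default node is then a lookup, with no topology math repeated. */
static int64_t default_node_id(const struct cpu_props *cpus, int idx,
                               int num_nodes)
{
    return cpus[idx].socket_id % num_nodes;
}
```

In this sketch the topology arithmetic lives in one place, and any other
consumer could read the same per-CPU fields instead of recomputing them
from ms->smp.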


> >>>    }
> >>>
> >>>      
> >>>>>    }
> >>>>>    static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)  
> >>>>     
> 
> Thanks,
> Gavin
>

Gavin Shan March 1, 2022, 9:14 a.m. UTC | #13

Hi Igor,

On 2/28/22 6:54 PM, Igor Mammedov wrote:
> On Mon, 28 Feb 2022 12:26:53 +0800
> Gavin Shan <gshan@redhat.com> wrote:
>> On 2/25/22 6:03 PM, Igor Mammedov wrote:
>>> On Fri, 25 Feb 2022 16:41:43 +0800
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>> On 2/17/22 10:14 AM, Gavin Shan wrote:
>>>>> On 1/26/22 5:14 PM, Igor Mammedov wrote:
>>>>>> On Wed, 26 Jan 2022 13:24:10 +0800
>>>>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>>>      
>>>>>>> The default CPU-to-NUMA association is given by mc->get_default_cpu_node_id()
>>>>>>> when it isn't provided explicitly. However, the CPU topology isn't fully
>>>>>>> considered in the default association and it causes CPU topology broken
>>>>>>> warnings on booting Linux guest.
>>>>>>>
>>>>>>> For example, the following warning messages are observed when the Linux guest
>>>>>>> is booted with the following command lines.
>>>>>>>
>>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>>>>>      -accel kvm -machine virt,gic-version=host               \
>>>>>>>      -cpu host                                               \
>>>>>>>      -smp 6,sockets=2,cores=3,threads=1                      \
>>>>>>>      -m 1024M,slots=16,maxmem=64G                            \
>>>>>>>      -object memory-backend-ram,id=mem0,size=128M            \
>>>>>>>      -object memory-backend-ram,id=mem1,size=128M            \
>>>>>>>      -object memory-backend-ram,id=mem2,size=128M            \
>>>>>>>      -object memory-backend-ram,id=mem3,size=128M            \
>>>>>>>      -object memory-backend-ram,id=mem4,size=128M            \
>>>>>>>      -object memory-backend-ram,id=mem5,size=384M            \
>>>>>>>      -numa node,nodeid=0,memdev=mem0                         \
>>>>>>>      -numa node,nodeid=1,memdev=mem1                         \
>>>>>>>      -numa node,nodeid=2,memdev=mem2                         \
>>>>>>>      -numa node,nodeid=3,memdev=mem3                         \
>>>>>>>      -numa node,nodeid=4,memdev=mem4                         \
>>>>>>>      -numa node,nodeid=5,memdev=mem5
>>>>>>>             :
>>>>>>>      alternatives: patching kernel code
>>>>>>>      BUG: arch topology borken
>>>>>>>      the CLS domain not a subset of the MC domain
>>>>>>>      <the above error log repeats>
>>>>>>>      BUG: arch topology borken
>>>>>>>      the DIE domain not a subset of the NODE domain
>>>>>>>
>>>>>>> With current implementation of mc->get_default_cpu_node_id(), CPU#0 to CPU#5
>>>>>>> are associated with NODE#0 to NODE#5 separately. That's incorrect because
>>>>>>> CPU#0/1/2 should be associated with same NUMA node because they're seated
>>>>>>> in same socket.
>>>>>>>
>>>>>>> This fixes the issue by considering the socket when default CPU-to-NUMA
>>>>>>> is given. With this applied, no more CPU topology broken warnings are seen
>>>>>>> from the Linux guest. The 6 CPUs are associated with NODE#0/1, but there are
>>>>>>> no CPUs associated with NODE#2/3/4/5.
>>>>>>      
>>>>>>>   From migration point of view it looks fine to me, and doesn't need a compat knob
>>>>>> since NUMA data (on virt-arm) only used to construct ACPI tables (and we don't
>>>>>> version those unless something is broken by it).
>>>>>>
>>>>>>      
>>>>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>>>>> ---
>>>>>>>     hw/arm/virt.c | 2 +-
>>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>>>>>> index 141350bf21..b4a95522d3 100644
>>>>>>> --- a/hw/arm/virt.c
>>>>>>> +++ b/hw/arm/virt.c
>>>>>>> @@ -2499,7 +2499,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>>>>>>>     static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>>>>>     {
>>>>>>> -    return idx % ms->numa_state->num_nodes;
>>>>>>> +    return idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>>>>>>
>>>>>> I'd like for ARM folks to confirm whether above is correct
>>>>>> (i.e. socket is NUMA node boundary and also if above topo vars
>>>>>> could have odd values. Don't look at horribly complicated x86
>>>>>> as example, but it showed that vendors could stash pretty much
>>>>>> anything there, so we should consider it here as well and maybe
>>>>>> forbid that in smp virt-arm parser)
>>>>>>      
>>>>>
>>>>> After doing some investigation, I don't think the socket is NUMA node boundary.
>>>>> Unfortunately, I didn't find it's documented like this in any documents after
>>>>> checking device-tree specification, Linux CPU topology and NUMA binding documents.
>>>>>
>>>>> However, there are two options here according to Linux (guest) kernel code:
>>>>> (A) socket is NUMA node boundary  (B) CPU die is NUMA node boundary. They are
>>>>> equivalent as CPU die isn't supported on arm/virt machine. Besides, the topology
>>>>> of one-to-one association between socket and NUMA node sounds natural and simplified.
>>>>> So I think (A) is the best way to go.
>>>>>
>>>>> Another thing I want to explain here is how the changes affect the memory
>>>>> allocation in Linux guest. Taking the command lines included in the commit
>>>>> log as an example, the first two NUMA nodes are bound to CPUs while the other
>>>>> 4 NUMA nodes are regarded as remote NUMA nodes to CPUs. The remote NUMA node
>>>>> won't accommodate the memory allocation until the memory in the near (local)
>>>>> NUMA node becomes exhausted. However, it's uncertain how the memory is hosted
>>>>> if memory binding isn't applied.
>>>>>
>>>>> Besides, I think the code should be improved like below to avoid overflow on
>>>>> ms->numa_state->num_nodes.
>>>>>
>>>>>     static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>>>>>     {
>>>>> -    return idx % ms->numa_state->num_nodes;
>>>>> +    int node_idx;
>>>>> +
>>>>> +    node_idx = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>>>>> +    return node_idx % ms->numa_state->num_nodes;
>>>
>>> Using idx directly to deduce the node looks a bit iffy.
>>> Take x86_get_default_cpu_node_id() as an example:
>>> it uses idx to pick the arch_id (APIC ID),
>>> which has the topology encoded into it, and then translates
>>> that to the node boundary (pkg_id -> socket)
>>>
>>> Probably the same should happen here.
>>>
>>> PS:
>>> This may be a little tangential to the topic, but the chunk above
>>> mentions dies/clusters/cores/threads as possible attributes
>>> for CPUs, while virt_possible_cpu_arch_ids() says that only
>>> has_thread_id = true
>>> is supported, which looks broken to me.
>>>    
>>
>> The x86 APIC ID, where the CPU topology is encoded, is the ideal
>> model for arm/virt to follow here. With the CPU topology embedded in
>> struct CPUArchId::arch_id, lots of ACPI tables like MADT, SRAT and PPTT
>> need corresponding updates to expose it through the "ACPI Processor UID"
>> field in those tables. That's much more than what is needed to fix
>> the issue here, and I don't see additional benefits in doing it.
>>
>> Besides, the package or socket index is exactly determined by 'idx'
>> on arm/virt. The CPU topology is exposed through ACPI PPTT, depending on
>> ms->smp. The threads/cores/clusters/dies values in that struct determine
>> which thread indexes ('idx') belong to the same socket.
>>
>> Yes, the information maintained in ms->possible_cpus->cpus[i].props
>> is inconsistent with ms->smp. That is, ms->possible_cpus->cpus[i].props.thread_id
>> is the "CPU index" or "vcpu index", not the "thread ID". Other fields like
>> sockets/dies/clusters/cores in ms->possible_cpus->cpus[i].props are
>> never used on arm/virt. However, the code change included in this
>> patch to fix the broken CPU topology issue is still correct, or we
>> can enhance it like below in case 'has_socket_id' contains the correct
>> information in the future.
>>
>> static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>> {
>> -    return idx % ms->numa_state->num_nodes;
>> +    int socket_id;
>> +
>> +    if (ms->possible_cpus->cpus[idx].props.has_socket_id) {
>> +        socket_id = ms->possible_cpus->cpus[idx].props.socket_id;
>> +    } else {
>> +        socket_id = idx / (ms->smp.dies * ms->smp.clusters * ms->smp.cores * ms->smp.threads);
>> +    }
>> +
>> +    return socket_id % ms->numa_state->num_nodes;
>> }
> 
> All of the above points out that we already have a bunch of scattered code
> that calculates topology in its own way every time.
> Piling more code duplication on top of that, while a generic way to do it
> exists, doesn't improve things at all. It just puts the burden of cleaning
> up after you on someone else.
> If it were a last-minute security/regression fix, it might be fine, but
> as it is, the issue could be fixed in a systematic way without adding
> more duplication.
> 
> So if you are reluctant to fix all the already existing code, as a minimum
> one could fix virt_possible_cpu_arch_ids() to initialize all currently
> supported topology metadata once and then use it in
> virt_get_default_cpu_node_id() instead of the old/new mix above.
> Cleanup of other places can be done incrementally by follow-up patches.
> 
> 

Thanks for your comments and continuous reviews. I will fix all the
discussed issues in v2, including exposing the CPU topology through the
unified processor ID field in the various ACPI tables. It's pointless to
calculate the topology and the CPU-to-NUMA association in scattered
pieces of code.
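
As a rough illustration of that direction, here is a minimal standalone
sketch: the topology IDs are derived once from the linear CPU index and the
default node is then taken from the socket ID. The types topo_cfg and
topo_ids and both helper names are hypothetical stand-ins for ms->smp and
the per-CPU props, not QEMU's actual definitions, and dies are left out
because they aren't supported on arm/virt.

```c
#include <stdint.h>

/* Hypothetical stand-ins for ms->smp and a CPU's props entry; these are
 * illustration-only types, not QEMU's real definitions. */
struct topo_cfg {
    unsigned sockets, clusters, cores, threads;
};

struct topo_ids {
    int64_t socket_id, cluster_id, core_id, thread_id;
};

/* Split a linear CPU index into per-level IDs, innermost level first. */
static struct topo_ids topo_ids_from_index(const struct topo_cfg *cfg,
                                           unsigned idx)
{
    struct topo_ids ids;

    ids.thread_id = idx % cfg->threads;
    idx /= cfg->threads;
    ids.core_id = idx % cfg->cores;
    idx /= cfg->cores;
    ids.cluster_id = idx % cfg->clusters;
    idx /= cfg->clusters;
    ids.socket_id = idx % cfg->sockets;
    return ids;
}

/* Default association: one socket per NUMA node, wrapping around if there
 * are fewer nodes than sockets. num_nodes must be non-zero. */
static int64_t default_node_id(const struct topo_ids *ids, unsigned num_nodes)
{
    return ids->socket_id % num_nodes;
}
```

With sockets=2, clusters=1, cores=3, threads=1 and six NUMA nodes (values
picked only for illustration), CPUs 0-2 land on node 0 and CPUs 3-5 on
node 1, whereas idx % num_nodes would spread them over all six nodes.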

>>>>>     }
>>>>>
>>>>>       
>>>>>>>     }
>>>>>>>     static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
>>>>>>      

Thanks,
Gavin
