[v2] hw/arm/virt: Expose empty NUMA nodes through ACPI

Message ID 20211027052958.280741-1-gshan@redhat.com (mailing list archive)
State New, archived

Commit Message

Gavin Shan Oct. 27, 2021, 5:29 a.m. UTC
Empty NUMA nodes, where no memory resides, aren't exposed through
the ACPI SRAT table. This isn't the behaviour users expect, because
the corresponding memory node devices are missing from the guest
kernel, as the following example shows: the guest kernel doesn't
have the node information the user specified. However, memory can
still be hot added to these empty NUMA nodes even though they
aren't exposed.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host               \
  -cpu host -smp 4,sockets=2,cores=2,threads=1            \
  -m 1024M,slots=16,maxmem=64G                            \
  -object memory-backend-ram,id=mem0,size=512M            \
  -object memory-backend-ram,id=mem1,size=512M            \
  -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
  -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
  -numa node,nodeid=2                                     \
  -numa node,nodeid=3                                     \
     :
  guest# ls /sys/devices/system/node | grep node
  node0
  node1
  (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
  (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
  guest# ls /sys/devices/system/node | grep node
  node0
  node1
  node2
  guest# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
  Node 2 MemTotal:    1048576 kB

This patch exposes these empty NUMA nodes through the ACPI SRAT
table. With it applied, the corresponding memory node devices can
be found in the guest. Note that the hotpluggable capability is
explicitly given to these empty NUMA nodes for the sake of
completeness.

  guest# ls /sys/devices/system/node | grep node
  node0
  node1
  node2
  node3
  guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
  Node 3 MemTotal:    0 kB
  (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
  (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
  guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
  Node 3 MemTotal:    1048576 kB

Signed-off-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
---
v2: Improved commit log as suggested by Drew and Igor.
---
 hw/arm/virt-acpi-build.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Comments

Igor Mammedov Oct. 27, 2021, 3:40 p.m. UTC | #1
On Wed, 27 Oct 2021 13:29:58 +0800
Gavin Shan <gshan@redhat.com> wrote:

> The empty NUMA nodes, where no memory resides, aren't exposed
> through ACPI SRAT table. It's not user preferred behaviour because
> the corresponding memory node devices are missed from the guest
> kernel as the following example shows. It means the guest kernel
> doesn't have the node information as user specifies. However,
> memory can be still hot added to these empty NUMA nodes when
> they're not exposed.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host               \
>   -cpu host -smp 4,sockets=2,cores=2,threads=1            \
>   -m 1024M,slots=16,maxmem=64G                            \
>   -object memory-backend-ram,id=mem0,size=512M            \
>   -object memory-backend-ram,id=mem1,size=512M            \
>   -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
>   -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
>   -numa node,nodeid=2                                     \
>   -numa node,nodeid=3                                     \
>      :
>   guest# ls /sys/devices/system/node | grep node
>   node0
>   node1
>   (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>   (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>   guest# ls /sys/devices/system/node | grep node
>   node0
>   node1
>   node2
>   guest# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
>   Node 2 MemTotal:    1048576 kB
> 
> This exposes these empty NUMA nodes through ACPI SRAT table. With
> this applied, the corresponding memory node devices can be found
> from the guest. Note that the hotpluggable capability is explicitly
> given to these empty NUMA nodes for sake of completeness.
> 
>   guest# ls /sys/devices/system/node | grep node
>   node0
>   node1
>   node2
>   node3
>   guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>   Node 3 MemTotal:    0 kB
>   (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>   (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>   guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>   Node 3 MemTotal:    1048576 kB

I'm still not sure why this is necessary, or whether it's a good
idea. Is there real hardware that has such nodes?

SRAT is used to assign resources to nodes; I haven't seen it used
as a means to describe an empty node anywhere in the spec.
(Perhaps we should not allow empty nodes on the QEMU CLI at all.)

Then, if we really need this, why is it done only for ARM
and not for x86?

> Signed-off-by: Gavin Shan <gshan@redhat.com>
> Reviewed-by: Andrew Jones <drjones@redhat.com>
> ---
> v2: Improved commit log as suggested by Drew and Igor.
> ---
>  hw/arm/virt-acpi-build.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 674f902652..a4c95b2f64 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -526,6 +526,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
>      AcpiTable table = { .sig = "SRAT", .rev = 3, .oem_id = vms->oem_id,
>                          .oem_table_id = vms->oem_table_id };
> +    MemoryAffinityFlags flags;
>  
>      acpi_table_begin(&table, table_data);
>      build_append_int_noprefix(table_data, 1, 4); /* Reserved */
> @@ -547,12 +548,15 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>  
>      mem_base = vms->memmap[VIRT_MEM].base;
>      for (i = 0; i < ms->numa_state->num_nodes; ++i) {
> -        if (ms->numa_state->nodes[i].node_mem > 0) {
> -            build_srat_memory(table_data, mem_base,
> -                              ms->numa_state->nodes[i].node_mem, i,
> -                              MEM_AFFINITY_ENABLED);
> -            mem_base += ms->numa_state->nodes[i].node_mem;
> +        if (ms->numa_state->nodes[i].node_mem) {
> +            flags = MEM_AFFINITY_ENABLED;
> +        } else {
> +            flags = MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE;
>          }
> +
> +        build_srat_memory(table_data, mem_base,
> +                          ms->numa_state->nodes[i].node_mem, i, flags);
that will create a 0-length memory range which is "Enabled";
I'm not sure that's a safe thing to do.

As a side effect, this will also create empty ranges for memory-less
nodes that have only CPUs, where it's not necessary.

I'd really try to avoid adding empty ranges unless it's a hard
requirement described somewhere, or it fixes a bug that can't be
fixed elsewhere.

> +        mem_base += ms->numa_state->nodes[i].node_mem;
>      }
>  
>      if (ms->nvdimms_state->is_enabled) {
Gavin Shan Oct. 28, 2021, 11:32 a.m. UTC | #2
On 10/28/21 2:40 AM, Igor Mammedov wrote:
> On Wed, 27 Oct 2021 13:29:58 +0800
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> The empty NUMA nodes, where no memory resides, aren't exposed
>> through ACPI SRAT table. It's not user preferred behaviour because
>> the corresponding memory node devices are missed from the guest
>> kernel as the following example shows. It means the guest kernel
>> doesn't have the node information as user specifies. However,
>> memory can be still hot added to these empty NUMA nodes when
>> they're not exposed.
>>
>>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>    -accel kvm -machine virt,gic-version=host               \
>>    -cpu host -smp 4,sockets=2,cores=2,threads=1            \
>>    -m 1024M,slots=16,maxmem=64G                            \
>>    -object memory-backend-ram,id=mem0,size=512M            \
>>    -object memory-backend-ram,id=mem1,size=512M            \
>>    -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
>>    -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
>>    -numa node,nodeid=2                                     \
>>    -numa node,nodeid=3                                     \
>>       :
>>    guest# ls /sys/devices/system/node | grep node
>>    node0
>>    node1
>>    (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>>    (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>>    guest# ls /sys/devices/system/node | grep node
>>    node0
>>    node1
>>    node2
>>    guest# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
>>    Node 2 MemTotal:    1048576 kB
>>
>> This exposes these empty NUMA nodes through ACPI SRAT table. With
>> this applied, the corresponding memory node devices can be found
>> from the guest. Note that the hotpluggable capability is explicitly
>> given to these empty NUMA nodes for sake of completeness.
>>
>>    guest# ls /sys/devices/system/node | grep node
>>    node0
>>    node1
>>    node2
>>    node3
>>    guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>>    Node 3 MemTotal:    0 kB
>>    (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>>    (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>>    guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>>    Node 3 MemTotal:    1048576 kB
> 
> I'm still not sure why this is necessary and if it's a good idea,
> is there a real hardware that have such nodes?
> 
> SRAT is used to assign resources to nodes, I haven't seen it being
> used  as means to describe an empty node anywhere in the spec.
> (perhaps we should not allow empty nodes on QEMU CLI at all).
> 
> Then if we really need this, why it's done for ARM only
> and not for x86?
> 

I think this case exists on real hardware, where a memory DIMM
isn't plugged in but the node is still probed. Besides, this patch
addresses two issues:

(1) It makes the information in the guest kernel consistent with
    the command line, as the user expects. The sysfs entries for
    these empty NUMA nodes in the guest kernel then reflect what
    the user provided.

(2) Without this patch, the node numbering can be skewed from the
    user's perspective. As the example in the commit log shows,
    node3 should be created, but node2 is actually created. The
    patch reserves the NUMA node IDs in advance to avoid the issue.

     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
        :
     -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
     -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
     -numa node,nodeid=2                                     \
     -numa node,nodeid=3                                     \
     guest# ls /sys/devices/system/node | grep node
     node0  node1
     (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
     (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
     guest# ls /sys/devices/system/node | grep node
     node0  node1  node2

We definitely need empty NUMA nodes on the QEMU CLI. One case I
heard of is kdump developers specifying NUMA nodes and corresponding
pc-dimm objects to hot add memory and test its usability. I'm not
familiar with the ACPI specification, but the Linux kernel fetches
NUMA node IDs from the following ACPI tables on ARM64. It's possible
that empty NUMA node IDs are parsed from GENERIC_AFFINITY entries
or the SLIT table, if they exist:

     ACPI_SRAT_TYPE_MEMORY_AFFINITY
     ACPI_SRAT_TYPE_GENERIC_AFFINITY
     ACPI_SIG_SLIT                          # if it exists

So I think other architectures, including x86, need a similar
mechanism to expose NUMA node IDs through the ACPI tables. If you
agree, I can post additional patches to do this after this one is
settled and merged.

>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> Reviewed-by: Andrew Jones <drjones@redhat.com>
>> ---
>> v2: Improved commit log as suggested by Drew and Igor.
>> ---
>>   hw/arm/virt-acpi-build.c | 14 +++++++++-----
>>   1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
>> index 674f902652..a4c95b2f64 100644
>> --- a/hw/arm/virt-acpi-build.c
>> +++ b/hw/arm/virt-acpi-build.c
>> @@ -526,6 +526,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>       const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
>>       AcpiTable table = { .sig = "SRAT", .rev = 3, .oem_id = vms->oem_id,
>>                           .oem_table_id = vms->oem_table_id };
>> +    MemoryAffinityFlags flags;
>>   
>>       acpi_table_begin(&table, table_data);
>>       build_append_int_noprefix(table_data, 1, 4); /* Reserved */
>> @@ -547,12 +548,15 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>   
>>       mem_base = vms->memmap[VIRT_MEM].base;
>>       for (i = 0; i < ms->numa_state->num_nodes; ++i) {
>> -        if (ms->numa_state->nodes[i].node_mem > 0) {
>> -            build_srat_memory(table_data, mem_base,
>> -                              ms->numa_state->nodes[i].node_mem, i,
>> -                              MEM_AFFINITY_ENABLED);
>> -            mem_base += ms->numa_state->nodes[i].node_mem;
>> +        if (ms->numa_state->nodes[i].node_mem) {
>> +            flags = MEM_AFFINITY_ENABLED;
>> +        } else {
>> +            flags = MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE;
>>           }
>> +
>> +        build_srat_memory(table_data, mem_base,
>> +                          ms->numa_state->nodes[i].node_mem, i, flags);
> that will create 0 length memory range, which is "Enabled",
> I'm not sure it's safe thing to do.
> 
> As side effect this will also create empty ranges for memory-less
> nodes that have only CPUs, where it's not necessary.
> 
> I'd really try avoid adding empty ranges unless it hard requirement,
> described somewhere or fixes a bug that can't be fixed elsewhere.
> 

It's safe for Linux at least, as I tested on ARM64. The zero-length
memory block doesn't affect anything. Besides, a memory block that
has been marked as hotpluggable isn't actually handled by Linux on
ARM64 anyway.

Yes, the empty NUMA nodes are meaningless to CPUs until memory is
hot added to them.


>> +        mem_base += ms->numa_state->nodes[i].node_mem;
>>       }
>>   
>>       if (ms->nvdimms_state->is_enabled) {
> 

Thanks,
Gavin
Igor Mammedov Nov. 1, 2021, 8:44 a.m. UTC | #3
On Thu, 28 Oct 2021 22:32:09 +1100
Gavin Shan <gshan@redhat.com> wrote:

> On 10/28/21 2:40 AM, Igor Mammedov wrote:
> > On Wed, 27 Oct 2021 13:29:58 +0800
> > Gavin Shan <gshan@redhat.com> wrote:
> >   
> >> The empty NUMA nodes, where no memory resides, aren't exposed
> >> through ACPI SRAT table. It's not user preferred behaviour because
> >> the corresponding memory node devices are missed from the guest
> >> kernel as the following example shows. It means the guest kernel
> >> doesn't have the node information as user specifies. However,
> >> memory can be still hot added to these empty NUMA nodes when
> >> they're not exposed.
> >>
> >>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >>    -accel kvm -machine virt,gic-version=host               \
> >>    -cpu host -smp 4,sockets=2,cores=2,threads=1            \
> >>    -m 1024M,slots=16,maxmem=64G                            \
> >>    -object memory-backend-ram,id=mem0,size=512M            \
> >>    -object memory-backend-ram,id=mem1,size=512M            \
> >>    -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
> >>    -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
> >>    -numa node,nodeid=2                                     \
> >>    -numa node,nodeid=3                                     \
> >>       :
> >>    guest# ls /sys/devices/system/node | grep node
> >>    node0
> >>    node1
> >>    (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
> >>    (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
> >>    guest# ls /sys/devices/system/node | grep node
> >>    node0
> >>    node1
> >>    node2
> >>    guest# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
> >>    Node 2 MemTotal:    1048576 kB
> >>
> >> This exposes these empty NUMA nodes through ACPI SRAT table. With
> >> this applied, the corresponding memory node devices can be found
> >> from the guest. Note that the hotpluggable capability is explicitly
> >> given to these empty NUMA nodes for sake of completeness.
> >>
> >>    guest# ls /sys/devices/system/node | grep node
> >>    node0
> >>    node1
> >>    node2
> >>    node3
> >>    guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
> >>    Node 3 MemTotal:    0 kB
> >>    (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
> >>    (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
> >>    guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
> >>    Node 3 MemTotal:    1048576 kB  
> > 
> > I'm still not sure why this is necessary and if it's a good idea,
> > is there a real hardware that have such nodes?
> > 
> > SRAT is used to assign resources to nodes, I haven't seen it being
> > used  as means to describe an empty node anywhere in the spec.
> > (perhaps we should not allow empty nodes on QEMU CLI at all).
> > 
> > Then if we really need this, why it's done for ARM only
> > and not for x86?
> >   
> 
> I think this case exists in real hardware where the memory DIMM
> isn't plugged, but the node is still probed.
Then please provide SRAT tables from such hardware (ideally many of
them, to justify this as a de facto standard, since any single
machine's firmware could be buggy as well).

BTW, a fake memory node doesn't have to be present to make the guest
notice an existing NUMA node; it can be represented by affinity
entries as well (see the chapter "System Resource Affinity Table
(SRAT)" in the spec).

At the moment, I'm totally unconvinced that empty NUMA nodes
are valid to provide.


> Besides, this patch
> addresses two issues:
> 
> (1) To make the information contained in guest kernel consistent
>      to the command line as the user expects. It means the sysfs
>      entries for these empty NUMA nodes in guest kernel reflects
>      what user provided.
-numa/SRAT describe the boot-time configuration. So if you do not
specify empty nodes on the CLI, the number of nodes will be
consistent.

> (2) Without this patch, the node number can be twisted from user's
>      perspective. As the example included in the commit log, node3
>      should be created, but node2 is actually created. The patch
>      reserves the NUMA node IDs in advance to avoid the issue.
> 
>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>         :
>      -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
>      -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
>      -numa node,nodeid=2                                     \
>      -numa node,nodeid=3                                     \
>      guest# ls /sys/devices/system/node | grep node
>      node0  node1
>      (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>      (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>      guest# ls /sys/devices/system/node | grep node
>      node0  node1  node2

The same node numbering on the guest side and the QEMU CLI works
only by accident, not by design. In short, the numbers may not match
(in the Linux kernel's case it depends on the order in which the
nodes are enumerated); if you really want the numbers to match, fix
the guest kernel to use the proximity domain for numbering.

The important thing here is that resources are grouped
together according to proximity domain.

> We definitely need empty NUMA nodes from QEMU CLI. One case I heard
> of is kdump developer specify NUMA nodes and corresponding pc-dimm
> objects for memory hot-add and test the memory usability.

The question is whether the node has to be absolutely empty for
this. It should be possible to use a node that has CPUs assigned
to it.

Or add the pc-dimm at runtime, which should dynamically create
a NUMA node for it if the node wasn't described before.

> I'm not
> familiar with ACPI specification, but linux kernel fetches NUMA
> node IDs from the following ACPI tables on ARM64. It's possible
> the empty NUMA node IDs are parsed from GENERIC_AFFINITY or SLIT
> tables if they exist in the corresponding ACPI tables.
> 
>      ACPI_SRAT_TYPE_MEMORY_AFFINITY
>      ACPI_SRAT_TYPE_GENERIC_AFFINITY
Any possible entry type can be a source for a NUMA node;
if the guest doesn't handle this, it's a guest bug to fix.

>      ACPI_SIG_SLIT                          # if it exists
that's a recent addition tied to [1].
1) https://www.mail-archive.com/qemu-devel@nongnu.org/msg843453.html
If I recall correctly, the related QEMU patch was dropped.

> So I think other architectures including x86 needs similar mechanism
> to expose NUMA node IDs through ACPI table. If you agree, I can post
> additional patches to do this after this one is settled and merged.

I do not agree with the bogus-entries approach at all.
Sometimes we merge 'out of spec' changes, but those should be
backed by a 'must have' justification and tested with a wide
range of guest OSes (if Windows, with its stricter ACPI
implementation, boots on the virt-arm machine, it should be
tested as well).

So far I don't see a 'must have' aspect to bogus nodes,
only a convenience one (with a 'works by accident' caveat).

I'm sorry for being stingy about out-of-spec things, but they
are a typical source of regressions on the ACPI side, noticed
too late, when users come back with a broken guest after release.

> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> Reviewed-by: Andrew Jones <drjones@redhat.com>
> >> ---
> >> v2: Improved commit log as suggested by Drew and Igor.
> >> ---
> >>   hw/arm/virt-acpi-build.c | 14 +++++++++-----
> >>   1 file changed, 9 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> >> index 674f902652..a4c95b2f64 100644
> >> --- a/hw/arm/virt-acpi-build.c
> >> +++ b/hw/arm/virt-acpi-build.c
> >> @@ -526,6 +526,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> >>       const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
> >>       AcpiTable table = { .sig = "SRAT", .rev = 3, .oem_id = vms->oem_id,
> >>                           .oem_table_id = vms->oem_table_id };
> >> +    MemoryAffinityFlags flags;
> >>   
> >>       acpi_table_begin(&table, table_data);
> >>       build_append_int_noprefix(table_data, 1, 4); /* Reserved */
> >> @@ -547,12 +548,15 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> >>   
> >>       mem_base = vms->memmap[VIRT_MEM].base;
> >>       for (i = 0; i < ms->numa_state->num_nodes; ++i) {
> >> -        if (ms->numa_state->nodes[i].node_mem > 0) {
> >> -            build_srat_memory(table_data, mem_base,
> >> -                              ms->numa_state->nodes[i].node_mem, i,
> >> -                              MEM_AFFINITY_ENABLED);
> >> -            mem_base += ms->numa_state->nodes[i].node_mem;
> >> +        if (ms->numa_state->nodes[i].node_mem) {
> >> +            flags = MEM_AFFINITY_ENABLED;
> >> +        } else {
> >> +            flags = MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE;
> >>           }
> >> +
> >> +        build_srat_memory(table_data, mem_base,
> >> +                          ms->numa_state->nodes[i].node_mem, i, flags);  
> > that will create 0 length memory range, which is "Enabled",
> > I'm not sure it's safe thing to do.
> > 
> > As side effect this will also create empty ranges for memory-less
> > nodes that have only CPUs, where it's not necessary.
> > 
> > I'd really try avoid adding empty ranges unless it hard requirement,
> > described somewhere or fixes a bug that can't be fixed elsewhere.
> >   
> 
> It's safe to Linux at least as I tested on ARM64. The (zero) memory
> block doesn't affect anything. Besides, the memory block which has
> been marked as hotpluggable won't be handled in Linux on ARM64
> actually.
> 
> Yes, the empty NUMA nodes are meaningless to CPUs until memory is
> hot added into them.
> 
> >> +        mem_base += ms->numa_state->nodes[i].node_mem;
> >>       }
> >>   
> >>       if (ms->nvdimms_state->is_enabled) {  
> >   
> 
> Thanks,
> Gavin
>
Gavin Shan Nov. 1, 2021, 11:44 p.m. UTC | #4
On 11/1/21 7:44 PM, Igor Mammedov wrote:
> On Thu, 28 Oct 2021 22:32:09 +1100
> Gavin Shan <gshan@redhat.com> wrote: 
>> On 10/28/21 2:40 AM, Igor Mammedov wrote:
>>> On Wed, 27 Oct 2021 13:29:58 +0800
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>    
>>>> The empty NUMA nodes, where no memory resides, aren't exposed
>>>> through ACPI SRAT table. It's not user preferred behaviour because
>>>> the corresponding memory node devices are missed from the guest
>>>> kernel as the following example shows. It means the guest kernel
>>>> doesn't have the node information as user specifies. However,
>>>> memory can be still hot added to these empty NUMA nodes when
>>>> they're not exposed.
>>>>
>>>>     /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>>>     -accel kvm -machine virt,gic-version=host               \
>>>>     -cpu host -smp 4,sockets=2,cores=2,threads=1            \
>>>>     -m 1024M,slots=16,maxmem=64G                            \
>>>>     -object memory-backend-ram,id=mem0,size=512M            \
>>>>     -object memory-backend-ram,id=mem1,size=512M            \
>>>>     -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
>>>>     -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
>>>>     -numa node,nodeid=2                                     \
>>>>     -numa node,nodeid=3                                     \
>>>>        :
>>>>     guest# ls /sys/devices/system/node | grep node
>>>>     node0
>>>>     node1
>>>>     (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>>>>     (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>>>>     guest# ls /sys/devices/system/node | grep node
>>>>     node0
>>>>     node1
>>>>     node2
>>>>     guest# cat /sys/devices/system/node/node2/meminfo | grep MemTotal
>>>>     Node 2 MemTotal:    1048576 kB
>>>>
>>>> This exposes these empty NUMA nodes through ACPI SRAT table. With
>>>> this applied, the corresponding memory node devices can be found
>>>> from the guest. Note that the hotpluggable capability is explicitly
>>>> given to these empty NUMA nodes for sake of completeness.
>>>>
>>>>     guest# ls /sys/devices/system/node | grep node
>>>>     node0
>>>>     node1
>>>>     node2
>>>>     node3
>>>>     guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>>>>     Node 3 MemTotal:    0 kB
>>>>     (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>>>>     (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>>>>     guest# cat /sys/devices/system/node/node3/meminfo | grep MemTotal
>>>>     Node 3 MemTotal:    1048576 kB
>>>
>>> I'm still not sure why this is necessary and if it's a good idea,
>>> is there a real hardware that have such nodes?
>>>
>>> SRAT is used to assign resources to nodes, I haven't seen it being
>>> used  as means to describe an empty node anywhere in the spec.
>>> (perhaps we should not allow empty nodes on QEMU CLI at all).
>>>
>>> Then if we really need this, why it's done for ARM only
>>> and not for x86?
>>>    
>>
>> I think this case exists in real hardware where the memory DIMM
>> isn't plugged, but the node is still probed.
> Then please, provide SRAT table from such hw
> (a lot of them (to justify it as defacto 'standard')?
> since such hw firmware could be buggy as well).
> 
> BTW, fake memory node doesn't have to be present to make guest
> notice an existing numa node. it can be represented by affinity
> entries as well (see chapter:System Resource Affinity Table (SRAT)
> in the spec).
> 
> At the moment, I'm totally unconvinced that empty numa nodes
> are valid to provide.
> 

Igor, thanks for your continued review. I don't feel strongly that
the fake nodes should be presented, so please ignore this patch
until it's needed by virtio-mem; I can revisit it then. More
context is provided below to make the discussion complete.

> 
>> Besides, this patch
>> addresses two issues:
>>
>> (1) To make the information contained in guest kernel consistent
>>       to the command line as the user expects. It means the sysfs
>>       entries for these empty NUMA nodes in guest kernel reflects
>>       what user provided.
> -numa/SRAT describe boot time configuration.
> So if you do not specify empty nodes on CLI, then number of nodes
> would be consistent.
> 

Correct.

>> (2) Without this patch, the node number can be twisted from user's
>>       perspective. As the example included in the commit log, node3
>>       should be created, but node2 is actually created. The patch
>>       reserves the NUMA node IDs in advance to avoid the issue.
>>
>>       /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>>          :
>>       -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
>>       -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
>>       -numa node,nodeid=2                                     \
>>       -numa node,nodeid=3                                     \
>>       guest# ls /sys/devices/system/node | grep node
>>       node0  node1
>>       (qemu) object_add memory-backend-ram,id=hp-mem0,size=1G
>>       (qemu) device_add pc-dimm,id=hp-dimm0,node=3,memdev=hp-mem0
>>       guest# ls /sys/devices/system/node | grep node
>>       node0  node1  node2
> 
> The same node numbering on guest side and QEMU CLI works only
> by accident not by design. In short numbers may not match
> (in linux kernel case it depends on the order the nodes
> are enumerated), if you really want numbers to match then fix
> guest kernel to use proximity domain for numbering.
> 
> The important thing here is that resources are grouped
> together, according to proximity domain.
> 

The Linux ACPI driver avoids taking the proximity domain directly
as the node number, meaning the Linux kernel doesn't support sparse
node IDs.

>> We definitely need empty NUMA nodes from QEMU CLI. One case I heard
>> of is kdump developer specify NUMA nodes and corresponding pc-dimm
>> objects for memory hot-add and test the memory usability.
> 
> Question is if the node has to be absolutely empty for this?
> It should be possible to use a node that has CPUs assigned to it.
> 
> Or add pc-dimm at runtime, which should dynamically create
> a numa node for it if the node wasn't described before.
> 

Yes, I think so.

>> I'm not
>> familiar with ACPI specification, but linux kernel fetches NUMA
>> node IDs from the following ACPI tables on ARM64. It's possible
>> the empty NUMA node IDs are parsed from GENERIC_AFFINITY or SLIT
>> tables if they exist in the corresponding ACPI tables.
>>
>>       ACPI_SRAT_TYPE_MEMORY_AFFINITY
>>       ACPI_SRAT_TYPE_GENERIC_AFFINITY
> any possible entry type can be a source for numa node,
> if guest doesn't do this it's guest's bug to fix.
> 
>>       ACPI_SIG_SLIT                          # if it exists
> that's a recent addition tied to [1].
> 1) https://www.mail-archive.com/qemu-devel@nongnu.org/msg843453.html
> If I recall correctly, related QEMU patch was dropped.
> 
>> So I think other architectures including x86 needs similar mechanism
>> to expose NUMA node IDs through ACPI table. If you agree, I can post
>> additional patches to do this after this one is settled and merged.
> 
> I do not agree with the bogus-entries approach at all.
> Sometimes we merge 'out of spec' changes, but that should be
> backed by a 'must have' justification and tested with a wide
> range of guest OSes (if Windows (with its stricter ACPI impl.)
> boots on the virt-arm machine, it should be tested as well).
> 
> So far I don't see a 'must have' aspect in bogus nodes,
> only a convenience one (with a 'works by accident' caveat).
> 
> I'm sorry for being stingy about out-of-spec things,
> but that is a typical source of regressions on the ACPI side,
> which are noticed too late when users come back with a broken
> guest after release.
> 

Yeah, I agree. I don't have strong sense to expose these empty nodes
for now. Please ignore the patch.

>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> Reviewed-by: Andrew Jones <drjones@redhat.com>
>>>> ---
>>>> v2: Improved commit log as suggested by Drew and Igor.
>>>> ---
>>>>    hw/arm/virt-acpi-build.c | 14 +++++++++-----
>>>>    1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
>>>> index 674f902652..a4c95b2f64 100644
>>>> --- a/hw/arm/virt-acpi-build.c
>>>> +++ b/hw/arm/virt-acpi-build.c
>>>> @@ -526,6 +526,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>>>        const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
>>>>        AcpiTable table = { .sig = "SRAT", .rev = 3, .oem_id = vms->oem_id,
>>>>                            .oem_table_id = vms->oem_table_id };
>>>> +    MemoryAffinityFlags flags;
>>>>    
>>>>        acpi_table_begin(&table, table_data);
>>>>        build_append_int_noprefix(table_data, 1, 4); /* Reserved */
>>>> @@ -547,12 +548,15 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>>>>    
>>>>        mem_base = vms->memmap[VIRT_MEM].base;
>>>>        for (i = 0; i < ms->numa_state->num_nodes; ++i) {
>>>> -        if (ms->numa_state->nodes[i].node_mem > 0) {
>>>> -            build_srat_memory(table_data, mem_base,
>>>> -                              ms->numa_state->nodes[i].node_mem, i,
>>>> -                              MEM_AFFINITY_ENABLED);
>>>> -            mem_base += ms->numa_state->nodes[i].node_mem;
>>>> +        if (ms->numa_state->nodes[i].node_mem) {
>>>> +            flags = MEM_AFFINITY_ENABLED;
>>>> +        } else {
>>>> +            flags = MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE;
>>>>            }
>>>> +
>>>> +        build_srat_memory(table_data, mem_base,
>>>> +                          ms->numa_state->nodes[i].node_mem, i, flags);
>>> that will create 0 length memory range, which is "Enabled",
>>> I'm not sure it's safe thing to do.
>>>
>>> As side effect this will also create empty ranges for memory-less
>>> nodes that have only CPUs, where it's not necessary.
>>>
>>> I'd really try avoid adding empty ranges unless it hard requirement,
>>> described somewhere or fixes a bug that can't be fixed elsewhere.
>>>    
>>
>> It's safe for Linux, at least as I tested on ARM64. The (zero-length)
>> memory block doesn't affect anything. Besides, a memory block which has
>> been marked as hotpluggable won't actually be handled by Linux on
>> ARM64.
>>
>> Yes, the empty NUMA nodes are meaningless to CPUs until memory is
>> hot added into them.
>>
>>>> +        mem_base += ms->numa_state->nodes[i].node_mem;
>>>>        }
>>>>    
>>>>        if (ms->nvdimms_state->is_enabled) {
>>>    
>>

Thanks,
Gavin
Andrew Jones Nov. 2, 2021, 7:39 a.m. UTC | #5
On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:
> 
> Yeah, I agree. I don't have strong sense to expose these empty nodes
> for now. Please ignore the patch.
>

So was describing empty numa nodes on the command line ever a reasonable
thing to do? What happens on x86 machine types when describing empty numa
nodes? I'm starting to think that the solution all along was just to
error out when a numa node has memory size = 0...

Thanks,
drew
Gavin Shan Nov. 5, 2021, 12:47 p.m. UTC | #6
Hi Drew and Igor,

On 11/2/21 6:39 PM, Andrew Jones wrote:
> On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:
>>
>> Yeah, I agree. I don't have strong sense to expose these empty nodes
>> for now. Please ignore the patch.
>>
> 
> So were describing empty numa nodes on the command line ever a reasonable
> thing to do? What happens on x86 machine types when describing empty numa
> nodes? I'm starting to think that the solution all along was just to
> error out when a numa node has memory size = 0...
> 

Sorry for the delay; I spent a few days looking into the Linux virtio-mem
driver. I'm afraid we still need this patch for ARM64. I don't think x86
has this issue, even though I didn't experiment on x86. For example, I
have the following command line. The hot-added memory is put into node#0
instead of node#2, which is wrong.

There are several bitmaps tracking node states in the Linux kernel. One of
them is @possible_map, which tracks the nodes that are available but don't
have to be online. @possible_map is derived from the following ACPI tables:

   ACPI_SRAT_TYPE_MEMORY_AFFINITY
   ACPI_SRAT_TYPE_GENERIC_AFFINITY
   ACPI_SIG_SLIT                          # if it exists, i.e. when the optional
                                          # distance map is provided on the QEMU side

Note: Drew might ask why we have node#2 in "/sys/devices/system/node" again.
hw/arm/virt-acpi-build.c::build_srat() creates an additional entry in the ACPI
SRAT table whose PXM is 3 (ms->numa_state->num_nodes - 1) in this case, but
the Linux kernel assigns node#2 to it.
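
To illustrate that numbering: the guest hands out logical node IDs in the
order proximity domains are first seen in SRAT, so a sparse PXM set like
{0, 1, 3} yields nodes {0, 1, 2}. The following is a simplified sketch of
that behavior (modeled loosely on Linux's acpi_map_pxm_to_node(); names
and details here are illustrative, not the actual kernel code):

```c
#include <assert.h>
#include <string.h>

/*
 * Simplified model (not actual kernel code): logical node IDs are
 * handed out in the order proximity domains (PXMs) first appear in
 * the SRAT, not by their numeric value.
 */
#define MAX_PXM_DOMAINS 64

static int pxm_to_node_map[MAX_PXM_DOMAINS];
static int nodes_found;

static void reset_pxm_map(void)
{
    /* memset with -1 sets every int entry to -1 ("unmapped") */
    memset(pxm_to_node_map, -1, sizeof(pxm_to_node_map));
    nodes_found = 0;
}

static int map_pxm_to_node(int pxm)
{
    if (pxm < 0 || pxm >= MAX_PXM_DOMAINS) {
        return -1;
    }
    if (pxm_to_node_map[pxm] == -1) {
        pxm_to_node_map[pxm] = nodes_found++;  /* first-seen order */
    }
    return pxm_to_node_map[pxm];
}
```

With SRAT entries for PXMs 0, 1 and 3 (as in the command line below), this
hands out nodes 0, 1 and 2, matching the observed node#2.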

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
   -accel kvm -machine virt,gic-version=host               \
   -cpu host -smp 4,sockets=2,cores=2,threads=1            \
   -m 1024M,slots=16,maxmem=64G                            \
   -object memory-backend-ram,id=mem0,size=512M            \
   -object memory-backend-ram,id=mem1,size=512M            \
   -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
   -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
   -numa node,nodeid=2 -numa node,nodeid=3                 \
   -object memory-backend-ram,id=vmem0,size=512M           \
   -device virtio-mem-pci,id=vm0,memdev=vmem0,node=2,requested-size=0 \
   -object memory-backend-ram,id=vmem1,size=512M           \
   -device virtio-mem-pci,id=vm1,memdev=vmem1,node=3,requested-size=0
      :
   # ls  /sys/devices/system/node | grep node
   node0
   node1
   node2
   # cat /proc/meminfo | grep MemTotal\:
   MemTotal:        1003104 kB
   # cat /sys/devices/system/node/node0/meminfo | grep MemTotal\:
   Node 0 MemTotal: 524288 kB

   (qemu) qom-set vm0 requested-size 512M
   # cat /proc/meminfo | grep MemTotal\:
   MemTotal:        1527392 kB
   # cat /sys/devices/system/node/node0/meminfo | grep MemTotal\:
   Node 0 MemTotal: 1013652 kB

Run the above test again after the patch is applied. The hot-added memory
is put into node#2, as the user expected.

   # ls  /sys/devices/system/node | grep node
   node0
   node1
   node2
   node3
   # cat /proc/meminfo | grep MemTotal\:
   MemTotal:        1003100 kB
   # cat /sys/devices/system/node/node2/meminfo | grep MemTotal\:
   Node 2 MemTotal: 0 kB

   (qemu) qom-set vm0 requested-size 512M
   # cat /proc/meminfo | grep MemTotal\:
   MemTotal:        1527388 kB
   # cat /sys/devices/system/node/node2/meminfo | grep MemTotal\:
   Node 2 MemTotal: 524288 kB

Thanks,
Gavin
Igor Mammedov Nov. 10, 2021, 10:33 a.m. UTC | #7
On Fri, 5 Nov 2021 23:47:37 +1100
Gavin Shan <gshan@redhat.com> wrote:

> Hi Drew and Igor,
> 
> On 11/2/21 6:39 PM, Andrew Jones wrote:
> > On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:  
> >>
> >> Yeah, I agree. I don't have strong sense to expose these empty nodes
> >> for now. Please ignore the patch.
> >>  
> > 
> > So were describing empty numa nodes on the command line ever a reasonable
> > thing to do? What happens on x86 machine types when describing empty numa
> > nodes? I'm starting to think that the solution all along was just to
> > error out when a numa node has memory size = 0...

Memory-less nodes are fine as long as there is another type of device
that describes a node (apic/gic/...).
But there is no way in the spec to describe completely empty nodes,
and I dislike adding out-of-spec entries just to fake an empty node.


> Sorry for the delay as I spent a few days looking into linux virtio-mem
> driver. I'm afraid we still need this patch for ARM64. I don't think x86

does it behave the same way if using pc-dimm hotplug instead of virtio-mem?

CCing David,
as it might be a virtio-mem issue.

PS:
maybe for virtio-mem-pci we need to add a GENERIC_AFFINITY entry into SRAT
and describe it as a PCI device (we don't do that yet, if I'm not mistaken).
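
For reference, the entry type being suggested is the SRAT Generic Initiator
Affinity structure (type 5) introduced in ACPI 6.3. A sketch of its layout
(illustrative only; QEMU builds tables with build_append_int_noprefix()
rather than packed structs):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the ACPI 6.3 SRAT Generic Initiator Affinity structure
 * (type 5), 32 bytes total. Field names paraphrase the spec.
 */
struct srat_generic_initiator {
    uint8_t  type;               /* 5 = Generic Initiator Affinity */
    uint8_t  length;             /* always 32 */
    uint8_t  reserved1;
    uint8_t  device_handle_type; /* 0 = ACPI device handle, 1 = PCI */
    uint32_t proximity_domain;
    uint8_t  device_handle[16];  /* for PCI: segment, bus, dev/fn, ... */
    uint32_t flags;              /* bit 0: enabled */
    uint32_t reserved2;
};
```

Describing the virtio-mem-pci device this way would give the otherwise
empty node a spec-compliant SRAT presence, at the cost of requiring
Generic Initiator parsing support in the guest.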

> has this issue even though I didn't experiment on X86. For example, I
> have the following command lines. The hot added memory is put into node#0
> instead of node#2, which is wrong.
> 
> There are several bitmaps tracking the node states in Linux kernel. One of
> them is @possible_map, which tracks the nodes available, but don't have to
> be online. @passible_map is sorted out from the following ACPI table.
> 
>    ACPI_SRAT_TYPE_MEMORY_AFFINITY
>    ACPI_SRAT_TYPE_GENERIC_AFFINITY
>    ACPI_SIG_SLIT                          # if it exists when optional distance map
>                                           # is provided on QEMU side.
> 
> Note: Drew might ask why we have node#2 in "/sys/devices/system/node" again.
> hw/arm/virt-acpi-build.c::build_srat() creates additional node in ACPI SRAT
> table and the node's PXM is 3 ((ms->numa_state->num_nodes - 1)) in this case,
> but linux kernel assigns node#2 to it.
> 
>    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>    -accel kvm -machine virt,gic-version=host               \
>    -cpu host -smp 4,sockets=2,cores=2,threads=1            \
>    -m 1024M,slots=16,maxmem=64G                            \
>    -object memory-backend-ram,id=mem0,size=512M            \
>    -object memory-backend-ram,id=mem1,size=512M            \
>    -numa node,nodeid=0,cpus=0-1,memdev=mem0                \
>    -numa node,nodeid=1,cpus=2-3,memdev=mem1                \
>    -numa node,nodeid=2 -numa node,nodeid=3                 \
>    -object memory-backend-ram,id=vmem0,size=512M           \
>    -device virtio-mem-pci,id=vm0,memdev=vmem0,node=2,requested-size=0 \
>    -object memory-backend-ram,id=vmem1,size=512M           \
>    -device virtio-mem-pci,id=vm1,memdev=vmem1,node=3,requested-size=0
>       :
>    # ls  /sys/devices/system/node | grep node
>    node0
>    node1
>    node2
>    # cat /proc/meminfo | grep MemTotal\:
>    MemTotal:        1003104 kB
>    # cat /sys/devices/system/node/node0/meminfo | grep MemTotal\:
>    Node 0 MemTotal: 524288 kB
> 
>    (qemu) qom-set vm0 requested-size 512M
>    # cat /proc/meminfo | grep MemTotal\:
>    MemTotal:        1527392 kB
>    # cat /sys/devices/system/node/node0/meminfo | grep MemTotal\:
>    Node 0 MemTotal: 1013652 kB
> 
> Try above test after the patch is applied. The hot added memory is
> put into node#2 correctly as the user expected.
> 
>    # ls  /sys/devices/system/node | grep node
>    node0
>    node1
>    node2
>    node3
>    # cat /proc/meminfo | grep MemTotal\:
>    MemTotal:        1003100 kB
>    # cat /sys/devices/system/node/node2/meminfo | grep MemTotal\:
>    Node 2 MemTotal: 0 kB
> 
>    (qemu) qom-set vm0 requested-size 512M
>    # cat /proc/meminfo | grep MemTotal\:
>    MemTotal:        1527388 kB
>    # cat /sys/devices/system/node/node2/meminfo | grep MemTotal\:
>    Node 2 MemTotal: 524288 kB
> 
> Thanks,
> Gavin
> 
> 
>    
>
David Hildenbrand Nov. 10, 2021, 11:01 a.m. UTC | #8
On 10.11.21 11:33, Igor Mammedov wrote:
> On Fri, 5 Nov 2021 23:47:37 +1100
> Gavin Shan <gshan@redhat.com> wrote:
> 
>> Hi Drew and Igor,
>>
>> On 11/2/21 6:39 PM, Andrew Jones wrote:
>>> On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:  
>>>>
>>>> Yeah, I agree. I don't have strong sense to expose these empty nodes
>>>> for now. Please ignore the patch.
>>>>  
>>>
>>> So were describing empty numa nodes on the command line ever a reasonable
>>> thing to do? What happens on x86 machine types when describing empty numa
>>> nodes? I'm starting to think that the solution all along was just to
>>> error out when a numa node has memory size = 0...
> 
> memory less nodes are fine as long as there is another type of device
> that describes  a node (apic/gic/...).
> But there is no way in spec to describe completely empty nodes,
> and I dislike adding out of spec entries just to fake an empty node.
> 

There are reasonable *upcoming* use cases for initially completely empty
NUMA nodes with virtio-mem: being able to expose a dynamic amount of
performance-differentiated memory to a VM. I don't know of any existing
use cases that would require that as of now.

Examples include exposing HBM or PMEM to the VM. Just like on real HW,
this memory is exposed via cpu-less, special nodes. In contrast to real
HW, the memory is hotplugged later (I don't think HW supports hotplug
like that yet, but it might just be a matter of time).

The same should be true when using DIMMs instead of virtio-mem in this
example.

> 
>> Sorry for the delay as I spent a few days looking into linux virtio-mem
>> driver. I'm afraid we still need this patch for ARM64. I don't think x86
> 
> does it behave the same way is using pc-dimm hotplug instead of virtio-mem?
> 
> CCing David
> as it might be virtio-mem issue.

Can someone share the details why it's a problem on arm64 but not on
x86-64? I assume this really only applies when having a dedicated, empty
node -- correct?

> 
> PS:
> maybe for virtio-mem-pci, we need to add GENERIC_AFFINITY entry into SRAT
> and describe it as PCI device (we don't do that yet if I'm no mistaken).

virtio-mem exposes the PXM itself, and avoids exposing its memory via any
kind of platform-specific firmware maps. The PXM gets translated in the
guest accordingly. For now there was no need to expose this in SRAT --
the SRAT is really only used to expose the maximum possible PFN to the
VM, just like it would have to be used to expose "this is a possible node".

Of course, we could use any other paravirtualized interface to expose
both information. For example, on s390x, I'll have to introduce a new
hypercall to query the "device memory region" to detect the maximum
possible PFN, because existing interfaces don't allow for that. For now
we're reusing SRAT to expose "maximum possible PFN" simply because it's
easy to re-use.
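
A rough model of that SRAT use: the guest can take the end of the highest
memory affinity range, which includes the hotpluggable region QEMU
advertises, as a bound on the maximum possible PFN. This is an illustrative
sketch, not the actual virtio-mem driver code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch: derive the maximum possible PFN from SRAT
 * memory affinity ranges (including the hotpluggable region).
 */
struct srat_mem_entry {
    uint64_t base;      /* range base address */
    uint64_t length;    /* range length in bytes */
};

static uint64_t max_possible_pfn(const struct srat_mem_entry *entries,
                                 int count, unsigned page_shift)
{
    uint64_t max_end = 0;
    int i;

    for (i = 0; i < count; i++) {
        uint64_t end = entries[i].base + entries[i].length;
        if (end > max_end) {
            max_end = end;
        }
    }
    /* number of possible PFNs, i.e. highest PFN + 1 */
    return max_end >> page_shift;
}
```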

But I assume that hotplugging a DIMM to an empty node will have similar
issues on arm64.

> 
>> has this issue even though I didn't experiment on X86. For example, I
>> have the following command lines. The hot added memory is put into node#0
>> instead of node#2, which is wrong.

I assume Linux will always fall back to node 0 if node X is not possible
when translating the PXM.
Igor Mammedov Nov. 12, 2021, 1:27 p.m. UTC | #9
On Wed, 10 Nov 2021 12:01:11 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 10.11.21 11:33, Igor Mammedov wrote:
> > On Fri, 5 Nov 2021 23:47:37 +1100
> > Gavin Shan <gshan@redhat.com> wrote:
> >   
> >> Hi Drew and Igor,
> >>
> >> On 11/2/21 6:39 PM, Andrew Jones wrote:  
> >>> On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:    
> >>>>
> >>>> Yeah, I agree. I don't have strong sense to expose these empty nodes
> >>>> for now. Please ignore the patch.
> >>>>    
> >>>
> >>> So were describing empty numa nodes on the command line ever a reasonable
> >>> thing to do? What happens on x86 machine types when describing empty numa
> >>> nodes? I'm starting to think that the solution all along was just to
> >>> error out when a numa node has memory size = 0...  
> > 
> > memory less nodes are fine as long as there is another type of device
> > that describes  a node (apic/gic/...).
> > But there is no way in spec to describe completely empty nodes,
> > and I dislike adding out of spec entries just to fake an empty node.
> >   
> 
> There are reasonable *upcoming* use cases for initially completely empty
> NUMA nodes with virtio-mem: being able to expose a dynamic amount of
> performance-differentiated memory to a VM. I don't know of any existing
> use cases that would require that as of now.
> 
> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
> this memory is exposed via cpu-less, special nodes. In contrast to real
> HW, the memory is hotplugged later (I don't think HW supports hotplug
> like that yet, but it might just be a matter of time).

I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT,
some by MEMORY entries. Or by nodes created dynamically, like with normal
hotplug memory.


> The same should be true when using DIMMs instead of virtio-mem in this
> example.
> 
> >   
> >> Sorry for the delay as I spent a few days looking into linux virtio-mem
> >> driver. I'm afraid we still need this patch for ARM64. I don't think x86  
> > 
> > does it behave the same way is using pc-dimm hotplug instead of virtio-mem?
> > 
> > CCing David
> > as it might be virtio-mem issue.  
> 
> Can someone share the details why it's a problem on arm64 but not on
> x86-64? I assume this really only applies when having a dedicated, empty
> node -- correct?
> 
> > 
> > PS:
> > maybe for virtio-mem-pci, we need to add GENERIC_AFFINITY entry into SRAT
> > and describe it as PCI device (we don't do that yet if I'm no mistaken).  
> 
> virtio-mem exposes the PXM itself, and avoids exposing it memory via any
> kind of platform specific firmware maps. The PXM gets translated in the
> guest accordingly. For now there was no need to expose this in SRAT --
> the SRAT is really only used to expose the maximum possible PFN to the
> VM, just like it would have to be used to expose "this is a possible node".
> 
> Of course, we could use any other paravirtualized interface to expose
> both information. For example, on s390x, I'll have to introduce a new
> hypercall to query the "device memory region" to detect the maximum
> possible PFN, because existing interfaces don't allow for that. For now
> we're ruinning SRAT to expose "maximum possible PFN" simply because it's
> easy to re-use.
> 
> But I assume that hotplugging a DIMM to an empty node will have similar
> issues on arm64.
> 
> >   
> >> has this issue even though I didn't experiment on X86. For example, I
> >> have the following command lines. The hot added memory is put into node#0
> >> instead of node#2, which is wrong.  
> 
> I assume Linux will always fallback to node 0 if node X is not possible
> when translating the PXM.

I tested how x86 behaves with pc-dimm, and it seems that a fc43 guest
works only sometimes.
cmd:
  -numa node,memdev=mem,cpus=0 -numa node,cpus=1 -numa node -numa node

1: hotplug into the empty last node creates a new node dynamically
2: hotplug into an intermediate empty node (last-1) is broken; memory goes into the first node

We should check if it's possible to fix the guest instead of adding bogus SRAT entries.
David Hildenbrand Nov. 16, 2021, 11:11 a.m. UTC | #10
>>
>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>> this memory is exposed via cpu-less, special nodes. In contrast to real
>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>> like that yet, but it might just be a matter of time).
> 
> I suppose some of that maybe covered by GENERIC_AFFINITY entries in SRAT
> some by MEMORY entries. Or nodes created dynamically like with normal
> hotplug memory.
> 

I'm certainly no SRAT expert, but it seems like something similar can
happen under VMware:

https://lkml.kernel.org/r/BAE95F0C-FAA7-40C6-A0D6-5049B1207A27@vmware.com

"VM was powered on with 4 vCPUs (4 NUMA nodes) and 4GB memory.
ACPI SRAT reports 128 possible CPUs and 128 possible NUMA nodes."

Note that that discussion is about hotplugging CPUs to memory-less,
hotplugged nodes.

But there seems to be some way to expose possible NUMA nodes. Maybe
that's via GENERIC_AFFINITY.
Jonathan Cameron Nov. 17, 2021, 2:30 p.m. UTC | #11
On Tue, 16 Nov 2021 12:11:29 +0100
David Hildenbrand <david@redhat.com> wrote:

> >>
> >> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
> >> this memory is exposed via cpu-less, special nodes. In contrast to real
> >> HW, the memory is hotplugged later (I don't think HW supports hotplug
> >> like that yet, but it might just be a matter of time).  
> > 
> > I suppose some of that maybe covered by GENERIC_AFFINITY entries in SRAT
> > some by MEMORY entries. Or nodes created dynamically like with normal
> > hotplug memory.
> >   

The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
to Generic Initiator Affinity.  So no good for memory. This is meant for
representation of accelerators / network cards etc so you can get the NUMA
characteristics for them accessing Memory in other nodes.

My understanding of 'traditional' memory hotplug is that typically the
PA into which memory is hotplugged is known at boot time whether or not
the memory is physically present.  As such, you present that in SRAT and rely
on the EFI memory map / other information sources to know the memory isn't
there.  When it is hotplugged later the address is looked up in SRAT to identify
the NUMA node.

That model is less useful for more flexible entities like virtio-mem or
indeed physical hardware such as CXL type 3 memory devices which typically
need their own nodes.

For the CXL type 3 option, the current proposal is to use the CXL table entries
representing Physical Address space regions to work out how many NUMA nodes
are needed and just create extra ones at boot.
https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com

It's a heuristic, as we might need more nodes to represent things well kernel
side, but it's better than nothing and less effort than true dynamic node creation.
If you chase through the earlier versions of Alison's patch you will find some
discussion of that.

I wonder if virtio-mem should just grow a CDAT instance via a DOE?

That would make all this stuff discoverable via PCI config space rather than ACPI.
CDAT is at:
https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
(nothing stops others using it though AFAIK).

However, then we'd actually need either dynamic node creation in the OS, or
some sort of reserved pool of extra nodes.  Long term it may be the most
flexible option.

Jonathan

> 
> I'm certainly no SRAT expert, but seems like under VMWare something
> similar can happen:
> 
> https://lkml.kernel.org/r/BAE95F0C-FAA7-40C6-A0D6-5049B1207A27@vmware.com
> 
> "VM was powered on with 4 vCPUs (4 NUMA nodes) and 4GB memory.
> ACPI SRAT reports 128 possible CPUs and 128 possible NUMA nodes."
> 
> Note that that discussion is about hotplugging CPUs to memory-less,
> hotplugged nodes.
> 
> But there seems to be some way to expose possible NUMA nodes. Maybe
> that's via GENERIC_AFFINITY.
>
David Hildenbrand Nov. 17, 2021, 6:08 p.m. UTC | #12
On 17.11.21 15:30, Jonathan Cameron wrote:
> On Tue, 16 Nov 2021 12:11:29 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>>>>
>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>> like that yet, but it might just be a matter of time).  
>>>
>>> I suppose some of that maybe covered by GENERIC_AFFINITY entries in SRAT
>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>> hotplug memory.
>>>   
> 

Hi Jonathan,

> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
> to Generic Initiator Affinity.  So no good for memory. This is meant for
> representation of accelerators / network cards etc so you can get the NUMA
> characteristics for them accessing Memory in other nodes.
> 
> My understanding of 'traditional' memory hotplug is that typically the
> PA into which memory is hotplugged is known at boot time whether or not
> the memory is physically present.  As such, you present that in SRAT and rely
> on the EFI memory map / other information sources to know the memory isn't
> there.  When it is hotplugged later the address is looked up in SRAT to identify
> the NUMA node.

In virtualized environments we use the SRAT only to indicate the hotpluggable
region (-> indicate the maximum possible PFN to the guest OS); the actual present
memory+PXM assignment is not done via SRAT. We differ quite a lot here from
actual hardware, I think.

> 
> That model is less useful for more flexible entities like virtio-mem or
> indeed physical hardware such as CXL type 3 memory devices which typically
> need their own nodes.
> 
> For the CXL type 3 option, currently proposal is to use the CXL table entries
> representing Physical Address space regions to work out how many NUMA nodes
> are needed and just create extra ones at boot.
> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
> 
> It's a heuristic as we might need more nodes to represent things well kernel
> side, but it's better than nothing and less effort that true dynamic node creation.
> If you chase through the earlier versions of Alison's patch you will find some
> discussion of that.
> 
> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
> 
> That would make all this stuff discoverable via PCI config space rather than ACPI
> CDAT is at:
> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> (nothing stops others using it though AFAIK).
> 
> However, then we'd actually need either dynamic node creation in the OS, or
> some sort of reserved pool of extra nodes.  Long term it may be the most
> flexible option.


I think for virtio-mem it's actually a bit simpler:

a) The user defined an empty node on the QEMU cmdline.
b) The user assigned a virtio-mem device to a node, either when
   coldplugging or hotplugging the device.

So we don't actually "hotplug" a new node, the (possible) node is already known
to QEMU right when starting up. It's just a matter of exposing that fact to the
guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
It seems to boil down to an ACPI limitation.

Conceptually, virtio-mem on an empty node in QEMU is not that different from
hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
an empty node. But I guess it all just doesn't work with QEMU as of now.


In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via

	build_srat_memory(table_data, machine->device_memory->base,
			  hotpluggable_address_space_size, nb_numa_nodes - 1,
			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

So we tell the guest OS "this range is hotpluggable" and "it belongs to
this node unless the device says something different". From both values we
can -- when under QEMU -- conclude the maximum possible PFN and the maximum
possible node. But the latter is not what Linux does: it simply maps the
PXM indicated in the memory entry to a Linux node
(-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).


I do wonder if we could simply expose the same hotpluggable range via multiple nodes:

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index a3ad6abd33..6c0ab442ea 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2084,6 +2084,22 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
      * providing _PXM method if necessary.
      */
     if (hotpluggable_address_space_size) {
+        /*
+         * For the guest to "know" about possible nodes, we'll indicate the
+         * same hotpluggable region to all empty nodes.
+         */
+        for (i = 0; i < nb_numa_nodes - 1; i++) {
+            if (machine->numa_state->nodes[i].node_mem > 0) {
+                continue;
+            }
+            build_srat_memory(table_data, machine->device_memory->base,
+                              hotpluggable_address_space_size, i,
+                              MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
+        }
+        /*
+         * Historically, we always indicated all hotpluggable memory to the
+         * last node -- if it was empty or not.
+         */
         build_srat_memory(table_data, machine->device_memory->base,
                           hotpluggable_address_space_size, nb_numa_nodes - 1,
                           MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);


Of course, this won't make CPU hotplug to empty nodes happy if we don't have
memory hotplug enabled for a VM. I did not check in detail whether that is valid
according to ACPI -- Linux might eat it (did not try yet, though).
David Hildenbrand Nov. 17, 2021, 6:26 p.m. UTC | #13
On 12.11.21 14:27, Igor Mammedov wrote:
> On Wed, 10 Nov 2021 12:01:11 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 10.11.21 11:33, Igor Mammedov wrote:
>>> On Fri, 5 Nov 2021 23:47:37 +1100
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>   
>>>> Hi Drew and Igor,
>>>>
>>>> On 11/2/21 6:39 PM, Andrew Jones wrote:  
>>>>> On Tue, Nov 02, 2021 at 10:44:08AM +1100, Gavin Shan wrote:    
>>>>>>
>>>>>> Yeah, I agree. I don't have strong sense to expose these empty nodes
>>>>>> for now. Please ignore the patch.
>>>>>>    
>>>>>
>>>>> So were describing empty numa nodes on the command line ever a reasonable
>>>>> thing to do? What happens on x86 machine types when describing empty numa
>>>>> nodes? I'm starting to think that the solution all along was just to
>>>>> error out when a numa node has memory size = 0...  
>>>
>>> memory less nodes are fine as long as there is another type of device
>>> that describes  a node (apic/gic/...).
>>> But there is no way in spec to describe completely empty nodes,
>>> and I dislike adding out of spec entries just to fake an empty node.
>>>   
>>
>> There are reasonable *upcoming* use cases for initially completely empty
>> NUMA nodes with virtio-mem: being able to expose a dynamic amount of
>> performance-differentiated memory to a VM. I don't know of any existing
>> use cases that would require that as of now.
>>
>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>> this memory is exposed via cpu-less, special nodes. In contrast to real
>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>> like that yet, but it might just be a matter of time).
> 
> I suppose some of that maybe covered by GENERIC_AFFINITY entries in SRAT
> some by MEMORY entries. Or nodes created dynamically like with normal
> hotplug memory.
> 
> 
>> The same should be true when using DIMMs instead of virtio-mem in this
>> example.
>>
>>>   
>>>> Sorry for the delay as I spent a few days looking into linux virtio-mem
>>>> driver. I'm afraid we still need this patch for ARM64. I don't think x86  
>>>
>>> does it behave the same way is using pc-dimm hotplug instead of virtio-mem?
>>>
>>> CCing David
>>> as it might be virtio-mem issue.  
>>
>> Can someone share the details why it's a problem on arm64 but not on
>> x86-64? I assume this really only applies when having a dedicated, empty
>> node -- correct?
>>
>>>
>>> PS:
>>> maybe for virtio-mem-pci, we need to add GENERIC_AFFINITY entry into SRAT
>>> and describe it as PCI device (we don't do that yet if I'm no mistaken).  
>>
>> virtio-mem exposes the PXM itself, and avoids exposing it memory via any
>> kind of platform specific firmware maps. The PXM gets translated in the
>> guest accordingly. For now there was no need to expose this in SRAT --
>> the SRAT is really only used to expose the maximum possible PFN to the
>> VM, just like it would have to be used to expose "this is a possible node".
>>
>> Of course, we could use any other paravirtualized interface to expose
>> both information. For example, on s390x, I'll have to introduce a new
>> hypercall to query the "device memory region" to detect the maximum
>> possible PFN, because existing interfaces don't allow for that. For now
>> we're reusing SRAT to expose "maximum possible PFN" simply because it's
>> easy to re-use.
>>
>> But I assume that hotplugging a DIMM to an empty node will have similar
>> issues on arm64.
>>
>>>   
>>>> has this issue even though I didn't experiment on X86. For example, I
>>>> have the following command lines. The hot added memory is put into node#0
>>>> instead of node#2, which is wrong.  
>>
>> I assume Linux will always fallback to node 0 if node X is not possible
>> when translating the PXM.
> 
> I tested how x86 behaves, with pc-dimm, and it seems that
> fc43 guest works only sometimes.
> cmd:
>   -numa node,memdev=mem,cpus=0 -numa node,cpus=1 -numa node -numa node
> 
> 1: hotplug into the empty last node creates a new node dynamically 
> 2: hotplug into intermediate empty node (last-1) is broken, memory goes into the first node

See my other reply: Reason is that we (QEMU) indicate all hotpluggable
memory as belonging to the last NUMA node. When processing that SRAT
entry, Linux maps that PXM to an actual node.
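
The PXM translation described above can be illustrated with a toy model (simplified, with invented names -- the real logic lives in drivers/acpi/numa/srat.c): node ids are handed out in the order PXMs are first seen in SRAT, so a PXM that never appeared at boot has no node, and hotplugged memory falls back to node 0:

```c
/* Toy model of Linux's PXM -> node translation. Simplified and with
 * illustrative names only; not the actual kernel code. */
#include <assert.h>

#define MAX_PXM      16
#define NUMA_NO_NODE (-1)

static int pxm_to_node_map[MAX_PXM];
static int next_node;

static void srat_init(void)
{
    for (int i = 0; i < MAX_PXM; i++)
        pxm_to_node_map[i] = NUMA_NO_NODE;
    next_node = 0;
}

/* Called for each SRAT affinity entry parsed at boot: node ids are
 * handed out in first-seen order of PXMs. */
static int acpi_map_pxm_to_node(int pxm)
{
    if (pxm_to_node_map[pxm] == NUMA_NO_NODE)
        pxm_to_node_map[pxm] = next_node++;
    return pxm_to_node_map[pxm];
}

/* At hotplug time, a PXM that never showed up in SRAT has no node;
 * callers then typically fall back to node 0. */
static int hotplug_target_node(int pxm)
{
    int node = pxm_to_node_map[pxm];
    return node == NUMA_NO_NODE ? 0 : node;
}
```

With the cmdline above (memory on PXM 0, hotpluggable region attributed to the last PXM), hotplug into the last node resolves, while an intermediate empty node was never mapped and lands on node 0.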
Jonathan Cameron Nov. 18, 2021, 10:28 a.m. UTC | #14
On Wed, 17 Nov 2021 19:08:28 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 17.11.21 15:30, Jonathan Cameron wrote:
> > On Tue, 16 Nov 2021 12:11:29 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >>>>
> >>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
> >>>> this memory is exposed via cpu-less, special nodes. In contrast to real
> >>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
> >>>> like that yet, but it might just be a matter of time).    
> >>>
> >>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT
> >>> some by MEMORY entries. Or nodes created dynamically like with normal
> >>> hotplug memory.
> >>>     
> >   
> 
> Hi Jonathan,
> 
> > The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
> > to Generic Initiator Affinity.  So no good for memory. This is meant for
> > representation of accelerators / network cards etc so you can get the NUMA
> > characteristics for them accessing Memory in other nodes.
> > 
> > My understanding of 'traditional' memory hotplug is that typically the
> > PA into which memory is hotplugged is known at boot time whether or not
> > the memory is physically present.  As such, you present that in SRAT and rely
> > on the EFI memory map / other information sources to know the memory isn't
> > there.  When it is hotplugged later the address is looked up in SRAT to identify
> > the NUMA node.  
> 
> in virtualized environments we use the SRAT only to indicate the hotpluggable
> region (-> indicate maximum possible PFN to the guest OS), the actual present
> memory+PXM assignment is not done via SRAT. We differ quite a lot here from
> actual hardware I think.
> 
> > 
> > That model is less useful for more flexible entities like virtio-mem or
> > indeed physical hardware such as CXL type 3 memory devices which typically
> > need their own nodes.
> > 
> > For the CXL type 3 option, currently proposal is to use the CXL table entries
> > representing Physical Address space regions to work out how many NUMA nodes
> > are needed and just create extra ones at boot.
> > https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
> > 
> > It's a heuristic as we might need more nodes to represent things well kernel
>> side, but it's better than nothing and less effort than true dynamic node creation.
> > If you chase through the earlier versions of Alison's patch you will find some
> > discussion of that.
> > 
> > I wonder if virtio-mem should just grow a CDAT instance via a DOE?
> > 
> > That would make all this stuff discoverable via PCI config space rather than ACPI
> > CDAT is at:
> > https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> > (nothing stops others using it though AFAIK).
> > 
> > However, then we'd actually need either dynamic node creation in the OS, or
> > some sort of reserved pool of extra nodes.  Long term it may be the most
> > flexible option.  
> 
> 
> I think for virtio-mem it's actually a bit simpler:
> 
> a) The user defined on the QEMU cmdline an empty node
> b) The user assigned a virtio-mem device to a node, either when 
>    coldplugging or hotplugging the device.
> 
> So we don't actually "hotplug" a new node, the (possible) node is already known
> to QEMU right when starting up. It's just a matter of exposing that fact to the
> guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
>> It seems to boil down to an ACPI limitation.
> 
> Conceptually, virtio-mem on an empty node in QEMU is not that different from
> hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
> an empty node. But I guess it all just doesn't work with QEMU as of now.

A side distraction perhaps, but there is a code first acpi proposal to add
a 'softer' form of CPU hotplug 
https://bugzilla.tianocore.org/show_bug.cgi?id=3706

Whilst the reason for that proposal was for arm64 systems where there is no architected
physical hotplug, it might partly solve the empty node question for CPUs.  They won't
be empty, there will simply be CPUs that are marked as 'Online capable'.

> 
> In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via
> 
> 	build_srat_memory(table_data, machine->device_memory->base,
> 			  hotpluggable_address_space_size, nb_numa_nodes - 1,
> 			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> 
>> So we tell the guest OS "this range is hotpluggable" and "it belongs to
> this node unless the device says something different". From both values we
> can -- when under QEMU -- conclude the maximum possible PFN and the maximum
> possible node. But the latter is not what Linux does: it simply maps the last
> numa node (indicated in the memory entry) to a PXM
> (-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).
yeah.  There is nothing in ACPI that says there can't be holes in the node numbering
so Linux does a remapping as you point out.

> 
> 
> I do wonder if we could simply expose the same hotpluggable range via multiple nodes:

Fairly sure the answer to this is no.  You'd have to indicate different ranges and
then put the virtio-mem in the right one.  Now I can't actually find anywhere in the
ACPI spec that says that, but I'm 99% sure Linux won't like it, and I'm fairly sure if we
query it with ACPI folks the answer will be a no, you can't do that.


> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index a3ad6abd33..6c0ab442ea 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2084,6 +2084,22 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>       * providing _PXM method if necessary.
>       */
>      if (hotpluggable_address_space_size) {
> +        /*
> +         * For the guest to "know" about possible nodes, we'll indicate the
> +         * same hotpluggable region to all empty nodes.
> +         */
> +        for (i = 0; i < nb_numa_nodes - 1; i++) {
> +            if (machine->numa_state->nodes[i].node_mem > 0) {
> +                continue;
> +            }
> +            build_srat_memory(table_data, machine->device_memory->base,
> +                              hotpluggable_address_space_size, i,
> +                              MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> +        }
> +        /*
> +         * Historically, we always indicated all hotpluggable memory to the
> +         * last node -- if it was empty or not.
> +         */
>          build_srat_memory(table_data, machine->device_memory->base,
>                            hotpluggable_address_space_size, nb_numa_nodes - 1,
>                            MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> 
> 
> Of course, this won't make CPU hotplug to empty nodes happy if we don't have
> memory hotplug enabled for a VM. I did not check in detail if that is valid
> according to ACPI -- Linux might eat it (did not try yet, though).
> 
>
David Hildenbrand Nov. 18, 2021, 11:06 a.m. UTC | #15
On 18.11.21 11:28, Jonathan Cameron wrote:
> On Wed, 17 Nov 2021 19:08:28 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 17.11.21 15:30, Jonathan Cameron wrote:
>>> On Tue, 16 Nov 2021 12:11:29 +0100
>>> David Hildenbrand <david@redhat.com> wrote:
>>>   
>>>>>>
>>>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>>>> like that yet, but it might just be a matter of time).    
>>>>>
>>>>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT
>>>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>>>> hotplug memory.
>>>>>     
>>>   
>>
>> Hi Jonathan,
>>
>>> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
>>> to Generic Initiator Affinity.  So no good for memory. This is meant for
>>> representation of accelerators / network cards etc so you can get the NUMA
>>> characteristics for them accessing Memory in other nodes.
>>>
>>> My understanding of 'traditional' memory hotplug is that typically the
>>> PA into which memory is hotplugged is known at boot time whether or not
>>> the memory is physically present.  As such, you present that in SRAT and rely
>>> on the EFI memory map / other information sources to know the memory isn't
>>> there.  When it is hotplugged later the address is looked up in SRAT to identify
>>> the NUMA node.  
>>
>> in virtualized environments we use the SRAT only to indicate the hotpluggable
>> region (-> indicate maximum possible PFN to the guest OS), the actual present
>> memory+PXM assignment is not done via SRAT. We differ quite a lot here from
>> actual hardware I think.
>>
>>>
>>> That model is less useful for more flexible entities like virtio-mem or
>>> indeed physical hardware such as CXL type 3 memory devices which typically
>>> need their own nodes.
>>>
>>> For the CXL type 3 option, currently proposal is to use the CXL table entries
>>> representing Physical Address space regions to work out how many NUMA nodes
>>> are needed and just create extra ones at boot.
>>> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
>>>
>>> It's a heuristic as we might need more nodes to represent things well kernel
>>> side, but it's better than nothing and less effort than true dynamic node creation.
>>> If you chase through the earlier versions of Alison's patch you will find some
>>> discussion of that.
>>>
>>> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
>>>
>>> That would make all this stuff discoverable via PCI config space rather than ACPI
>>> CDAT is at:
>>> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
>>> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
>>> (nothing stops others using it though AFAIK).
>>>
>>> However, then we'd actually need either dynamic node creation in the OS, or
>>> some sort of reserved pool of extra nodes.  Long term it may be the most
>>> flexible option.  
>>
>>
>> I think for virtio-mem it's actually a bit simpler:
>>
>> a) The user defined on the QEMU cmdline an empty node
>> b) The user assigned a virtio-mem device to a node, either when 
>>    coldplugging or hotplugging the device.
>>
>> So we don't actually "hotplug" a new node, the (possible) node is already known
>> to QEMU right when starting up. It's just a matter of exposing that fact to the
>> guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
>> It seems to boil down to an ACPI limitation.
>>
>> Conceptually, virtio-mem on an empty node in QEMU is not that different from
>> hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
>> an empty node. But I guess it all just doesn't work with QEMU as of now.
> 
> A side distraction perhaps, but there is a code first acpi proposal to add
> a 'softer' form of CPU hotplug 
> https://bugzilla.tianocore.org/show_bug.cgi?id=3706
> 
> Whilst the reason for that proposal was for arm64 systems where there is no architected
> physical hotplug, it might partly solve the empty node question for CPUs.  They won't
> be empty, there will simply be CPUs that are marked as 'Online capable'.
> 
>>
>> In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via
>>
>> 	build_srat_memory(table_data, machine->device_memory->base,
>> 			  hotpluggable_address_space_size, nb_numa_nodes - 1,
>> 			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
>>
>> So we tell the guest OS "this range is hotpluggable" and "it belongs to
>> this node unless the device says something different". From both values we
>> can -- when under QEMU -- conclude the maximum possible PFN and the maximum
>> possible node. But the latter is not what Linux does: it simply maps the last
>> numa node (indicated in the memory entry) to a PXM
>> (-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).
> yeah.  There is nothing in ACPI that says there can't be holes in the node numbering
> so Linux does a remapping as you point out.
> 
>>
>>
>> I do wonder if we could simply expose the same hotpluggable range via multiple nodes:
> 
> Fairly sure the answer to this is no.  You'd have to indicate different ranges and
> then put the virtio-mem in the right one. 

And I repeat, this is in no way different to DIMMs/NVDIMMs. We cannot predict
the future when hotplugging DIMMS/NVDIMMs/virtio-mem/... to some node later. We only
have access to that information when coldplugging devices, but even a
hotunplug+hotplug can change that info. Whatever we expose via ACPI is moot
already and just a hint to the guest OS "maximum possible PFN".
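
For completeness, the "maximum possible PFN" hint is trivially derived from the SRAT memory affinity entries; a minimal sketch (structure and names invented for illustration, values matching the dmesg output below):

```c
/* Sketch: deriving the maximum possible PFN from SRAT memory affinity
 * entries -- the main thing the hotpluggable entry is used for in this
 * setup. Structure name is invented for illustration. */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

struct srat_mem_entry {
    uint64_t base;    /* start physical address */
    uint64_t length;  /* length in bytes */
};

static uint64_t max_possible_pfn(const struct srat_mem_entry *e, int n)
{
    uint64_t max_end = 0;

    for (int i = 0; i < n; i++) {
        uint64_t end = e[i].base + e[i].length; /* exclusive end */
        if (end > max_end)
            max_end = end;
    }
    return max_end >> PAGE_SHIFT;
}
```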

We've been abusing ACPI hotpluggable region for years for virt DIMM hotplug,
putting it to some fantasy node and having it just work with hotplug of
DIMMs/NVDIMMs. The only issue we have is empty nodes. We differ from real
HW already significantly (especially, never exposing DIMMs via e820 to
the guest, which I call a feature and not a bug).

> Now I can't actually find anywhere in the
> ACPI spec that says that, but I'm 99% sure Linux won't like it, and I'm fairly sure if we
> query it with ACPI folks the answer will be a no, you can't do that.  

I didn't find anything that contradicts it in the spec either. It's not really
specified what's allowed and what's not :)

FWIW, the code I shared works with 5.11.12-300.fc34.x86_64 inside the guest flawlessly.

#! /bin/bash
sudo build/qemu-system-x86_64 \
    --enable-kvm \
    -m 8G,maxmem=32G,slots=1 \
    -object memory-backend-memfd,id=mem,size=8G \
    -numa node,nodeid=0,memdev=mem,cpus=0-4 \
    -numa node,nodeid=1 -numa node,nodeid=2 \
    -numa node,nodeid=3 -numa node,nodeid=4 \
    -smp sockets=2,cores=4 \
    -nographic \
    -nodefaults \
    -net nic -net user \
    -chardev stdio,nosignal,id=serial \
    -hda Fedora-Cloud-Base-34-1.2.x86_64.qcow2 \
    -cdrom /home/dhildenb/git/cloud-init/cloud-init.iso \
    -device isa-serial,chardev=serial \
    -chardev socket,id=monitor,path=/var/tmp/mon_src,server,nowait \
    -mon chardev=monitor,mode=readline \
    -object memory-backend-memfd,id=mem0,size=8G \
    -device virtio-mem-pci,id=vmem0,memdev=mem0,node=1,requested-size=128M \
    -object memory-backend-memfd,id=mem1,size=8G \
    -device virtio-mem-pci,id=vmem1,memdev=mem1,node=2,requested-size=128M \
    -object memory-backend-memfd,id=mem2,size=8G \
    -device virtio-mem-pci,id=vmem2,memdev=mem2,node=3,requested-size=128M

[root@vm-0 ~]# dmesg | grep "ACPI: SRAT: Node"
[    0.009933] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.009939] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
[    0.009941] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x23fffffff]
[    0.009942] ACPI: SRAT: Node 1 PXM 1 [mem 0x240000000-0x87fffffff] hotplug
[    0.009944] ACPI: SRAT: Node 2 PXM 2 [mem 0x240000000-0x87fffffff] hotplug
[    0.009946] ACPI: SRAT: Node 3 PXM 3 [mem 0x240000000-0x87fffffff] hotplug
[    0.009947] ACPI: SRAT: Node 4 PXM 4 [mem 0x240000000-0x87fffffff] hotplug
[root@vm-0 ~]# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 7950 MB
node 0 free: 7692 MB
node 1 cpus:
node 1 size: 128 MB
node 1 free: 123 MB
node 2 cpus:
node 2 size: 128 MB
node 2 free: 127 MB
node 3 cpus:
node 3 size: 128 MB
node 3 free: 127 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10
Jonathan Cameron Nov. 18, 2021, 11:23 a.m. UTC | #16
On Thu, 18 Nov 2021 12:06:27 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 18.11.21 11:28, Jonathan Cameron wrote:
> > On Wed, 17 Nov 2021 19:08:28 +0100
> > David Hildenbrand <david@redhat.com> wrote:
> >   
> >> On 17.11.21 15:30, Jonathan Cameron wrote:  
> >>> On Tue, 16 Nov 2021 12:11:29 +0100
> >>> David Hildenbrand <david@redhat.com> wrote:
> >>>     
> >>>>>>
> >>>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
> >>>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
> >>>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
> >>>>>> like that yet, but it might just be a matter of time).      
> >>>>>
> >>>>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT
> >>>>> some by MEMORY entries. Or nodes created dynamically like with normal
> >>>>> hotplug memory.
> >>>>>       
> >>>     
> >>
> >> Hi Jonathan,
> >>  
> >>> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
> >>> to Generic Initiator Affinity.  So no good for memory. This is meant for
> >>> representation of accelerators / network cards etc so you can get the NUMA
> >>> characteristics for them accessing Memory in other nodes.
> >>>
> >>> My understanding of 'traditional' memory hotplug is that typically the
> >>> PA into which memory is hotplugged is known at boot time whether or not
> >>> the memory is physically present.  As such, you present that in SRAT and rely
> >>> on the EFI memory map / other information sources to know the memory isn't
> >>> there.  When it is hotplugged later the address is looked up in SRAT to identify
> >>> the NUMA node.    
> >>
> >> in virtualized environments we use the SRAT only to indicate the hotpluggable
> >> region (-> indicate maximum possible PFN to the guest OS), the actual present
> >> memory+PXM assignment is not done via SRAT. We differ quite a lot here from
> >> actual hardware I think.
> >>  
> >>>
> >>> That model is less useful for more flexible entities like virtio-mem or
> >>> indeed physical hardware such as CXL type 3 memory devices which typically
> >>> need their own nodes.
> >>>
> >>> For the CXL type 3 option, currently proposal is to use the CXL table entries
> >>> representing Physical Address space regions to work out how many NUMA nodes
> >>> are needed and just create extra ones at boot.
> >>> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
> >>>
> >>> It's a heuristic as we might need more nodes to represent things well kernel
> >>> side, but it's better than nothing and less effort than true dynamic node creation.
> >>> If you chase through the earlier versions of Alison's patch you will find some
> >>> discussion of that.
> >>>
> >>> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
> >>>
> >>> That would make all this stuff discoverable via PCI config space rather than ACPI
> >>> CDAT is at:
> >>> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> >>> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> >>> (nothing stops others using it though AFAIK).
> >>>
> >>> However, then we'd actually need either dynamic node creation in the OS, or
> >>> some sort of reserved pool of extra nodes.  Long term it may be the most
> >>> flexible option.    
> >>
> >>
> >> I think for virtio-mem it's actually a bit simpler:
> >>
> >> a) The user defined on the QEMU cmdline an empty node
> >> b) The user assigned a virtio-mem device to a node, either when 
> >>    coldplugging or hotplugging the device.
> >>
> >> So we don't actually "hotplug" a new node, the (possible) node is already known
> >> to QEMU right when starting up. It's just a matter of exposing that fact to the
> >> guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
> >> It seems to boil down to an ACPI limitation.
> >>
> >> Conceptually, virtio-mem on an empty node in QEMU is not that different from
> >> hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
> >> an empty node. But I guess it all just doesn't work with QEMU as of now.  
> > 
> > A side distraction perhaps, but there is a code first acpi proposal to add
> > a 'softer' form of CPU hotplug 
> > https://bugzilla.tianocore.org/show_bug.cgi?id=3706
> > 
> > Whilst the reason for that proposal was for arm64 systems where there is no architected
> > physical hotplug, it might partly solve the empty node question for CPUs.  They won't
> > be empty, there will simply be CPUs that are marked as 'Online capable'.
> >   
> >>
> >> In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via
> >>
> >> 	build_srat_memory(table_data, machine->device_memory->base,
> >> 			  hotpluggable_address_space_size, nb_numa_nodes - 1,
> >> 			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> >>
> >> So we tell the guest OS "this range is hotpluggable" and "it belongs to
> >> this node unless the device says something different". From both values we
> >> can -- when under QEMU -- conclude the maximum possible PFN and the maximum
> >> possible node. But the latter is not what Linux does: it simply maps the last
> >> numa node (indicated in the memory entry) to a PXM
> >> (-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).  
> > yeah.  There is nothing in ACPI that says there can't be holes in the node numbering
> > so Linux does a remapping as you point out.
> >   
> >>
> >>
> >> I do wonder if we could simply expose the same hotpluggable range via multiple nodes:  
> > 
> > Fairly sure the answer to this is no.  You'd have to indicate different ranges and
> > then put the virtio-mem in the right one.   
> 
> And I repeat, this is in no way different to DIMMs/NVDIMMs. We cannot predict
> the future when hotplugging DIMMS/NVDIMMs/virtio-mem/... to some node later. We only
> have access to that information when coldplugging devices, but even a
> hotunplug+hotplug can change that info. Whatever we expose via ACPI is moot
> already and just a hint to the guest OS "maximum possible PFN".

Sure, so the solution is a large non overlapping extra node for each node on the
underlying physical system.  It uses a lot of PA space, but I'm going to assume
the system isn't so big that PA space exhaustion is a problem?  For a sensible setup
those nodes would match the actual memory present on the underlying system.

For physical CCIX systems we did this with SRAT entries with XTB per node to match
what the host supported.  On our particular platform those PA ranges were well separated
from each other due to how the system routing worked, but the principle is the same.
Those supported a huge amount of memory being hotplugged.
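
What such non-overlapping per-node windows could look like on the QEMU side can be sketched as follows (hypothetical helper, not actual QEMU code -- the real thing would emit one build_srat_memory() call per window):

```c
/* Sketch of per-node, non-overlapping hotpluggable windows, as an
 * alternative to pointing every node at the same region. Hypothetical
 * helper; names and layout are invented for illustration. */
#include <assert.h>
#include <stdint.h>

struct srat_window {
    uint64_t base;
    uint64_t size;
    int node;
};

/* Split [base, base + size) into nb_nodes equal windows, each sized
 * down to a multiple of 'align' bytes (align must be a power of two).
 * Returns the number of windows written to 'out'. */
static int build_per_node_windows(uint64_t base, uint64_t size,
                                  int nb_nodes, uint64_t align,
                                  struct srat_window *out)
{
    uint64_t per_node = (size / nb_nodes) & ~(align - 1);

    for (int i = 0; i < nb_nodes; i++) {
        out[i].base = base + (uint64_t)i * per_node;
        out[i].size = per_node;
        out[i].node = i;
    }
    return nb_nodes;
}
```

The obvious cost is PA space: each node's window has to be big enough for the largest amount of memory that might ever land in that node.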

> 
> We've been abusing ACPI hotpluggable region for years for virt DIMM hotplug,
> putting it to some fantasy node and having it just work with hotplug of
> DIMMs/NVDIMMs. The only issue we have is empty nodes. We differ from real
> HW already significantly (especially, never exposing DIMMs via e820 to
> the guest, which I call a feature and not a bug).

Understood.
> 
> > Now I can't actually find anywhere in the
> > ACPI spec that says that, but I'm 99% sure Linux won't like it, and I'm fairly sure if we
> > query it with ACPI folks the answer will be a no, you can't do that.  
> 
> I didn't find anything that contradicts it in the spec either. It's not really
> specified what's allowed and what's not :)
> 
> FWIW, the code I shared works with 5.11.12-300.fc34.x86_64 inside the guest flawlessly.

Hmm. I'm surprised that works at all and my worry is there is no reason it will continue
to work.

Jonathan

 
> 
> #! /bin/bash
> sudo build/qemu-system-x86_64 \
>     --enable-kvm \
>     -m 8G,maxmem=32G,slots=1 \
>     -object memory-backend-memfd,id=mem,size=8G \
>     -numa node,nodeid=0,memdev=mem,cpus=0-4 \
>     -numa node,nodeid=1 -numa node,nodeid=2 \
>     -numa node,nodeid=3 -numa node,nodeid=4 \
>     -smp sockets=2,cores=4 \
>     -nographic \
>     -nodefaults \
>     -net nic -net user \
>     -chardev stdio,nosignal,id=serial \
>     -hda Fedora-Cloud-Base-34-1.2.x86_64.qcow2 \
>     -cdrom /home/dhildenb/git/cloud-init/cloud-init.iso \
>     -device isa-serial,chardev=serial \
>     -chardev socket,id=monitor,path=/var/tmp/mon_src,server,nowait \
>     -mon chardev=monitor,mode=readline \
>     -object memory-backend-memfd,id=mem0,size=8G \
>     -device virtio-mem-pci,id=vmem0,memdev=mem0,node=1,requested-size=128M \
>     -object memory-backend-memfd,id=mem1,size=8G \
>     -device virtio-mem-pci,id=vmem1,memdev=mem1,node=2,requested-size=128M \
>     -object memory-backend-memfd,id=mem2,size=8G \
>     -device virtio-mem-pci,id=vmem2,memdev=mem2,node=3,requested-size=128M
> 
> [root@vm-0 ~]# dmesg | grep "ACPI: SRAT: Node"
> [    0.009933] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [    0.009939] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> [    0.009941] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x23fffffff]
> [    0.009942] ACPI: SRAT: Node 1 PXM 1 [mem 0x240000000-0x87fffffff] hotplug
> [    0.009944] ACPI: SRAT: Node 2 PXM 2 [mem 0x240000000-0x87fffffff] hotplug
> [    0.009946] ACPI: SRAT: Node 3 PXM 3 [mem 0x240000000-0x87fffffff] hotplug
> [    0.009947] ACPI: SRAT: Node 4 PXM 4 [mem 0x240000000-0x87fffffff] hotplug
> [root@vm-0 ~]# numactl --hardware
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 6 7
> node 0 size: 7950 MB
> node 0 free: 7692 MB
> node 1 cpus:
> node 1 size: 128 MB
> node 1 free: 123 MB
> node 2 cpus:
> node 2 size: 128 MB
> node 2 free: 127 MB
> node 3 cpus:
> node 3 size: 128 MB
> node 3 free: 127 MB
> node distances:
> node   0   1   2   3 
>   0:  10  20  20  20 
>   1:  20  10  20  20 
>   2:  20  20  10  20 
>   3:  20  20  20  10 
> 
> 
>
Jonathan Cameron Nov. 19, 2021, 10:58 a.m. UTC | #17
On Thu, 18 Nov 2021 11:23:06 +0000
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Thu, 18 Nov 2021 12:06:27 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
> > On 18.11.21 11:28, Jonathan Cameron wrote:  
> > > On Wed, 17 Nov 2021 19:08:28 +0100
> > > David Hildenbrand <david@redhat.com> wrote:
> > >     
> > >> On 17.11.21 15:30, Jonathan Cameron wrote:    
> > >>> On Tue, 16 Nov 2021 12:11:29 +0100
> > >>> David Hildenbrand <david@redhat.com> wrote:
> > >>>       
> > >>>>>>
> > >>>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
> > >>>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
> > >>>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
> > >>>>>> like that yet, but it might just be a matter of time).        
> > >>>>>
> > >>>>> I suppose some of that may be covered by GENERIC_AFFINITY entries in SRAT
> > >>>>> some by MEMORY entries. Or nodes created dynamically like with normal
> > >>>>> hotplug memory.
> > >>>>>         
> > >>>       
> > >>
> > >> Hi Jonathan,
> > >>    
> > >>> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
> > >>> to Generic Initiator Affinity.  So no good for memory. This is meant for
> > >>> representation of accelerators / network cards etc so you can get the NUMA
> > >>> characteristics for them accessing Memory in other nodes.
> > >>>
> > >>> My understanding of 'traditional' memory hotplug is that typically the
> > >>> PA into which memory is hotplugged is known at boot time whether or not
> > >>> the memory is physically present.  As such, you present that in SRAT and rely
> > >>> on the EFI memory map / other information sources to know the memory isn't
> > >>> there.  When it is hotplugged later the address is looked up in SRAT to identify
> > >>> the NUMA node.      
> > >>
> > >> in virtualized environments we use the SRAT only to indicate the hotpluggable
> > >> region (-> indicate maximum possible PFN to the guest OS), the actual present
> > >> memory+PXM assignment is not done via SRAT. We differ quite a lot here from
> > >> actual hardware I think.
> > >>    
> > >>>
> > >>> That model is less useful for more flexible entities like virtio-mem or
> > >>> indeed physical hardware such as CXL type 3 memory devices which typically
> > >>> need their own nodes.
> > >>>
> > >>> For the CXL type 3 option, currently proposal is to use the CXL table entries
> > >>> representing Physical Address space regions to work out how many NUMA nodes
> > >>> are needed and just create extra ones at boot.
> > >>> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
> > >>>
> > >>> It's a heuristic as we might need more nodes to represent things well kernel
> > >>> side, but it's better than nothing and less effort than true dynamic node creation.
> > >>> If you chase through the earlier versions of Alison's patch you will find some
> > >>> discussion of that.
> > >>>
> > >>> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
> > >>>
> > >>> That would make all this stuff discoverable via PCI config space rather than ACPI
> > >>> CDAT is at:
> > >>> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> > >>> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> > >>> (nothing stops others using it though AFAIK).
> > >>>
> > >>> However, then we'd actually need either dynamic node creation in the OS, or
> > >>> some sort of reserved pool of extra nodes.  Long term it may be the most
> > >>> flexible option.      
> > >>
> > >>
> > >> I think for virtio-mem it's actually a bit simpler:
> > >>
> > >> a) The user defined on the QEMU cmdline an empty node
> > >> b) The user assigned a virtio-mem device to a node, either when 
> > >>    coldplugging or hotplugging the device.
> > >>
> > >> So we don't actually "hotplug" a new node, the (possible) node is already known
> > >> to QEMU right when starting up. It's just a matter of exposing that fact to the
> > >> guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
> > >> It seems to boil down to an ACPI limitation.
> > >>
> > >> Conceptually, virtio-mem on an empty node in QEMU is not that different from
> > >> hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
> > >> an empty node. But I guess it all just doesn't work with QEMU as of now.    
> > > 
> > > A side distraction perhaps, but there is a code first acpi proposal to add
> > > a 'softer' form of CPU hotplug 
> > > https://bugzilla.tianocore.org/show_bug.cgi?id=3706
> > > 
> > > Whilst the reason for that proposal was for arm64 systems where there is no architected
> > > physical hotplug, it might partly solve the empty node question for CPUs.  They won't
> > > be empty, there will simply be CPUs that are marked as 'Online capable'.
> > >     
> > >>
> > >> In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via
> > >>
> > >> 	build_srat_memory(table_data, machine->device_memory->base,
> > >> 			  hotpluggable_address_space_size, nb_numa_nodes - 1,
> > >> 			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> > >>
> > >> So we tell the guest OS "this range is hotpluggable" and "it belongs to
> > >> this node unless the device says something different". From both values we
> > >> can -- when under QEMU -- conclude the maximum possible PFN and the maximum
> > >> possible node. But the latter is not what Linux does: it simply maps the last
> > >> numa node (indicated in the memory entry) to a PXM
> > >> (-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).    
> > > yeah.  There is nothing in ACPI that says there can't be holes in the node numbering
> > > so Linux does a remapping as you point out.
> > >     
> > >>
> > >>
> > >> I do wonder if we could simply expose the same hotpluggable range via multiple nodes:    
> > > 
> > > Fairly sure the answer to this is no.  You'd have to indicate different ranges and
> > > then put the virtio-mem in the right one.     
> > 
> > And I repeat, this is in no way different to DIMMs/NVDIMMs. We cannot predict
> > the future when hotplugging DIMMS/NVDIMMs/virtio-mem/... to some node later. We only
> > have access to that information when coldplugging devices, but even a
> > hotunplug+hotplug can change that info. Whatever we expose via ACPI is moot
> > already and just a hint to the guest OS "maximum possible PFN".  
> 
> Sure, so the solution is a large non overlapping extra node for each node on the
> underlying physical system.  It uses a lot of PA space, but I'm going to assume
> the system isn't so big that that PA space exhaustion is a problem?  For a sensible setup
> those nodes would match the actual memory present on the underlying system.
> 
> For physical CCIX systems we did this with SRAT entries with XTB per node to match
> what the host supported.  On our particular platform those PA ranges were well separated
> from each other due to how the system routing worked, but the principle is the same.
> Those supported a huge amount of memory being hotplugged.
> 
> > 
> > We've been abusing ACPI hotpluggable region for years for virt DIMM hotplug,
> > putting it to some fantasy node and having it just work with hotplug of
> > DIMMs/NVDIMMs. The only issue we have is empty nodes. We differ from real
> > HW already significantly (especially, never exposing DIMMs via e820 to
> > the guest, which I call a feature and not a bug).  
> 
> Understood.
> >   
> > > Now I can't actually find anywhere in the
> > > ACPI spec that says that, but I'm 99% sure Linux won't like it, and I'm fairly sure if we
> > > query it with the ACPI folks the answer will be a "no, you can't do that".
> > 
> > I didn't find anything that contradicts it in the spec as well. It's not really
> > specified what's allowed and what's not :)
> > 
> > FWIW, the code I shared works with 5.11.12-300.fc34.x86_64 inside the guest flawlessly.  
> 
> Hmm. I'm surprised that works at all and my worry is there is no reason it will continue
> to work.

I've checked with some of our firmware people and the response was very much against doing this
on the basis it makes no sense in any physical system to have overlapping regions.

I'll reach out to our ASWG representatives to see if we can get the ACPI spec clarified.
(Given question is from a public mailing list this should be under the code first policy).

My view is that a clarification should be added to state that these regions must not overlap.

Jonathan

> 
> Jonathan
> 
>  
> > 
> > #! /bin/bash
> > sudo build/qemu-system-x86_64 \
> >     --enable-kvm \
> >     -m 8G,maxmem=32G,slots=1 \
> >     -object memory-backend-memfd,id=mem,size=8G \
> >     -numa node,nodeid=0,memdev=mem,cpus=0-4 \
> >     -numa node,nodeid=1 -numa node,nodeid=2 \
> >     -numa node,nodeid=3 -numa node,nodeid=4 \
> >     -smp sockets=2,cores=4 \
> >     -nographic \
> >     -nodefaults \
> >     -net nic -net user \
> >     -chardev stdio,nosignal,id=serial \
> >     -hda Fedora-Cloud-Base-34-1.2.x86_64.qcow2 \
> >     -cdrom /home/dhildenb/git/cloud-init/cloud-init.iso \
> >     -device isa-serial,chardev=serial \
> >     -chardev socket,id=monitor,path=/var/tmp/mon_src,server,nowait \
> >     -mon chardev=monitor,mode=readline \
> >     -object memory-backend-memfd,id=mem0,size=8G \
> >     -device virtio-mem-pci,id=vmem0,memdev=mem0,node=1,requested-size=128M \
> >     -object memory-backend-memfd,id=mem1,size=8G \
> >     -device virtio-mem-pci,id=vmem1,memdev=mem1,node=2,requested-size=128M \
> >     -object memory-backend-memfd,id=mem2,size=8G \
> >     -device virtio-mem-pci,id=vmem2,memdev=mem2,node=3,requested-size=128M
> > 
> > [root@vm-0 ~]# dmesg | grep "ACPI: SRAT: Node"
> > [    0.009933] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > [    0.009939] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
> > [    0.009941] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x23fffffff]
> > [    0.009942] ACPI: SRAT: Node 1 PXM 1 [mem 0x240000000-0x87fffffff] hotplug
> > [    0.009944] ACPI: SRAT: Node 2 PXM 2 [mem 0x240000000-0x87fffffff] hotplug
> > [    0.009946] ACPI: SRAT: Node 3 PXM 3 [mem 0x240000000-0x87fffffff] hotplug
> > [    0.009947] ACPI: SRAT: Node 4 PXM 4 [mem 0x240000000-0x87fffffff] hotplug
> > [root@vm-0 ~]# numactl --hardware
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1 2 3 4 5 6 7
> > node 0 size: 7950 MB
> > node 0 free: 7692 MB
> > node 1 cpus:
> > node 1 size: 128 MB
> > node 1 free: 123 MB
> > node 2 cpus:
> > node 2 size: 128 MB
> > node 2 free: 127 MB
> > node 3 cpus:
> > node 3 size: 128 MB
> > node 3 free: 127 MB
> > node distances:
> > node   0   1   2   3 
> >   0:  10  20  20  20 
> >   1:  20  10  20  20 
> >   2:  20  20  10  20 
> >   3:  20  20  20  10 
> > 
> > 
> >   
>
David Hildenbrand Nov. 19, 2021, 11:33 a.m. UTC | #18
On 19.11.21 11:58, Jonathan Cameron wrote:
> On Thu, 18 Nov 2021 11:23:06 +0000
> Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> 
>> On Thu, 18 Nov 2021 12:06:27 +0100
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> On 18.11.21 11:28, Jonathan Cameron wrote:  
>>>> On Wed, 17 Nov 2021 19:08:28 +0100
>>>> David Hildenbrand <david@redhat.com> wrote:
>>>>     
>>>>> On 17.11.21 15:30, Jonathan Cameron wrote:    
>>>>>> On Tue, 16 Nov 2021 12:11:29 +0100
>>>>>> David Hildenbrand <david@redhat.com> wrote:
>>>>>>       
>>>>>>>>>
>>>>>>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
>>>>>>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
>>>>>>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
>>>>>>>>> like that yet, but it might just be a matter of time).        
>>>>>>>>
>>>>>>>> I suppose some of that maybe covered by GENERIC_AFFINITY entries in SRAT
>>>>>>>> some by MEMORY entries. Or nodes created dynamically like with normal
>>>>>>>> hotplug memory.
>>>>>>>>         
>>>>>>       
>>>>>
>>>>> Hi Jonathan,
>>>>>    
>>>>>> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
>>>>>> to Generic Initiator Affinity.  So no good for memory. This is meant for
>>>>>> representation of accelerators / network cards etc so you can get the NUMA
>>>>>> characteristics for them accessing Memory in other nodes.
>>>>>>
>>>>>> My understanding of 'traditional' memory hotplug is that typically the
>>>>>> PA into which memory is hotplugged is known at boot time whether or not
>>>>>> the memory is physically present.  As such, you present that in SRAT and rely
>>>>>> on the EFI memory map / other information sources to know the memory isn't
>>>>>> there.  When it is hotplugged later the address is looked up in SRAT to identify
>>>>>> the NUMA node.      
>>>>>
>>>>> in virtualized environments we use the SRAT only to indicate the hotpluggable
>>>>> region (-> indicate maximum possible PFN to the guest OS), the actual present
>>>>> memory+PXM assignment is not done via SRAT. We differ quite a lot here from
>>>>> actual hardware I think.
>>>>>    
>>>>>>
>>>>>> That model is less useful for more flexible entities like virtio-mem or
>>>>>> indeed physical hardware such as CXL type 3 memory devices which typically
>>>>>> need their own nodes.
>>>>>>
>>>>>> For the CXL type 3 option, currently proposal is to use the CXL table entries
>>>>>> representing Physical Address space regions to work out how many NUMA nodes
>>>>>> are needed and just create extra ones at boot.
>>>>>> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
>>>>>>
>>>>>> It's a heuristic as we might need more nodes to represent things well kernel
>>>>>> side, but it's better than nothing and less effort than true dynamic node creation.
>>>>>> If you chase through the earlier versions of Alison's patch you will find some
>>>>>> discussion of that.
>>>>>>
>>>>>> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
>>>>>>
>>>>>> That would make all this stuff discoverable via PCI config space rather than ACPI
>>>>>> CDAT is at:
>>>>>> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
>>>>>> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
>>>>>> (nothing stops others using it though AFAIK).
>>>>>>
>>>>>> However, then we'd actually need either dynamic node creation in the OS, or
>>>>>> some sort of reserved pool of extra nodes.  Long term it may be the most
>>>>>> flexible option.      
>>>>>
>>>>>
>>>>> I think for virtio-mem it's actually a bit simpler:
>>>>>
>>>>> a) The user defined on the QEMU cmdline an empty node
>>>>> b) The user assigned a virtio-mem device to a node, either when 
>>>>>    coldplugging or hotplugging the device.
>>>>>
>>>>> So we don't actually "hotplug" a new node, the (possible) node is already known
>>>>> to QEMU right when starting up. It's just a matter of exposing that fact to the
>>>>> guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
>>>>> It seems to boil down to an ACPI limitation.
>>>>>
>>>>> Conceptually, virtio-mem on an empty node in QEMU is not that different from
>>>>> hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
>>>>> an empty node. But I guess it all just doesn't work with QEMU as of now.    
>>>>
>>>> A side distraction perhaps, but there is a code first acpi proposal to add
>>>> a 'softer' form of CPU hotplug 
>>>> https://bugzilla.tianocore.org/show_bug.cgi?id=3706
>>>>
>>>> Whilst the reason for that proposal was for arm64 systems where there is no architected
>>>> physical hotplug, it might partly solve the empty node question for CPUs.  They won't
>>>> be empty, there will simply be CPUs that are marked as 'Online capable'.
>>>>     
>>>>>
>>>>> In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via
>>>>>
>>>>> 	build_srat_memory(table_data, machine->device_memory->base,
>>>>> 			  hotpluggable_address_space_size, nb_numa_nodes - 1,
>>>>> 			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
>>>>>
>>>>> So we tell the guest OS "this range is hotpluggable" and "it belongs to
>>>>> this node unless the device says something different". From both values we
>>>>> can -- when under QEMU -- conclude the maximum possible PFN and the maximum
>>>>> possible node. But the latter is not what Linux does: it simply maps the last
>>>>> numa node (indicated in the memory entry) to a PXM
>>>>> (-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).    
>>>> yeah.  There is nothing in ACPI that says there can't be holes in the node numbering
>>>> so Linux does a remapping as you point out.
>>>>     
>>>>>
>>>>>
>>>>> I do wonder if we could simply expose the same hotpluggable range via multiple nodes:    
>>>>
>>>> Fairly sure the answer to this is no.  You'd have to indicate different ranges and
>>>> then put the virtio-mem in the right one.     
>>>
>>> And I repeat, this is in no way different to DIMMs/NVDIMMs. We cannot predict
>>> the future when hotplugging DIMMS/NVDIMMs/virtio-mem/... to some node later. We only
>>> have access to that information when coldplugging devices, but even a
>>> hotunplug+hotplug can change that info. Whatever we expose via ACPI is moot
>>> already and just a hint to the guest OS "maximum possible PFN".  
>>
>> Sure, so the solution is a large non overlapping extra node for each node on the
>> underlying physical system.  It uses a lot of PA space, but I'm going to assume
>> the system isn't so big that that PA space exhaustion is a problem?  For a sensible setup
>> those nodes would match the actual memory present on the underlying system.
>>
>> For physical CCIX systems we did this with SRAT entries with XTB per node to match
>> what the host supported.  On our particular platform those PA ranges were well separated
>> from each other due to how the system routing worked, but the principle is the same.
>> Those supported a huge amount of memory being hotplugged.
>>
>>>
>>> We've been abusing ACPI hotpluggable region for years for virt DIMM hotplug,
>>> putting it to some fantasy node and having it just work with hotplug of
>>> DIMMs/NVDIMMs. The only issue we have is empty nodes. We differ from real
>>> HW already significantly (especially, never exposing DIMMs via e820 to
>>> the guest, which I call a feature and not a bug).  
>>
>> Understood.
>>>   
>>>> Now I can't actually find anywhere in the
>>>> ACPI spec that says that, but I'm 99% sure Linux won't like it, and I'm fairly sure if we
>>>> query it with the ACPI folks the answer will be a "no, you can't do that".
>>>
>>> I didn't find anything that contradicts it in the spec as well. It's not really
>>> specified what's allowed and what's not :)
>>>
>>> FWIW, the code I shared works with 5.11.12-300.fc34.x86_64 inside the guest flawlessly.  
>>
>> Hmm. I'm surprised that works at all and my worry is there is no reason it will continue
>> to work.
> 
> I've checked with some of our firmware people and the response was very much against doing this
> on the basis it makes no sense in any physical system to have overlapping regions.
> 
> I'll reach out to our ASWG representatives to see if we can get the ACPI spec clarified.
> (Given question is from a public mailing list this should be under the code first policy).
> 
> My view is that a clarification should be added to state that these regions must not overlap.

I'd really appreciate if we could instead have something that makes virt
happy as well ("makes no sense in any physical system"), because virt is
most probably the biggest actual consumer of ACPI memory hotplug out
there (!).

I mean, for virt as is we will never know which PA range will belong to
which node upfront. All we know is that there is a PA range that could
belong to node X-Z. Gluing a single range to a single node doesn't make
too much sense for virt, which is why we have just been using it to
indicate the maximum possible PFN with a fantasy node.

Overlapping regions would really simplify the whole thing, and I think
if we go down that path we should go one step further and indicate the
hotpluggable region to all nodes that might see hotplug (QEMU -> all
nodes). The ACPI clarification would then be that we can have
overlapping ranges and that on overlapping ranges all indicated nodes
would be a possible target later. That would make perfect sense to me
and make both phys and virt happy.
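The overlapping-entries proposal can be made concrete: emit one SRAT Memory Affinity entry per possible hotplug node, all covering the same PA window. A minimal sketch (not QEMU code; the node list and the window are invented to match the dmesg output quoted earlier in the thread, and the 40-byte field layout follows the ACPI SRAT Memory Affinity Structure):

```python
import struct

MEM_AFFINITY_ENABLED = 1 << 0       # SRAT memory entry Flags, bit 0
MEM_AFFINITY_HOTPLUGGABLE = 1 << 1  # SRAT memory entry Flags, bit 1

def srat_memory_entry(pxm, base, size, flags):
    """Pack one SRAT Memory Affinity Structure (type 1, 40 bytes)."""
    return struct.pack(
        "<BBIHIIIIIIQ",
        1, 40, pxm, 0,                  # type, length, proximity domain, reserved
        base & 0xFFFFFFFF, base >> 32,  # base address low/high
        size & 0xFFFFFFFF, size >> 32,  # length low/high
        0,                              # reserved
        flags,
        0)                              # reserved

# Same hotpluggable window, advertised once per possible hotplug node (1..4),
# mirroring the "Node 1..4 PXM 1..4 [mem 0x240000000-0x87fffffff] hotplug" lines.
hotplug_base, hotplug_size = 0x240000000, 0x640000000
entries = [srat_memory_entry(n, hotplug_base, hotplug_size,
                             MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE)
           for n in range(1, 5)]
```

Under the proposed clarification, a guest parsing these entries would treat every listed node as a possible target for later hotplug into that window.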



Two ways to avoid overlapping regions, which aren't much better:

1) Split the hotpluggable region up into fantasy regions and assign one
fantasy region to each actual node.

The fantasy regions will have nothing to do with reality later (just like
what we have right now with the last node getting assigned the whole
hotpluggable region) and devices might overlap, but we don't really
care, because the devices expose the actual node themselves.
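Option 1 can be sketched as a simple carve-up (the helper name is hypothetical; the example window is the 25 GiB hotpluggable range from the SRAT output quoted earlier):

```python
def split_hotplug_region(base, size, nodes):
    """Carve one hotpluggable window into equal, non-overlapping fantasy
    sub-ranges, one per node that may see hotplug later (option 1)."""
    per_node = size // len(nodes)
    return {node: (base + i * per_node, per_node)
            for i, node in enumerate(nodes)}

# The window 0x240000000-0x87fffffff, split across the four possible nodes.
regions = split_hotplug_region(0x240000000, 0x640000000, [1, 2, 3, 4])
for node, (start, length) in regions.items():
    print(f"node {node}: [{start:#x}, {start + length - 1:#x}]")
```

The sub-ranges keep SRAT non-overlapping, but as the text above says they are pure fantasy: the device that is eventually hotplugged announces its real node itself.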


2) Duplicate the hotpluggable region across all nodes

We would have one hotpluggable region with a dedicated PA space, and
hotplug the device into the respective node's PA space.

That can be problematic, though, as we can easily run out of PA space.
For example, my Ryzen 9 cannot address anything above 1 TiB. So if we'd
have a hotpluggable region of 256 GiB, we'll already be in trouble with
more than 3 nodes.
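The exhaustion in option 2 is easy to put into numbers; a sketch with assumed values (10 GiB of RAM mapped below the hotpluggable windows, and the 1 TiB addressing limit from the Ryzen 9 example):

```python
GIB = 1 << 30
PA_LIMIT = 1 << 40        # 1 TiB addressing limit, as in the Ryzen 9 example
BOOT_RAM = 10 * GIB       # RAM mapped below the hotpluggable windows (assumed)
REGION = 256 * GIB        # the hotpluggable window, duplicated per node

def pa_required(num_nodes):
    """Top of PA space when every node gets its own copy of the window."""
    return BOOT_RAM + num_nodes * REGION

for nodes in range(1, 6):
    verdict = "fits" if pa_required(nodes) <= PA_LIMIT else "exceeds 1 TiB"
    print(f"{nodes} node(s): top of PA space {pa_required(nodes) // GIB} GiB -> {verdict}")
```

With these assumptions, three duplicated 256 GiB windows still fit below 1 TiB, and a fourth does not, matching the "in trouble with more than 3 nodes" observation above.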
Jonathan Cameron Nov. 19, 2021, 5:26 p.m. UTC | #19
On Fri, 19 Nov 2021 12:33:27 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 19.11.21 11:58, Jonathan Cameron wrote:
> > On Thu, 18 Nov 2021 11:23:06 +0000
> > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:
> >   
> >> On Thu, 18 Nov 2021 12:06:27 +0100
> >> David Hildenbrand <david@redhat.com> wrote:
> >>  
> >>> On 18.11.21 11:28, Jonathan Cameron wrote:    
> >>>> On Wed, 17 Nov 2021 19:08:28 +0100
> >>>> David Hildenbrand <david@redhat.com> wrote:
> >>>>       
> >>>>> On 17.11.21 15:30, Jonathan Cameron wrote:      
> >>>>>> On Tue, 16 Nov 2021 12:11:29 +0100
> >>>>>> David Hildenbrand <david@redhat.com> wrote:
> >>>>>>         
> >>>>>>>>>
> >>>>>>>>> Examples include exposing HBM or PMEM to the VM. Just like on real HW,
> >>>>>>>>> this memory is exposed via cpu-less, special nodes. In contrast to real
> >>>>>>>>> HW, the memory is hotplugged later (I don't think HW supports hotplug
> >>>>>>>>> like that yet, but it might just be a matter of time).          
> >>>>>>>>
> >>>>>>>> I suppose some of that maybe covered by GENERIC_AFFINITY entries in SRAT
> >>>>>>>> some by MEMORY entries. Or nodes created dynamically like with normal
> >>>>>>>> hotplug memory.
> >>>>>>>>           
> >>>>>>         
> >>>>>
> >>>>> Hi Jonathan,
> >>>>>      
> >>>>>> The naming of the define is unhelpful.  GENERIC_AFFINITY here corresponds
> >>>>>> to Generic Initiator Affinity.  So no good for memory. This is meant for
> >>>>>> representation of accelerators / network cards etc so you can get the NUMA
> >>>>>> characteristics for them accessing Memory in other nodes.
> >>>>>>
> >>>>>> My understanding of 'traditional' memory hotplug is that typically the
> >>>>>> PA into which memory is hotplugged is known at boot time whether or not
> >>>>>> the memory is physically present.  As such, you present that in SRAT and rely
> >>>>>> on the EFI memory map / other information sources to know the memory isn't
> >>>>>> there.  When it is hotplugged later the address is looked up in SRAT to identify
> >>>>>> the NUMA node.        
> >>>>>
> >>>>> in virtualized environments we use the SRAT only to indicate the hotpluggable
> >>>>> region (-> indicate maximum possible PFN to the guest OS), the actual present
> >>>>> memory+PXM assignment is not done via SRAT. We differ quite a lot here from
> >>>>> actual hardware I think.
> >>>>>      
> >>>>>>
> >>>>>> That model is less useful for more flexible entities like virtio-mem or
> >>>>>> indeed physical hardware such as CXL type 3 memory devices which typically
> >>>>>> need their own nodes.
> >>>>>>
> >>>>>> For the CXL type 3 option, currently proposal is to use the CXL table entries
> >>>>>> representing Physical Address space regions to work out how many NUMA nodes
> >>>>>> are needed and just create extra ones at boot.
> >>>>>> https://lore.kernel.org/linux-cxl/163553711933.2509508.2203471175679990.stgit@dwillia2-desk3.amr.corp.intel.com
> >>>>>>
> >>>>>> It's a heuristic as we might need more nodes to represent things well kernel
> >>>>>> side, but it's better than nothing and less effort than true dynamic node creation.
> >>>>>> If you chase through the earlier versions of Alison's patch you will find some
> >>>>>> discussion of that.
> >>>>>>
> >>>>>> I wonder if virtio-mem should just grow a CDAT instance via a DOE?
> >>>>>>
> >>>>>> That would make all this stuff discoverable via PCI config space rather than ACPI
> >>>>>> CDAT is at:
> >>>>>> https://uefi.org/sites/default/files/resources/Coherent%20Device%20Attribute%20Table_1.01.pdf
> >>>>>> but the table access protocol over PCI DOE is currently in the CXL 2.0 spec
> >>>>>> (nothing stops others using it though AFAIK).
> >>>>>>
> >>>>>> However, then we'd actually need either dynamic node creation in the OS, or
> >>>>>> some sort of reserved pool of extra nodes.  Long term it may be the most
> >>>>>> flexible option.        
> >>>>>
> >>>>>
> >>>>> I think for virtio-mem it's actually a bit simpler:
> >>>>>
> >>>>> a) The user defined on the QEMU cmdline an empty node
> >>>>> b) The user assigned a virtio-mem device to a node, either when 
> >>>>>    coldplugging or hotplugging the device.
> >>>>>
> >>>>> So we don't actually "hotplug" a new node, the (possible) node is already known
> >>>>> to QEMU right when starting up. It's just a matter of exposing that fact to the
> >>>>> guest OS -- similar to how we expose the maximum possible PFN to the guest OS.
> >>>>> It seems to boil down to an ACPI limitation.
> >>>>>
> >>>>> Conceptually, virtio-mem on an empty node in QEMU is not that different from
> >>>>> hot/coldplugging a CPU to an empty node or hot/coldplugging a DIMM/NVDIMM to
> >>>>> an empty node. But I guess it all just doesn't work with QEMU as of now.      
> >>>>
> >>>> A side distraction perhaps, but there is a code first acpi proposal to add
> >>>> a 'softer' form of CPU hotplug 
> >>>> https://bugzilla.tianocore.org/show_bug.cgi?id=3706
> >>>>
> >>>> Whilst the reason for that proposal was for arm64 systems where there is no architected
> >>>> physical hotplug, it might partly solve the empty node question for CPUs.  They won't
> >>>> be empty, there will simply be CPUs that are marked as 'Online capable'.
> >>>>       
> >>>>>
> >>>>> In current x86-64 code, we define the "hotpluggable region" in hw/i386/acpi-build.c via
> >>>>>
> >>>>> 	build_srat_memory(table_data, machine->device_memory->base,
> >>>>> 			  hotpluggable_address_space_size, nb_numa_nodes - 1,
> >>>>> 			  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> >>>>>
> >>>>> So we tell the guest OS "this range is hotpluggable" and "it belongs to
> >>>>> this node unless the device says something different". From both values we
> >>>>> can -- when under QEMU -- conclude the maximum possible PFN and the maximum
> >>>>> possible node. But the latter is not what Linux does: it simply maps the last
> >>>>> numa node (indicated in the memory entry) to a PXM
> >>>>> (-> drivers/acpi/numa/srat.c:acpi_numa_memory_affinity_init()).      
> >>>> yeah.  There is nothing in ACPI that says there can't be holes in the node numbering
> >>>> so Linux does a remapping as you point out.
> >>>>       
> >>>>>
> >>>>>
> >>>>> I do wonder if we could simply expose the same hotpluggable range via multiple nodes:      
> >>>>
> >>>> Fairly sure the answer to this is no.  You'd have to indicate different ranges and
> >>>> then put the virtio-mem in the right one.       
> >>>
> >>> And I repeat, this is in no way different to DIMMs/NVDIMMs. We cannot predict
> >>> the future when hotplugging DIMMS/NVDIMMs/virtio-mem/... to some node later. We only
> >>> have access to that information when coldplugging devices, but even a
> >>> hotunplug+hotplug can change that info. Whatever we expose via ACPI is moot
> >>> already and just a hint to the guest OS "maximum possible PFN".    
> >>
> >> Sure, so the solution is a large non overlapping extra node for each node on the
> >> underlying physical system.  It uses a lot of PA space, but I'm going to assume
> >> the system isn't so big that that PA space exhaustion is a problem?  For a sensible setup
> >> those nodes would match the actual memory present on the underlying system.
> >>
> >> For physical CCIX systems we did this with SRAT entries with XTB per node to match
> >> what the host supported.  On our particular platform those PA ranges were well separated
> >> from each other due to how the system routing worked, but the principle is the same.
> >> Those supported a huge amount of memory being hotplugged.
> >>  
> >>>
> >>> We've been abusing ACPI hotpluggable region for years for virt DIMM hotplug,
> >>> putting it to some fantasy node and having it just work with hotplug of
> >>> DIMMs/NVDIMMs. The only issue we have is empty nodes. We differ from real
> >>> HW already significantly (especially, never exposing DIMMs via e820 to
> >>> the guest, which I call a feature and not a bug).    
> >>
> >> Understood.  
> >>>     
> >>>> Now I can't actually find anywhere in the
> >>>> ACPI spec that says that, but I'm 99% sure Linux won't like it, and I'm fairly sure if we
> >>>> query it with the ACPI folks the answer will be a "no, you can't do that".
> >>>
> >>> I didn't find anything that contradicts it in the spec as well. It's not really
> >>> specified what's allowed and what's not :)
> >>>
> >>> FWIW, the code I shared works with 5.11.12-300.fc34.x86_64 inside the guest flawlessly.    
> >>
> >> Hmm. I'm surprised that works at all and my worry is there is no reason it will continue
> >> to work.  
> > 
> > I've checked with some of our firmware people and the response was very much against doing this
> > on the basis it makes no sense in any physical system to have overlapping regions.
> > 
> > I'll reach out to our ASWG representatives to see if we can get the ACPI spec clarified.
> > (Given question is from a public mailing list this should be under the code first policy).
> > 
> > My view is that a clarification should be added to state that these regions must not overlap.  
> 
> I'd really appreciate if we could instead have something that makes virt
> happy as well ("makes no sense in any physical system"), because virt is
> most probably the biggest actual consumer of ACPI memory hotplug out
> there (!).

No problem with finding such a solution - but it's an ASWG question
(be it with a code first discussion). I have no idea what other
operating systems would do with overlapping nodes today.  We need to
jump through the hoops to make sure any solution is mutually agreed.
Maybe the solution is a new type of entry or flag that makes it clear
the 'real' node mapping is not PA range based?

> 
> I mean, for virt as is we will never know which PA range will belong to
> which node upfront. All we know is that there is a PA range that could
> belong to node X-Z. Gluing a single range to a single node doesn't make
> too much sense for virt, which is why we have just been using it to
> indicate the maximum possible PFN with a fantasy node.

I'm not convinced that's true. The physical memory
is coming from somewhere (assuming RAM backed). I would assume the ideal,
if going to the effort of passing NUMA into a VM, would be to convey
the same NUMA characteristics to the VM.  So add it to the VM at
the PA range that matches the appropriate host system NUMA node.

> 
> Overlapping regions would really simplify the whole thing, and I think
> if we go down that path we should go one step further and indicate the
> hotpluggable region to all nodes that might see hotplug (QEMU -> all
> nodes). The ACPI clarification would then be that we can have
> overlapping ranges and that on overlapping ranges all indicated nodes
> would be a possible target later. That would make perfect sense to me
> and make both phys and virt happy.

One alternative I mentioned briefly earlier is don't use ACPI at all.
For the new interconnects like CXL the decision was made that it wasn't
a suitable medium so they had CDAT (which is provided by the device)
instead. It's an open question how that will be handled by the OS at the
moment, but once solved (and it will need to be soon) that provides
a means to specify all the same data you get from ACPI NUMA description,
and leaves the OS to figure out how to merge it with its internal
representation of NUMA.

For virtio-mem / PCI at least it seems a fairly natural match.

> 
> 
> 
> Two ways to avoid overlapping regions, which aren't much better:
> 
> 1) Split the hotpluggable region up into fantasy regions and assign one
> fantasy region to each actual node.
> 
> The fantasy regions will have nothing to do with reality later (just like
> what we have right now with the last node getting assigned the whole
> hotpluggable region) and devices might overlap, but we don't really
> care, because the devices expose the actual node themselves.
> 
> 
> 2) Duplicate the hotpluggable region across all nodes
> 
> We would have one hotpluggable region with a dedicated PA space, and
> hotplug the device into the respective node's PA space.
> 
> That can be problematic, though, as we can easily run out of PA space.
> For example, my Ryzen 9 cannot address anything above 1 TiB. So if we'd
> have a hotpluggable region of 256 GiB, we'll already be in trouble with
> more than 3 nodes.

My assumption was that the reason to do this is to pass through node
mappings that line up with the underlying physical system.  If that's the case
then the hotpluggable regions for each node could be made to match what is
there.

Your Ryzen 9 would normally only have one node?

If the intent is to use these regions for more complex purposes (maybe
file-backed memory devices?) then things get more interesting, but how useful
is mapping them to conventional NUMA representations?

Thanks,

Jonathan

>
David Hildenbrand Nov. 19, 2021, 5:56 p.m. UTC | #20
>> I'd really appreciate if we could instead have something that makes virt
>> happy as well ("makes no sense in any physical system"), because virt is
>> most probably the biggest actual consumer of ACPI memory hotplug out
>> there (!).
> 
> No problem with finding such a solution - but it's an ASWG question
> (be it with a code first discussion). I have no idea what other
> operating systems would do with overlapping nodes today.  We need to
> jump through the hoops to make sure any solution is mutually agreed.
> Maybe the solution is a new type of entry or flag that makes it clear
> the 'real' node mapping is not PA range based?

Yeah, something like "we might see hotplug within this range to this
node" would clearly express what could happen as of now in QEMU.

> 
>>
>> I mean, for virt as is we will never know which PA range will belong to
>> which node upfront. All we know is that there is a PA range that could
>> belong to node X-Z. Gluing a single range to a single node doesn't make
>> too much sense for virt, which is why we have just been using it to
>> indicate the maximum possible PFN with a fantasy node.
> 
> I'm not convinced that's true. The physical memory
> is coming from somewhere (assuming RAM backed).  I would assume the ideal
> if going to the effort of passing NUMA into a VM, would be to convey
> the same NUMA characteristics to the VM.  So add it to the VM at
> the PA range that matches the appropriate host system NUMA node.

I think we only have real experience with vNUMA when passing through a
subset of real NUMA nodes -- performance differentiated memory was so
far not part of the bigger picture.

The issues start once you allow for more VM RAM than you have in your
hypervisor, simply because you can due to memory overcommit, file-backed
memory ... which can mess with the PA assumptions.

As you say, with everything fully RAM backed (excluding swap) there is no
overcommit, there are no emulated RAM devices and things are easier.

>>
>> Overlapping regions would really simplify the whole thing, and I think
>> if we go down that path we should go one step further and indicate the
>> hotpluggable region to all nodes that might see hotplug (QEMU -> all
>> nodes). The ACPI clarification would then be that we can have
>> overlapping ranges and that on overlapping ranges all indicated nodes
>> would be a possible target later. That would make perfect sense to me
>> and make both phys and virt happy.
> 
> One alternative I mentioned briefly earlier is don't use ACPI at all.
> For the new interconnects like CXL the decision was made that it wasn't
> a suitable medium so they had CDAT (which is provided by the device)
> instead. It's an open question how that will be handled by the OS at the
> moment, but once solved (and it will need to be soon) that provides
> a means to specify all the same data you get from ACPI NUMA description,
> and leaves the OS to figure out how to merge it with its internal
> representation of NUMA.
> 
> For virtio-mem / PCI at least it seems a fairly natural match.

Yes, for virtio-mem-pci it would be a natural match, I guess. I have
yet to look into the details. I'd be happy to use any other mechanism
than ACPI to

a) Tell the OS early about the maximum possible PFN
b) Tell the OS early about possible nodes
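
For reference, both pieces of information can already be recovered from
SRAT today. A simplified model of what an OS derives from memory
affinity entries at boot (loosely modeled on Linux's SRAT parsing; not
actual kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of an SRAT Memory Affinity entry and of what an OS
 * derives from it at boot: (a) the maximum possible PFN and (b) the set
 * of possible NUMA nodes. Loosely modeled on Linux's SRAT parsing; not
 * actual kernel code. */
struct srat_mem {
    uint32_t node;
    uint64_t base, len;
    uint32_t flags;
};

#define SRAT_ENABLED      (1u << 0)
#define SRAT_HOTPLUGGABLE (1u << 1)

static uint64_t max_possible_pfn;
static uint32_t possible_nodes;   /* small bitmap of node ids */

static void parse_entry(const struct srat_mem *m)
{
    uint64_t end_pfn;

    if (!(m->flags & SRAT_ENABLED)) {
        return;
    }
    possible_nodes |= 1u << m->node;     /* (b) node is possible */
    end_pfn = (m->base + m->len) >> 12;  /* 4 KiB pages */
    if (end_pfn > max_possible_pfn) {
        max_possible_pfn = end_pfn;      /* (a) max possible PFN */
    }
}
```

An empty node exposed with a zero-length, enabled entry (as the patch
below does) is enough to make the node "possible" in this model.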

>>
>>
>> Two ways to avoid overlapping regions, which aren't much better:
>>
>> 1) Split the hotpluggable region up into fantasy regions and assign one
>> fantasy region to each actual node.
>>
>> The fantasy regions will have nothing to do with reality later (just like
>> what we have right now with the last node getting assigned the whole
>> hotpluggable region) and devices might overlap, but we don't really
>> care, because the devices expose the actual node themselves.
>>
>>
>> 2) Duplicate the hotpluggable region across all nodes
>>
>> We would have one hotpluggable region with a dedicated PA space, and
>> hotplug the device into the respective node PA space.
>>
>> That can be problematic, though, as we can easily run out of PA space.
>> For example, my Ryzen 9 cannot address anything above 1 TiB. So if we'd
>> have a hotpluggable region of 256 GiB, we'll already be in trouble with
>> more than 3 nodes.
> 
> My assumption was that the reason to do this is to pass through node
> mappings that line up with the underlying physical system.  If that's the case
> then the hotpluggable regions for each node could be made to match what is
> there.
> 
> Your Ryzen 9 would normally only have one node?

Yes. I reckon it would support NVDIMMs, which one might want to expose
via a virtual NUMA node to the VM. I assume they would not be
represented via a dedicated NUMA node on my machine.

> 
> If the intent is to use these regions for more complex purposes (maybe file
> backed memory devices?) then things get more interesting, but how useful
> is mapping them to conventional NUMA representations?

Emulated NVDIMMs and virtio-pmem are the interesting cases, I guess.
The issue is rather that the PA layout of the real machine no longer
holds.
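
To put numbers on the PA-space concern raised for option 2 above (a
back-of-envelope sketch; the 1 TiB limit and 256 GiB region size are
taken from the Ryzen example earlier in the thread, and the base address
is an illustrative assumption):

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)
#define TiB (1ULL << 40)

/* How many per-node copies of the hotpluggable region fit between
 * 'base' and the host's addressable PA limit? */
static unsigned max_dup_nodes(uint64_t pa_limit, uint64_t base,
                              uint64_t region_size)
{
    return (unsigned)((pa_limit - base) / region_size);
}
```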

Patch

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 674f902652..a4c95b2f64 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -526,6 +526,7 @@  build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
     AcpiTable table = { .sig = "SRAT", .rev = 3, .oem_id = vms->oem_id,
                         .oem_table_id = vms->oem_table_id };
+    MemoryAffinityFlags flags;
 
     acpi_table_begin(&table, table_data);
     build_append_int_noprefix(table_data, 1, 4); /* Reserved */
@@ -547,12 +548,15 @@  build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 
     mem_base = vms->memmap[VIRT_MEM].base;
     for (i = 0; i < ms->numa_state->num_nodes; ++i) {
-        if (ms->numa_state->nodes[i].node_mem > 0) {
-            build_srat_memory(table_data, mem_base,
-                              ms->numa_state->nodes[i].node_mem, i,
-                              MEM_AFFINITY_ENABLED);
-            mem_base += ms->numa_state->nodes[i].node_mem;
+        if (ms->numa_state->nodes[i].node_mem) {
+            flags = MEM_AFFINITY_ENABLED;
+        } else {
+            flags = MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE;
         }
+
+        build_srat_memory(table_data, mem_base,
+                          ms->numa_state->nodes[i].node_mem, i, flags);
+        mem_base += ms->numa_state->nodes[i].node_mem;
     }
 
     if (ms->nvdimms_state->is_enabled) {