Message ID | 20230223081401.248835-1-gshan@redhat.com (mailing list archive)
---|---
Series | NUMA: Apply socket-NUMA-node boundary for aarch64 and RiscV machines
On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote: > For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is > used to populate the CPU topology in the Linux guest. It's required that > the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux > scheduling domain can't be sorted out, as the following warning message > indicates. To avoid the unexpected confusion, this series attempts to > rejects such kind of insane configurations. Hi Gavin, I'm not sure 'insane' is the right word, maybe 'irregular'. I've never seen, and don't really expect to find, specifications stating that arm and/or riscv must have 1:1 mappings for numa nodes and sockets. But, as QEMU machine types can choose certain configurations to support, if everyone is in agreement to restrict sbsa-ref and arm/riscv virt to 1:1 mappings, then why not. However, I suggest stating that this is simply a choice being made for these boards, since the expectation of needing "irregular" configurations is low. We should not be trying to use Linux assumptions as rationale, since other OSes may cope better with irregular configurations. Also, I suggest adding a comment to the boards, which adopt this restriction, which explains that it's only a platform choice, not something specified by the architecture. Thanks, drew > > -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ > -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ > -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ > -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ > > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910 > Modules linked in: > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1 > pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : build_sched_domains+0x284/0x910 > lr : build_sched_domains+0x184/0x910 > sp : ffff80000804bd50 > x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000 > x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840 > x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508 > x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014 > x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e > x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0 > x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041 > x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001 > x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002 > x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001 > Call trace: > build_sched_domains+0x284/0x910 > sched_init_domains+0xac/0xe0 > sched_init_smp+0x48/0xc8 > kernel_init_freeable+0x140/0x1ac > kernel_init+0x28/0x140 > ret_from_fork+0x10/0x20 > > PATCH[1] Improves numa/test for aarch64 to follow socket-to-NUMA-node boundary > PATCH[2] Validate the configuration if required in the NUMA subsystem > PATCH[3] Enable the validation for aarch64 machines > PATCH[4] Enable the validation for RiscV machines > > v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html > > Changelog > ========= > v2: > * Fix socket-NUMA-node boundary issues in qtests/numa-test (Gavin) > * Add helper set_numa_socket_boundary() and validate the > boundary in the generic path (Philippe) > > Gavin Shan (4): > qtest/numa-test: Follow socket-NUMA-node boundary for aarch64 > numa: Validate socket and NUMA node boundary if required > hw/arm: Validate socket and NUMA node boundary > hw/riscv: Validate socket and 
NUMA node boundary > > hw/arm/sbsa-ref.c | 2 ++ > hw/arm/virt.c | 2 ++ > hw/core/machine.c | 34 ++++++++++++++++++++++++++++++++++ > hw/core/numa.c | 7 +++++++ > hw/riscv/spike.c | 1 + > hw/riscv/virt.c | 1 + > include/sysemu/numa.h | 4 ++++ > tests/qtest/numa-test.c | 13 ++++++++++--- > 8 files changed, 61 insertions(+), 3 deletions(-) > > -- > 2.23.0 > >
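The patches themselves are not quoted in this thread, so for orientation here is a minimal sketch of the kind of check under discussion: rejecting a configuration in which CPUs from more than one socket land in the same NUMA node, walking QEMU's possible_cpus list. The helper name and wiring are hypothetical; the posted series does this through hw/core/machine.c and hw/core/numa.c with a set_numa_socket_boundary() helper instead.

```c
/*
 * Hypothetical sketch (not the code from the posted series): reject a
 * configuration in which CPUs from more than one socket are assigned to
 * the same NUMA node.  Assumes ms->possible_cpus has already been
 * populated by the board and that node IDs are below MAX_NODES.
 */
#include "qemu/osdep.h"
#include "qemu/error-report.h"
#include "hw/boards.h"
#include "sysemu/numa.h"

static void validate_socket_numa_boundary(MachineState *ms)
{
    const CPUArchIdList *cpus = ms->possible_cpus;
    int64_t node_to_socket[MAX_NODES];
    int i;

    for (i = 0; i < MAX_NODES; i++) {
        node_to_socket[i] = -1;
    }

    for (i = 0; i < cpus->len; i++) {
        const CpuInstanceProperties *props = &cpus->cpus[i].props;

        if (!props->has_node_id || !props->has_socket_id) {
            continue;   /* CPU not yet bound to a node, nothing to check */
        }
        if (node_to_socket[props->node_id] < 0) {
            /* first CPU seen for this node: remember its socket */
            node_to_socket[props->node_id] = props->socket_id;
        } else if (node_to_socket[props->node_id] != props->socket_id) {
            error_report("NUMA node %" PRId64 " spans socket %" PRId64
                         " and socket %" PRId64,
                         props->node_id, node_to_socket[props->node_id],
                         props->socket_id);
            exit(1);
        }
    }
}
```

A board that wanted the restriction would invoke something like this (or set a machine-class flag consumed by the generic NUMA code) from its init path; boards that do not care would simply skip it.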
On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
> used to populate the CPU topology in the Linux guest. It's required that
> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
> scheduling domain can't be sorted out, as the following warning message
> indicates. To avoid the unexpected confusion, this series attempts to
> rejects such kind of insane configurations.
>
> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \

This is somewhat odd as a config, because core 2 is in socket 0
and core 3 is in socket 1, so it wouldn't make much conceptual
sense to have them in the same NUMA node, while other cores within
the same socket are in different NUMA nodes. Basically, the split
of NUMA nodes is not aligned with any level in the topology.

This series, however, also rejects configurations that I would
normally consider to be reasonable. I've not tested Linux kernel
behaviour, but as a user I would expect to be able to ask for:

-smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
-numa node,nodeid=0,cpus=0,memdev=ram0 \
-numa node,nodeid=1,cpus=1,memdev=ram1 \
-numa node,nodeid=2,cpus=2,memdev=ram2 \
-numa node,nodeid=3,cpus=3,memdev=ram3 \
-numa node,nodeid=4,cpus=4,memdev=ram4 \
-numa node,nodeid=5,cpus=5,memdev=ram5 \

i.e. every core gets its own NUMA node.

Or to ask for every cluster as a NUMA node:

-smp 6,maxcpus=6,sockets=2,clusters=3,cores=1,threads=1 \
-numa node,nodeid=0,cpus=0,memdev=ram0 \
-numa node,nodeid=1,cpus=1,memdev=ram1 \
-numa node,nodeid=2,cpus=2,memdev=ram2 \
-numa node,nodeid=3,cpus=3,memdev=ram3 \
-numa node,nodeid=4,cpus=4,memdev=ram4 \
-numa node,nodeid=5,cpus=5,memdev=ram5 \

In both cases the NUMA split is aligned with a given level in the
topology, which was not the case with your example.

Rejecting these feels overly strict to me, and may risk breaking
existing valid deployments, unless we can demonstrate those
scenarios were unambiguously already broken?

If there was something in the hardware specs that requires this,
then it is more valid to do than if it is merely a specific guest
kernel limitation that might be fixed any day.

With regards,
Daniel
On 2/23/23 05:13, Gavin Shan wrote: > For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is > used to populate the CPU topology in the Linux guest. It's required that > the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux > scheduling domain can't be sorted out, as the following warning message > indicates. To avoid the unexpected confusion, this series attempts to > rejects such kind of insane configurations. > > -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ > -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ > -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ > -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ And why is this a QEMU problem? This doesn't hurt ACPI. Also, this restriction impacts breaks ARM guests in the wild that are running non-Linux OSes. I don't see why we should impact use cases that has nothing to do with Linux Kernel feelings about sockets - NUMA nodes exclusivity. Thanks, Daniel > > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910 > Modules linked in: > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1 > pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : build_sched_domains+0x284/0x910 > lr : build_sched_domains+0x184/0x910 > sp : ffff80000804bd50 > x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000 > x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840 > x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508 > x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014 > x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e > x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0 > x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041 > x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001 > x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002 > x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001 > Call trace: > build_sched_domains+0x284/0x910 > sched_init_domains+0xac/0xe0 > sched_init_smp+0x48/0xc8 > kernel_init_freeable+0x140/0x1ac > kernel_init+0x28/0x140 > ret_from_fork+0x10/0x20 > > PATCH[1] Improves numa/test for aarch64 to follow socket-to-NUMA-node boundary > PATCH[2] Validate the configuration if required in the NUMA subsystem > PATCH[3] Enable the validation for aarch64 machines > PATCH[4] Enable the validation for RiscV machines > > v1: https://lists.nongnu.org/archive/html/qemu-arm/2023-02/msg00886.html > > Changelog > ========= > v2: > * Fix socket-NUMA-node boundary issues in qtests/numa-test (Gavin) > * Add helper set_numa_socket_boundary() and validate the > boundary in the generic path (Philippe) > > Gavin Shan (4): > qtest/numa-test: Follow socket-NUMA-node boundary for aarch64 > numa: Validate socket and NUMA node boundary if required > hw/arm: Validate socket and NUMA node boundary > hw/riscv: Validate socket and NUMA node boundary > > hw/arm/sbsa-ref.c | 2 ++ > hw/arm/virt.c | 2 ++ > hw/core/machine.c | 34 ++++++++++++++++++++++++++++++++++ > hw/core/numa.c | 7 +++++++ > hw/riscv/spike.c | 1 + > hw/riscv/virt.c | 1 + > include/sysemu/numa.h | 4 ++++ > tests/qtest/numa-test.c | 13 ++++++++++--- > 8 files changed, 61 insertions(+), 3 deletions(-) >
On 2/23/23 11:57 PM, Daniel P. Berrangé wrote: > On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote: >> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is >> used to populate the CPU topology in the Linux guest. It's required that >> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux >> scheduling domain can't be sorted out, as the following warning message >> indicates. To avoid the unexpected confusion, this series attempts to >> rejects such kind of insane configurations. >> >> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ >> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ >> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ >> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ > > This is somewhat odd as a config, because core 2 is in socket 0 > and core 3 is in socket 1, so it wouldn't make much conceptual > sense to have them in the same NUMA node, while other cores within > the same socket are in different NUMA nodes. Basically the split > of NUMA nodes is not aligned with any level in the topology. > > This series, however, also rejects configurations that I would > normally consider to be reasonable. I've not tested linux kernel > behaviour though, but as a user I would expect to be able to > ask for: > > -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ > -numa node,nodeid=0,cpus=0,memdev=ram0 \ > -numa node,nodeid=1,cpus=1,memdev=ram1 \ > -numa node,nodeid=2,cpus=2,memdev=ram2 \ > -numa node,nodeid=3,cpus=3,memdev=ram3 \ > -numa node,nodeid=4,cpus=4,memdev=ram4 \ > -numa node,nodeid=5,cpus=5,memdev=ram5 \ > > ie, every core gets its own NUMA node > It doesn't work to Linux guest either. As the following warning message indicates, the Multicore domain isn't a subset of DIE (CLUSTER or socket) domain. For example, Multicore domain is 0-2 while DIE domain is 0 for CPU-0. [ 0.023486] CPU-0: 36,56,0,-1 thread=0 core=0-2 cluster=0-2 llc=0 // parsed from ACPI PPTT [ 0.023490] CPU-1: 36,56,1,-1 thread=1 core=0-2 cluster=0-2 llc=1 [ 0.023492] CPU-2: 36,56,2,-1 thread=2 core=0-2 cluster=0-2 llc=2 [ 0.023494] CPU-3: 136,156,3,-1 thread=3 core=3-5 cluster=3-5 llc=3 [ 0.023495] CPU-4: 136,156,4,-1 thread=4 core=3-5 cluster=3-5 llc=4 [ 0.023497] CPU-5: 136,156,5,-1 thread=5 core=3-5 cluster=3-5 llc=5 [ 0.023499] CPU-0: SMT=0 CLUSTER=0 MULTICORE=0-2 DIE=0 CPU-OF-NODE=0 // Seen by scheduling domain [ 0.023501] CPU-1: SMT=1 CLUSTER=1 MULTICORE=0-2 DIE=1 CPU-OF-NODE=1 [ 0.023503] CPU-2: SMT=2 CLUSTER=2 MULTICORE=0-2 DIE=2 CPU-OF-NODE=2 [ 0.023504] CPU-3: SMT=3 CLUSTER=3 MULTICORE=3-5 DIE=3 CPU-OF-NODE=3 [ 0.023506] CPU-4: SMT=4 CLUSTER=4 MULTICORE=3-5 DIE=4 CPU-OF-NODE=4 [ 0.023508] CPU-5: SMT=5 CLUSTER=5 MULTICORE=3-5 DIE=5 CPU-OF_NODE=5 : [ 0.023555] BUG: arch topology borken [ 0.023556] the MC domain not a subset of the DIE domain NOTE that both DIE and CPU-OF-NODE are same since they're all returned by 'cpumask_of_node(cpu_to_node(cpu))'. > Or to aask for every cluster as a numa node: > > -smp 6,maxcpus=6,sockets=2,clusters=3,cores=1,threads=1 \ > -numa node,nodeid=0,cpus=0,memdev=ram0 \ > -numa node,nodeid=1,cpus=1,memdev=ram1 \ > -numa node,nodeid=2,cpus=2,memdev=ram2 \ > -numa node,nodeid=3,cpus=3,memdev=ram3 \ > -numa node,nodeid=4,cpus=4,memdev=ram4 \ > -numa node,nodeid=5,cpus=5,memdev=ram5 \ > This case works fine to Linux guest. 
[ 0.024505] CPU-0: 36,56,0,-1 thread=0 core=0-2 cluster=0 llc=0 // parsed from ACPI PPTT
[ 0.024509] CPU-1: 36,96,1,-1 thread=1 core=0-2 cluster=1 llc=1
[ 0.024511] CPU-2: 36,136,2,-1 thread=2 core=0-2 cluster=2 llc=2
[ 0.024512] CPU-3: 176,196,3,-1 thread=3 core=3-5 cluster=3 llc=3
[ 0.024514] CPU-4: 176,236,4,-1 thread=4 core=3-5 cluster=4 llc=4
[ 0.024515] CPU-5: 176,276,5,-1 thread=5 core=3-5 cluster=5 llc=5
[ 0.024518] CPU-0: SMT=0 CLUSTER=0 MULTICORE=0 DIE=0 CPU-OF-NODE=0 // seen by the scheduling domain
[ 0.024519] CPU-1: SMT=1 CLUSTER=1 MULTICORE=1 DIE=1 CPU-OF-NODE=1
[ 0.024521] CPU-2: SMT=2 CLUSTER=2 MULTICORE=2 DIE=2 CPU-OF-NODE=2
[ 0.024522] CPU-3: SMT=3 CLUSTER=3 MULTICORE=3 DIE=3 CPU-OF-NODE=3
[ 0.024524] CPU-4: SMT=4 CLUSTER=4 MULTICORE=4 DIE=4 CPU-OF-NODE=4
[ 0.024525] CPU-5: SMT=5 CLUSTER=5 MULTICORE=5 DIE=5 CPU-OF-NODE=5

> In both cases the NUMA split is aligned with a given level in the
> topology, which was not the case with your example.
>
> Rejecting these feels overly strict to me, and may risk breaking
> existing valid deployments, unless we can demonstrate those
> scenarios were unambiguously already broken?
>
> If there was something in the hardware specs that requires this,
> then it is more valid to do than if it is merely a specific guest
> kernel limitation that might be fixed any day.
>

Yes, I agree that a socket-to-NUMA-node boundary is strict. However, it
doesn't seem sensible to split the CPUs of one cluster, or of one core,
across different NUMA nodes in a bare-metal environment. I think we
probably need to prevent these two cases, meaning that two clusters in one
socket are still allowed to be associated with different NUMA nodes.

I failed to get accurate information about the relation among
socket/cluster/core from the specs. As I understand it, the CPUs in one
core share the L2 cache and the cores in one cluster share the L3 cache,
while each thread has its own L1 cache. The L3 cache usually corresponds
to a NUMA node. I may be totally wrong here.

Thanks,
Gavin
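As an aside for readers unfamiliar with the scheduler warning quoted in this thread: the invariant that breaks in the "every core gets its own NUMA node" case is that each CPU's multicore (MC) span must be contained in its NUMA-node (DIE) span. The toy program below is not kernel code; it only reproduces the failing subset check with masks matching the failing example discussed earlier in this message.

```c
/*
 * Toy illustration (not kernel code) of the check behind "the MC domain
 * not a subset of the DIE domain".  CPUs 0-2 share socket 0 (MC span 0-2),
 * but each CPU sits in its own NUMA node, so the MC span is not contained
 * in the node span and the check fails for every CPU.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int is_subset(uint64_t inner, uint64_t outer)
{
    return (inner & ~outer) == 0;
}

int main(void)
{
    uint64_t mc[6]   = { 0x07, 0x07, 0x07, 0x38, 0x38, 0x38 }; /* socket span */
    uint64_t node[6] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20 }; /* node span   */

    for (int cpu = 0; cpu < 6; cpu++) {
        if (!is_subset(mc[cpu], node[cpu])) {
            printf("CPU %d: MC span 0x%02" PRIx64
                   " not a subset of DIE span 0x%02" PRIx64 "\n",
                   cpu, mc[cpu], node[cpu]);
        }
    }
    return 0;
}
```

With the per-cluster layout shown in the log just above, the MC span collapses to a single CPU, the subset test holds everywhere, and the guest builds its scheduling domains without complaint.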
On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote:
> On 2/23/23 05:13, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
>>
>> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
>> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
>> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \
>
> And why is this a QEMU problem? This doesn't hurt ACPI.
>
> Also, this restriction impacts breaks ARM guests in the wild that are running
> non-Linux OSes. I don't see why we should impact use cases that has nothing to
> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity.
>

With the above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while
CPU-3/4/5 are put into socket-1-cluster-0, meaning CPU-2 and CPU-3 belong to
different sockets and clusters. However, CPU-2/3 are associated with NUMA
node-1. In summary, multiple CPUs from different clusters and sockets have
been associated with one NUMA node.

If I'm correct, such a configuration isn't sensible in a bare-metal
environment, and the same Linux kernel is supposed to work well on bare-metal
and virtualized machines. So I think QEMU needs to emulate a topology that
matches the bare-metal environment as closely as we can. That's the reason
why I think it's a QEMU problem even if it doesn't hurt ACPI. As I said in
the reply to Daniel P. Berrangé <berrange@redhat.com> in another thread, we
may need to guarantee that the CPUs in one cluster can't be split across
multiple NUMA nodes, which matches the bare-metal environment as I
understand it.

Right, the restriction to a socket-NUMA-node or cluster-NUMA-node boundary
will definitely break configurations running in the wild.

Thanks,
Gavin

[...]
Hi Drew,

On 2/23/23 11:25 PM, Andrew Jones wrote:
> On Thu, Feb 23, 2023 at 04:13:57PM +0800, Gavin Shan wrote:
>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is
>> used to populate the CPU topology in the Linux guest. It's required that
>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux
>> scheduling domain can't be sorted out, as the following warning message
>> indicates. To avoid the unexpected confusion, this series attempts to
>> rejects such kind of insane configurations.
>
> I'm not sure 'insane' is the right word, maybe 'irregular'. I've never
> seen, and don't really expect to find, specifications stating that arm
> and/or riscv must have 1:1 mappings for numa nodes and sockets. But, as
> QEMU machine types can choose certain configurations to support, if
> everyone is in agreement to restrict sbsa-ref and arm/riscv virt to 1:1
> mappings, then why not. However, I suggest stating that this is simply a
> choice being made for these boards, since the expectation of needing
> "irregular" configurations is low. We should not be trying to use Linux
> assumptions as rationale, since other OSes may cope better with irregular
> configurations.
>
> Also, I suggest adding a comment to the boards, which adopt this
> restriction, which explains that it's only a platform choice, not
> something specified by the architecture.
>
[...]

Thanks for your comments. Yes, "irregular" is definitely a better word.
I failed to find any statement that ARM64 requires a 1:1 mapping between
socket and NUMA node, which makes it hard to tell whether the restriction
is required at the architectural level. However, the Linux driver
(drivers/base/arch_topology.c) considers both the CPU-to-NUMA-node
association and the CPU topology from ACPI PPTT when the CPU set of the
core group is calculated. That means the CPU-to-NUMA-node association can
affect the CPU topology seen by the Linux scheduler.

With the restriction of a 1:1 mapping between socket and NUMA node, or
between cluster and NUMA node as I mentioned in the reply to Daniel P.
Berrangé <berrange@redhat.com>, existing configurations used by other OSes
can become invalid.

Yes, a comment mentioning that it's a platform choice, not an architectural
requirement, definitely helps to understand the context.

Thanks,
Gavin
On 2/24/23 04:09, Gavin Shan wrote: > On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote: >> On 2/23/23 05:13, Gavin Shan wrote: >>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is >>> used to populate the CPU topology in the Linux guest. It's required that >>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux >>> scheduling domain can't be sorted out, as the following warning message >>> indicates. To avoid the unexpected confusion, this series attempts to >>> rejects such kind of insane configurations. >>> >>> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ >>> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ >>> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ >>> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ >> >> >> And why is this a QEMU problem? This doesn't hurt ACPI. >> >> Also, this restriction impacts breaks ARM guests in the wild that are running >> non-Linux OSes. I don't see why we should impact use cases that has nothing to >> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity. >> > > With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5 > are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and > cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple > CPUs in different clusters and sockets have been associated with one NUMA node. > > If I'm correct, the configuration isn't sensible in a baremetal environment and > same Linux kernel is supposed to work well for baremetal and virtualized machine. > So I think QEMU needs to emulate the topology as much as we can to match with the > baremetal environment. It's the reason why I think it's a QEMU problem even it > doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com> > in another thread, we may need to gurantee that the CPUs in one cluster can't be > split to multiple NUMA nodes, which matches with the baremetal environment, as I > can understand. > > Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will > definitely break the configurations running in the wild. What about a warning? If the user attempts to use an exotic NUMA configuration like the one you mentioned we can print something like: "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave" This would inform the user what might be going wrong in case Linux is crashing/error out on them and then the user is free to fix their topology (or the kernel). And at the same time we wouldn't break existing stuff that happens to be working. Thanks, Daniel > > Thanks, > Gavin > > [...] >
On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote: > On 2/24/23 04:09, Gavin Shan wrote: >> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote: >>> On 2/23/23 05:13, Gavin Shan wrote: >>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is >>>> used to populate the CPU topology in the Linux guest. It's required that >>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux >>>> scheduling domain can't be sorted out, as the following warning message >>>> indicates. To avoid the unexpected confusion, this series attempts to >>>> rejects such kind of insane configurations. >>>> >>>> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ >>>> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ >>>> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ >>>> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ >>> >>> >>> And why is this a QEMU problem? This doesn't hurt ACPI. >>> >>> Also, this restriction impacts breaks ARM guests in the wild that are running >>> non-Linux OSes. I don't see why we should impact use cases that has nothing to >>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity. >>> >> >> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5 >> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and >> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple >> CPUs in different clusters and sockets have been associated with one NUMA node. >> >> If I'm correct, the configuration isn't sensible in a baremetal environment and >> same Linux kernel is supposed to work well for baremetal and virtualized machine. >> So I think QEMU needs to emulate the topology as much as we can to match with the >> baremetal environment. It's the reason why I think it's a QEMU problem even it >> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com> >> in another thread, we may need to gurantee that the CPUs in one cluster can't be >> split to multiple NUMA nodes, which matches with the baremetal environment, as I >> can understand. >> >> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will >> definitely break the configurations running in the wild. > > > What about a warning? If the user attempts to use an exotic NUMA configuration > like the one you mentioned we can print something like: > > "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave" > > This would inform the user what might be going wrong in case Linux is crashing/error > out on them and then the user is free to fix their topology (or the kernel). And > at the same time we wouldn't break existing stuff that happens to be working. > > Yes, I think a warning message is more appropriate, so that users can fix their irregular configurations and the existing configurations aren't disconnected. It would be nice to get the agreements from Daniel P. Berrangé and Drew, before I'm going to change the code and post next revision. Thanks, Gavin
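If the thread does settle on a warning rather than a hard error, the QEMU-side change is small. A hedged sketch follows, reusing the hypothetical boundary check sketched earlier in this page; warn_report() is QEMU's standard warning channel, but the helper name and message text are illustrative only, not the code from any posted revision.

```c
#include "qemu/osdep.h"
#include "qemu/error-report.h"

/*
 * Illustrative only: downgrade the hypothetical boundary check from a hard
 * error to a warning, so irregular-but-intentional configurations keep
 * working while the user is told why Linux guests may complain.
 */
static void warn_numa_node_spans_sockets(int64_t node_id, int64_t socket_a,
                                         int64_t socket_b)
{
    warn_report("NUMA node %" PRId64 " contains CPUs from socket %" PRId64
                " and socket %" PRId64 "; guest OSes such as Linux may fail"
                " to build scheduling domains for this topology",
                node_id, socket_a, socket_b);
}
```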
On Fri, Feb 24, 2023 at 09:16:39PM +1100, Gavin Shan wrote: > On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote: > > On 2/24/23 04:09, Gavin Shan wrote: > > > On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote: > > > > On 2/23/23 05:13, Gavin Shan wrote: > > > > > For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is > > > > > used to populate the CPU topology in the Linux guest. It's required that > > > > > the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux > > > > > scheduling domain can't be sorted out, as the following warning message > > > > > indicates. To avoid the unexpected confusion, this series attempts to > > > > > rejects such kind of insane configurations. > > > > > > > > > > -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ > > > > > -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ > > > > > -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ > > > > > -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ > > > > > > > > > > > > And why is this a QEMU problem? This doesn't hurt ACPI. > > > > > > > > Also, this restriction impacts breaks ARM guests in the wild that are running > > > > non-Linux OSes. I don't see why we should impact use cases that has nothing to > > > > do with Linux Kernel feelings about sockets - NUMA nodes exclusivity. > > > > > > > > > > With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5 > > > are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and > > > cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple > > > CPUs in different clusters and sockets have been associated with one NUMA node. > > > > > > If I'm correct, the configuration isn't sensible in a baremetal environment and > > > same Linux kernel is supposed to work well for baremetal and virtualized machine. > > > So I think QEMU needs to emulate the topology as much as we can to match with the > > > baremetal environment. It's the reason why I think it's a QEMU problem even it > > > doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com> > > > in another thread, we may need to gurantee that the CPUs in one cluster can't be > > > split to multiple NUMA nodes, which matches with the baremetal environment, as I > > > can understand. > > > > > > Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will > > > definitely break the configurations running in the wild. > > > > > > What about a warning? If the user attempts to use an exotic NUMA configuration > > like the one you mentioned we can print something like: > > > > "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave" > > > > This would inform the user what might be going wrong in case Linux is crashing/error > > out on them and then the user is free to fix their topology (or the kernel). And > > at the same time we wouldn't break existing stuff that happens to be working. > > > > > > Yes, I think a warning message is more appropriate, so that users can fix their > irregular configurations and the existing configurations aren't disconnected. > It would be nice to get the agreements from Daniel P. Berrangé and Drew, before > I'm going to change the code and post next revision. > If there's a concern that this will break non-Linux OSes on arm, then, at most, the change needs to be tied to the next machine type version, and possibly it can never be made for arm. riscv is OK, since it currently ignores smp parameters anyway. 
It currently derives the number of sockets from the number of NUMA nodes, using a 1:1 mapping. When smp parameters are eventually implemented for riscv, then this can be revisited. Also, it sounds like not only has the rationale for this series been changed to "platform choice", but also that the cluster <-> numa node mapping should be 1:1, not the socket <-> numa node mapping. If that's the case, then the series probably needs to be reworked for that. Thanks, drew
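For completeness, tying such a behaviour change to the machine-type version would follow the usual compat pattern in hw/arm/virt.c. The sketch below only shows the shape of the gating Drew refers to; the enforce_numa_boundary field is invented for illustration and does not exist in the tree.

```c
/*
 * Hypothetical sketch of gating the check on the machine-type version,
 * following the existing virt compat pattern.  enforce_numa_boundary is
 * an invented VirtMachineClass field.
 */
static void virt_machine_8_0_options(MachineClass *mc)
{
    VirtMachineClass *vmc = VIRT_MACHINE_CLASS(OBJECT_CLASS(mc));

    vmc->enforce_numa_boundary = true;      /* new machine types only */
}
DEFINE_VIRT_MACHINE_AS_LATEST(8, 0)

static void virt_machine_7_2_options(MachineClass *mc)
{
    VirtMachineClass *vmc = VIRT_MACHINE_CLASS(OBJECT_CLASS(mc));

    virt_machine_8_0_options(mc);
    compat_props_add(mc->compat_props, hw_compat_7_2, hw_compat_7_2_len);
    /* virt-7.2 and older keep accepting irregular socket/NUMA layouts */
    vmc->enforce_numa_boundary = false;
}
DEFINE_VIRT_MACHINE(7, 2)
```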
On Fri, 24 Feb 2023 21:16:39 +1100 Gavin Shan <gshan@redhat.com> wrote: > On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote: > > On 2/24/23 04:09, Gavin Shan wrote: > >> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote: > >>> On 2/23/23 05:13, Gavin Shan wrote: > >>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is > >>>> used to populate the CPU topology in the Linux guest. It's required that > >>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux > >>>> scheduling domain can't be sorted out, as the following warning message > >>>> indicates. To avoid the unexpected confusion, this series attempts to > >>>> rejects such kind of insane configurations. > >>>> > >>>> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ > >>>> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ > >>>> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ > >>>> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ > >>> > >>> > >>> And why is this a QEMU problem? This doesn't hurt ACPI. > >>> > >>> Also, this restriction impacts breaks ARM guests in the wild that are running > >>> non-Linux OSes. I don't see why we should impact use cases that has nothing to > >>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity. > >>> > >> > >> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5 > >> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and > >> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple > >> CPUs in different clusters and sockets have been associated with one NUMA node. > >> > >> If I'm correct, the configuration isn't sensible in a baremetal environment and > >> same Linux kernel is supposed to work well for baremetal and virtualized machine. > >> So I think QEMU needs to emulate the topology as much as we can to match with the > >> baremetal environment. It's the reason why I think it's a QEMU problem even it > >> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com> > >> in another thread, we may need to gurantee that the CPUs in one cluster can't be > >> split to multiple NUMA nodes, which matches with the baremetal environment, as I > >> can understand. > >> > >> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will > >> definitely break the configurations running in the wild. > > > > > > What about a warning? If the user attempts to use an exotic NUMA configuration > > like the one you mentioned we can print something like: > > > > "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave" > > > > This would inform the user what might be going wrong in case Linux is crashing/error > > out on them and then the user is free to fix their topology (or the kernel). And > > at the same time we wouldn't break existing stuff that happens to be working. > > > > > > Yes, I think a warning message is more appropriate, so that users can fix their > irregular configurations and the existing configurations aren't disconnected. > It would be nice to get the agreements from Daniel P. Berrangé and Drew, before > I'm going to change the code and post next revision. Honestly you (and libvirt as far as I recall) are using legacy options to assign cpus to numa nodes. With '-numa node,nodeid=0,cpus=0-1' you can't really be sure what/where in topology those cpus are. What you can do is to use '-numa cpu,...' option to assign socket/core/... 
to numa node, e.g.:

  "-numa cpu,node-id=1,socket-id=0" or
  "-numa cpu,node-id=0,socket-id=1,core-id=0" or
  "-numa cpu,node-id=0,socket-id=1,core-id=1,thread-id=0"

to get your desired mapping.

The problem that has so far been stopping its adoption by libvirt (Michal
CCed) is that the values it uses are machine specific, so to do it properly
for a concrete '-M x -smp ...', at least the first time, QEMU should be
started with the -preconfig option and the user should then query the
possible CPUs for those values and assign them to NUMA nodes via QMP.

btw: on x86 we also allow 'insane' configurations, including ones that do
not exist on bare metal, and do not warn anyone about it (i.e. it's the
user's responsibility to provide a topology that the specific guest OS can
handle, aka it's not QEMU's business but the upper layers'). (I do
occasionally try to introduce stricter checks in that area, but that gets
pushback more often than not.)

I'd do the check only in the case of a specific board where the mapping is
fixed in the specs of the emulated machine; otherwise I wouldn't bother.

> Thanks,
> Gavin
>
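To make the -preconfig flow concrete: -preconfig, query-hotpluggable-cpus, set-numa-node and x-exit-preconfig are existing QEMU interfaces, but the CPU type and property values shown below depend on the machine type and accelerator, so treat them purely as an illustration of the exchange Igor describes.

```
# start QEMU paused in the preconfig state, with a QMP socket
qemu-system-aarch64 -M virt -smp 6,sockets=2,clusters=1,cores=3,threads=1 \
    -preconfig -qmp unix:/tmp/qmp.sock,server=on,wait=off ...

# 1. ask the board which CPU slots exist and which topology IDs they carry
{ "execute": "query-hotpluggable-cpus" }
{ "return": [ { "type": "host-arm-cpu", "vcpus-count": 1,
                "props": { "socket-id": 1, "cluster-id": 0,
                           "core-id": 2, "thread-id": 0 } },
              ... ] }

# 2. bind each reported slot to a NUMA node using exactly those IDs
{ "execute": "set-numa-node",
  "arguments": { "type": "cpu", "node-id": 2,
                 "socket-id": 1, "cluster-id": 0,
                 "core-id": 2, "thread-id": 0 } }

# 3. leave the preconfig state and let machine creation continue
{ "execute": "x-exit-preconfig" }
```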
On 2/25/23 1:20 AM, Igor Mammedov wrote: > On Fri, 24 Feb 2023 21:16:39 +1100 > Gavin Shan <gshan@redhat.com> wrote: > >> On 2/24/23 8:26 PM, Daniel Henrique Barboza wrote: >>> On 2/24/23 04:09, Gavin Shan wrote: >>>> On 2/24/23 12:18 AM, Daniel Henrique Barboza wrote: >>>>> On 2/23/23 05:13, Gavin Shan wrote: >>>>>> For arm64 and RiscV architecture, the driver (/base/arch_topology.c) is >>>>>> used to populate the CPU topology in the Linux guest. It's required that >>>>>> the CPUs in one socket can't span mutiple NUMA nodes. Otherwise, the Linux >>>>>> scheduling domain can't be sorted out, as the following warning message >>>>>> indicates. To avoid the unexpected confusion, this series attempts to >>>>>> rejects such kind of insane configurations. >>>>>> >>>>>> -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \ >>>>>> -numa node,nodeid=0,cpus=0-1,memdev=ram0 \ >>>>>> -numa node,nodeid=1,cpus=2-3,memdev=ram1 \ >>>>>> -numa node,nodeid=2,cpus=4-5,memdev=ram2 \ >>>>> >>>>> >>>>> And why is this a QEMU problem? This doesn't hurt ACPI. >>>>> >>>>> Also, this restriction impacts breaks ARM guests in the wild that are running >>>>> non-Linux OSes. I don't see why we should impact use cases that has nothing to >>>>> do with Linux Kernel feelings about sockets - NUMA nodes exclusivity. >>>>> >>>> >>>> With above configuration, CPU-0/1/2 are put into socket-0-cluster-0 while CPU-3/4/5 >>>> are put into socket-1-cluster-0, meaning CPU-2/3 belong to different socket and >>>> cluster. However, CPU-2/3 are associated with NUMA node-1. In summary, multiple >>>> CPUs in different clusters and sockets have been associated with one NUMA node. >>>> >>>> If I'm correct, the configuration isn't sensible in a baremetal environment and >>>> same Linux kernel is supposed to work well for baremetal and virtualized machine. >>>> So I think QEMU needs to emulate the topology as much as we can to match with the >>>> baremetal environment. It's the reason why I think it's a QEMU problem even it >>>> doesn't hurt ACPI. As I said in the reply to Daniel P. Berrangé <berrange@redhat.com> >>>> in another thread, we may need to gurantee that the CPUs in one cluster can't be >>>> split to multiple NUMA nodes, which matches with the baremetal environment, as I >>>> can understand. >>>> >>>> Right, the restriction to have socket-NUMA-node or cluster-NUMA-node boundary will >>>> definitely break the configurations running in the wild. >>> >>> >>> What about a warning? If the user attempts to use an exotic NUMA configuration >>> like the one you mentioned we can print something like: >>> >>> "Warning: NUMA topologies where a socket belongs to multiple NUMA nodes can cause OSes like Linux to misbehave" >>> >>> This would inform the user what might be going wrong in case Linux is crashing/error >>> out on them and then the user is free to fix their topology (or the kernel). And >>> at the same time we wouldn't break existing stuff that happens to be working. >>> >>> >> >> Yes, I think a warning message is more appropriate, so that users can fix their >> irregular configurations and the existing configurations aren't disconnected. >> It would be nice to get the agreements from Daniel P. Berrangé and Drew, before >> I'm going to change the code and post next revision. > > Honestly you (and libvirt as far as I recall) are using legacy options > to assign cpus to numa nodes. > With '-numa node,nodeid=0,cpus=0-1' you can't really be sure what/where > in topology those cpus are. > What you can do is to use '-numa cpu,...' 
option to assign socket/core/... > to numa node, ex: > "-numa cpu,node-id=1,socket-id=0 " or > "-numa cpu,node-id=0,socket-id=1,core-id=0 " or > "-numa cpu,node-id=0,socket-id=1,core-id=1,thread-id=0 > to get your desired mapping. > > The problem that's so far was stopping the later adoption by libvirt (Michal CCed) > is that values used by it are machine specific and to do it properly, for a concrete > '-M x -smp ...' at least for the first time qemu should be started with- > -preconfig option and then user should query possible cpus for those values > and assign them to numa nodes via QMP. > The issue isn't related to the legacy or modern way to configure CPU-to-NUMA association. Note that qtests/numa-test also use the legacy way. For example, the issue can be triggered with the following command lines where the modern configuration is used: /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ -accel kvm -machine virt,gic-version=host \ -cpu host -smp 6,sockets=2,clusters=1,cores=3,threads=1 \ -m 768M,slots=16,maxmem=64G \ -object memory-backend-ram,id=mem0,size=256M \ -object memory-backend-ram,id=mem1,size=256M \ -object memory-backend-ram,id=mem2,size=256M \ -numa node,nodeid=0,memdev=mem0 \ -numa node,nodeid=1,memdev=mem1 \ -numa node,nodeid=2,memdev=mem2 \ -numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=0,thread-id=0 \ -numa cpu,node-id=0,socket-id=0,cluster-id=0,core-id=1,thread-id=0 \ -numa cpu,node-id=1,socket-id=0,cluster-id=0,core-id=2,thread-id=0 \ -numa cpu,node-id=1,socket-id=1,cluster-id=0,core-id=0,thread-id=0 \ -numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=1,thread-id=0 \ -numa cpu,node-id=2,socket-id=1,cluster-id=0,core-id=2,thread-id=0 : : [ 0.022260] =============== sched_init_domains =================== [ 0.022263] CPU-0: 36,56,0,-1 thread=0 core=0-2 cluster=0-2 llc=0 [ 0.022267] CPU-1: 36,56,1,-1 thread=1 core=0-2 cluster=0-2 llc=1 [ 0.022269] CPU-2: 36,56,2,-1 thread=2 core=0-2 cluster=0-2 llc=2 [ 0.022270] CPU-3: 136,156,3,-1 thread=3 core=3-5 cluster=3-5 llc=3 [ 0.022272] CPU-4: 136,156,4,-1 thread=4 core=3-5 cluster=3-5 llc=4 [ 0.022273] CPU-5: 136,156,5,-1 thread=5 core=3-5 cluster=3-5 llc=5 [ 0.022275] CPU-0: SMT=0 CLUSTER=0 MULTICORE=0-2 CPUMASK=0-1 0-1 [ 0.022277] CPU-1: SMT=1 CLUSTER=1 MULTICORE=0-2 CPUMASK=0-1 0-1 [ 0.022279] CPU-2: SMT=2 CLUSTER=0-2 MULTICORE=2-3 CPUMASK=2-3 2-3 [ 0.022281] CPU-3: SMT=3 CLUSTER=3-5 MULTICORE=2-3 CPUMASK=2-3 2-3 [ 0.022283] CPU-4: SMT=4 CLUSTER=4 MULTICORE=3-5 CPUMASK=4-5 4-5 [ 0.022284] CPU-5: SMT=5 CLUSTER=5 MULTICORE=3-5 CPUMASK=4-5 4-5 [ 0.022314] build_sched_domains: CPU-0 level-SMT [ 0.022317] build_sched_domains: CPU-0 level-CLS [ 0.022318] topology_span_sane: CPU-0 0 vs CPU-2 0-2 [ 0.022322] ------------[ cut here ]------------ [ 0.022323] WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2275 build_sched_domains+0x135c/0x1660 : > btw: on x86 we also allow 'insane' configuration incl. those that do not > exist on baremetal and do not warn about it anyone (i.e. it's user's > responsibility to provide topology that specific guest OS could handle, > aka it's not QEMU business but upper layers). (I do occasionally try > introduce stricter checks in that are, but that gets push back more > often that not). > > I'd do check only in case of a specific board where mapping is fixed > in specs of emulated machine, otherwise I wouldn't bother. > Right, x86 can handle the irregular configuration well and we didn't find any triggered warning messages there. 
As the subject indicates, it's an ARM64- or riscv-specific issue.
Unfortunately, I failed to find the requirement for a socket-to-NUMA-node
or cluster-to-NUMA-node 1:1 mapping in the specs. However, it's not
sensible to split the CPUs of one cluster across multiple NUMA nodes in an
(ARM64 or RISC-V) bare-metal environment, and QEMU should emulate an
environment as close to the bare-metal machine as we can.

In summary, a warning message when the CPUs of one cluster are split across
multiple NUMA nodes, as Daniel suggested, sounds reasonable to me. Please
let me know your thoughts, Igor :)

Thanks,
Gavin