Message ID | 20180824230559.32336-1-shuah@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | arm64: add NUMA emulation support |
On Fri, Aug 24, 2018 at 05:05:59PM -0600, Shuah Khan (Samsung OSG) wrote:
> Add NUMA emulation support to emulate NUMA on non-NUMA platforms. A new
> CONFIG_NUMA_EMU option enables NUMA emulation and a new kernel command
> line option "numa=fake=N" allows users to specify the configuration for
> emulation.
>
> When NUMA emulation is enabled, a flat (non-NUMA) machine will be split
> into virtual NUMA nodes when booted with "numa=fake=N", where N is the
> number of nodes. The system RAM will be split into N equal chunks and
> assigned to each node.
>
> Emulated nodes are bounded by MAX_NUMNODES and the memory block count
> to avoid splitting memory blocks across NUMA nodes.
>
> If NUMA emulation init fails, it will fall back to dummy NUMA init.
>
> This is tested on Raspberry Pi3b+ with the ltp NUMA test suite, numactl,
> and numastat tools. In addition, it is tested in conjunction with the
> cpuset cgroup to verify cpuset.cpus and cpuset.mems assignments.
>
> Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
> ---
>  arch/arm64/Kconfig            |   9 +++
>  arch/arm64/include/asm/numa.h |   8 +++
>  arch/arm64/mm/Makefile        |   1 +
>  arch/arm64/mm/numa.c          |   4 ++
>  arch/arm64/mm/numa_emu.c      | 109 ++++++++++++++++++++++++++++++++++
>  5 files changed, 131 insertions(+)
>  create mode 100644 arch/arm64/mm/numa_emu.c

Hmm, is this just for debugging and kernel development? If so, it's quite a
lot of code just for that. Can't you achieve the same thing by faking up the
firmware tables?

Will
On 08/28/2018 11:40 AM, Will Deacon wrote:
> On Fri, Aug 24, 2018 at 05:05:59PM -0600, Shuah Khan (Samsung OSG) wrote:
>> Add NUMA emulation support to emulate NUMA on non-NUMA platforms. A new
>> CONFIG_NUMA_EMU option enables NUMA emulation and a new kernel command
>> line option "numa=fake=N" allows users to specify the configuration for
>> emulation.
[...]
>> create mode 100644 arch/arm64/mm/numa_emu.c
>
> Hmm, is this just for debugging and kernel development? If so, it's quite a
> lot of code just for that. Can't you achieve the same thing by faking up the
> firmware tables?
>
> Will

The main intent is to use NUMA emulation in conjunction with cpusets for
coarse memory management, similar to the x86_64 use-case for the same. This
adds the same feature supported by x86_64, as described in
x86/x86_64/fake-numa-for-cpusets ("Using numa=fake and CPUSets for Resource
Management").

I verified the restricted/unrestricted cases with this patch on a Raspberry
Pi3b+, using the cpuset cgroup to check cpuset.cpus and cpuset.mems
assignments, and could see the difference in memory usage between the two
cases. Using this, it will be possible to restrict memory usage by a class
of processes or a workload, or to set aside memory for a workload.

This can also be used to regression test higher level NUMA changes on
non-NUMA platforms without firmware changes, and will provide a way to
expand NUMA regression testing in kernel test rings. I was able to run the
ltp NUMA tests and verify the NUMA policy code on a non-NUMA platform.
Results below.

numa01    1  TINFO: The system contains 4 nodes:
numa01    1  TPASS: NUMA local node and memory affinity
numa01    2  TPASS: NUMA preferred node policy
numa01    3  TPASS: NUMA share memory allocated in preferred node
numa01    4  TPASS: NUMA interleave policy
numa01    5  TPASS: NUMA interleave policy on shared memory
numa01    6  TPASS: NUMA phycpubind policy
numa01    7  TPASS: NUMA local node allocation
numa01    8  TPASS: NUMA MEMHOG policy
numa01    9  TPASS: NUMA policy on lib NUMA_NODE_SIZE API
numa01   10  TPASS: NUMA MIGRATEPAGES policy
numa01   11  TCONF: hugepage is not supported
grep: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
numa01   12  TCONF: THP is not supported/enabled

Summary:
passed   10
failed   0
skipped  2
warnings 0

thanks,
-- Shuah
On Tue 28-08-18 12:09:53, Shuah Khan wrote:
[...]
> The main intent is to use NUMA emulation in conjunction with cpusets for
> coarse memory management, similar to the x86_64 use-case for the same.

Could you be more specific please? Why would you want a hack like this
when you have a full featured memory cgroup controller to limit the
amount of memory?
Hi Michal,

Sorry for the delay in responding. I was traveling last week.

On 08/29/2018 05:08 AM, Michal Hocko wrote:
> On Tue 28-08-18 12:09:53, Shuah Khan wrote:
> [...]
>> The main intent is to use NUMA emulation in conjunction with cpusets for
>> coarse memory management, similar to the x86_64 use-case for the same.
>
> Could you be more specific please? Why would you want a hack like this
> when you have a full featured memory cgroup controller to limit the
> amount of memory?

I should have given more details about the nature of the memory management
use-case this patch addresses.

The memory cgroup allows specifying memory limits and controls the memory
footprint of tasks in a cgroup. However, there are some limitations:

- Memory isn't reserved for the cgroup and there is no guarantee that the
  memory will be available when the cgroup needs it.
- cgroups allocate from the same system memory pool, which is shared with
  other cgroups. Since the root cgroup doesn't have limits, it could
  potentially impact the performance of other cgroups in high memory
  pressure situations.
- Allocating entire memory blocks to a cgroup to ensure reservation and
  isolation isn't possible; pages can be re-allocated to other processes.

With NUMA emulation, memory blocks can be split and assigned to emulated
nodes, so both reservation and isolation can be supported. This will
support the following workload requirements:

- reserving one or more NUMA memory nodes for a class of critical tasks
  that require guaranteed memory availability.
- isolating memory blocks with guaranteed exclusive access.

NUMA emulation to split the flat machine into "x" number of nodes, combined
with the cpuset cgroup and the following example configuration, will make
it possible to support the above workloads on non-NUMA platforms.

numa=fake=4

cpuset.mems=2
cpuset.cpus=2
cpuset.mem_exclusive=1 (enable exclusive use of the memory nodes by a CPU set)
cpuset.mem_hardwall=1  (separate the memory nodes that are allocated to different cgroups)

thanks,
-- Shuah
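[As a side illustration of the isolation side of this configuration (not part of the patch or this thread): the effective cpuset.mems restriction can be checked from user space with libnuma, the library behind the numactl tool mentioned in the patch description. The following is a minimal sketch, assuming libnuma is installed and the system is booted with numa=fake=4; the file name memsallowed.c is hypothetical. Run it once from the root cgroup and once from the restricted cpuset to see the allowed node list shrink to node 2.

/*
 * memsallowed.c - print the NUMA nodes this task may allocate from
 * (its effective cpuset.mems). Build with: gcc -o memsallowed memsallowed.c -lnuma
 */
#include <stdio.h>
#include <numa.h>

int main(void)
{
	struct bitmask *allowed;
	int node;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available (booted without numa=fake=N?)\n");
		return 1;
	}

	allowed = numa_get_mems_allowed();	/* the task's allowed memory nodes */
	printf("mems allowed:");
	for (node = 0; node <= numa_max_node(); node++)
		if (numa_bitmask_isbitset(allowed, node))
			printf(" %d", node);
	printf("\n");

	numa_bitmask_free(allowed);
	return 0;
}
]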
On Tue 04-09-18 15:59:34, Shuah Khan wrote:
[...]
> This will support the following workload requirements:
>
> - reserving one or more NUMA memory nodes for a class of critical tasks
>   that require guaranteed memory availability.
> - isolating memory blocks with guaranteed exclusive access.

How do you enforce that the kernel doesn't allocate from those reserved
nodes? They will be in the fallback zonelists, so once the memory gets used
up on all the other nodes the kernel will happily spill over to your
reserved node.

> NUMA emulation to split the flat machine into "x" number of nodes, combined
> with the cpuset cgroup and the following example configuration, will make
> it possible to support the above workloads on non-NUMA platforms.
>
> numa=fake=4
>
> cpuset.mems=2
> cpuset.cpus=2
> cpuset.mem_exclusive=1 (enable exclusive use of the memory nodes by a CPU set)
> cpuset.mem_hardwall=1  (separate the memory nodes that are allocated to different cgroups)

This will only constrain userspace, and I strongly suspect that tasks in
the root cgroup will be allowed to allocate from those nodes as well.
On 09/05/2018 12:42 AM, Michal Hocko wrote:
> On Tue 04-09-18 15:59:34, Shuah Khan wrote:
> [...]
>> This will support the following workload requirements:
>>
>> - reserving one or more NUMA memory nodes for a class of critical tasks
>>   that require guaranteed memory availability.
>> - isolating memory blocks with guaranteed exclusive access.
>
> How do you enforce that the kernel doesn't allocate from those reserved
> nodes? They will be in the fallback zonelists, so once the memory gets used
> up on all the other nodes the kernel will happily spill over to your
> reserved node.

I should have clarified the scope of "isolating memory blocks with
guaranteed exclusive access". The kernel does satisfy GFP_ATOMIC
allocations at the expense of the cpuset exclusive/hardwall policies, so as
not to stress the kernel. It is not the intent to make sure the kernel
doesn't allocate from these reserved nodes; the intent is to work within
the constraints of the cpuset mem_exclusive and mem_hardwall policies.

>> NUMA emulation to split the flat machine into "x" number of nodes, combined
>> with the cpuset cgroup and the following example configuration, will make
>> it possible to support the above workloads on non-NUMA platforms.
>>
>> numa=fake=4
>>
>> cpuset.mems=2
>> cpuset.cpus=2
>> cpuset.mem_exclusive=1 (enable exclusive use of the memory nodes by a CPU set)
>> cpuset.mem_hardwall=1  (separate the memory nodes that are allocated to different cgroups)
>
> This will only constrain userspace, and I strongly suspect that tasks in
> the root cgroup will be allowed to allocate from those nodes as well.

A few critical allocations could be satisfied and the root cgroup prevails;
it is not the intent to have exclusivity at the expense of the kernel.

This feature will allow a way to configure cpusets on non-NUMA platforms
for workloads that can benefit from the reservation and isolation that is
available within the constraints of exclusive cpuset policies.

thanks,
-- Shuah
On Thu 06-09-18 15:53:34, Shuah Khan wrote:
[...]
> A few critical allocations could be satisfied and the root cgroup prevails;
> it is not the intent to have exclusivity at the expense of the kernel.

Well, it is not "a few critical allocations". It can be a lot of memory;
basically any GFP_KERNEL allocation. So how exactly do you expect this to
work when you cannot estimate how much memory the kernel will eat?

> This feature will allow a way to configure cpusets on non-NUMA platforms
> for workloads that can benefit from the reservation and isolation that is
> available within the constraints of exclusive cpuset policies.

AFAIR this was the first approach Google took for memory isolation, and
they moved over to memory cgroups. I would recommend talking to those guys
before you introduce a potentially large amount of code that will not
really work for the workload you intend it for.
On 09/07/2018 02:34 AM, Michal Hocko wrote:
> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
> [...]
>> A few critical allocations could be satisfied and the root cgroup prevails;
>> it is not the intent to have exclusivity at the expense of the kernel.
>
> Well, it is not "a few critical allocations". It can be a lot of memory;
> basically any GFP_KERNEL allocation. So how exactly do you expect this to
> work when you cannot estimate how much memory the kernel will eat?
>
>> This feature will allow a way to configure cpusets on non-NUMA platforms
>> for workloads that can benefit from the reservation and isolation that is
>> available within the constraints of exclusive cpuset policies.
>
> AFAIR this was the first approach Google took for memory isolation, and
> they moved over to memory cgroups.

In addition to isolation, being able to reserve a block is one of the
issues I am looking to address. Unfortunately memory cgroups won't address
that issue.

> I would recommend talking to those guys before you introduce a potentially
> large amount of code that will not really work for the workload you intend
> it for.

Will you be able to point me to a good contact at Google and/or some
pointers for finding the discussion on memory isolation? My searches on
lkml came up short.

thanks,
-- Shuah
On Fri 07-09-18 16:30:59, Shuah Khan wrote:
> On 09/07/2018 02:34 AM, Michal Hocko wrote:
> [...]
>
> In addition to isolation, being able to reserve a block is one of the
> issues I am looking to address. Unfortunately memory cgroups won't address
> that issue.

Could you be more specific about why you need reservations other than
isolation?

>> I would recommend talking to those guys before you introduce a potentially
>> large amount of code that will not really work for the workload you intend
>> it for.
>
> Will you be able to point me to a good contact at Google and/or some
> pointers for finding the discussion on memory isolation? My searches on
> lkml came up short.

Well, Ying Han, who used to work on memcg in the early days, is working on
a different project now, so I am not really sure.
https://lwn.net/Articles/459585/ might tell you more.
Hi Michal,

On 09/10/2018 07:48 AM, Michal Hocko wrote:
> On Fri 07-09-18 16:30:59, Shuah Khan wrote:
>> On 09/07/2018 02:34 AM, Michal Hocko wrote:
>>> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
[....]
>>
>> In addition to isolation, being able to reserve a block is one of the
>> issues I am looking to address. Unfortunately memory cgroups won't address
>> that issue.
>
> Could you be more specific about why you need reservations other than
> isolation?

Taking automotive as a specific example, there are two classes of
applications:

1. critical applications that must run
2. infotainment and misc. user-space

In this case, being able to reserve a block of memory for critical
applications will ensure the memory is available for them. If a critical
application has to restart, and/or when an on-demand critical application
starts, it might not be able to allocate memory if the memory is not
reserved.

When a flat system has multiple memory blocks, NUMA emulation in
conjunction with cpusets allows one or more blocks to be reserved for
critical applications by configuring a set of cpus and one or more memory
nodes for them.

Memory cgroups will not support such a reservation. Hope this helps explain
the use-case I am trying to address with this patch.

thanks,
-- Shuah
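[To make the reservation use-case above concrete: the sketch below is not part of the patch or the thread; it is a hypothetical critical application, assuming the numa=fake=4 split with node 2 set aside through the exclusive cpuset and libnuma available. It pins its CPU placement and allocations to the reserved emulated node.

/*
 * critical.c - sketch of a critical application pinning itself to a
 * reserved emulated node (node 2 in the numa=fake=4 example).
 * Build with: gcc -o critical critical.c -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <numa.h>

#define RESERVED_NODE	2		/* assumed reserved node */

int main(void)
{
	size_t len = 64 << 20;		/* 64 MiB working set */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available\n");
		return 1;
	}

	/* run only on the CPUs of the reserved node ... */
	if (numa_run_on_node(RESERVED_NODE) < 0)
		perror("numa_run_on_node");

	/* ... and take the working set from its memory */
	buf = numa_alloc_onnode(len, RESERVED_NODE);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}
	memset(buf, 0, len);		/* touch the pages so they are faulted in */

	/* real work would go here */

	numa_free(buf, len);
	return 0;
}

With the cpuset configuration shown earlier in the thread, the same placement can also be imposed from outside by putting the task into the critical cpuset, without any code changes.]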
On Mon 10-09-18 20:02:05, Shuah Khan wrote:
> Hi Michal,
>
> On 09/10/2018 07:48 AM, Michal Hocko wrote:
>> On Fri 07-09-18 16:30:59, Shuah Khan wrote:
> [....]
>>
>> Could you be more specific about why you need reservations other than
>> isolation?
>
> Taking automotive as a specific example, there are two classes of
> applications:
>
> 1. critical applications that must run
> 2. infotainment and misc. user-space
>
> In this case, being able to reserve a block of memory for critical
> applications will ensure the memory is available for them. If a critical
> application has to restart, and/or when an on-demand critical application
> starts, it might not be able to allocate memory if the memory is not
> reserved.
>
> When a flat system has multiple memory blocks, NUMA emulation in
> conjunction with cpusets allows one or more blocks to be reserved for
> critical applications by configuring a set of cpus and one or more memory
> nodes for them.
>
> Memory cgroups will not support such a reservation. Hope this helps explain
> the use-case I am trying to address with this patch.

OK, that is more clear. I still believe that you either have to have very
good control over memory allocations, or good luck, to not see unexpected
kernel allocations in your reserved memory which might easily break the
guarantees you would like to accomplish.
On 09/11/2018 03:11 AM, Michal Hocko wrote:
> On Mon 10-09-18 20:02:05, Shuah Khan wrote:
>> Hi Michal,
>>
>> On 09/10/2018 07:48 AM, Michal Hocko wrote:
>>> On Fri 07-09-18 16:30:59, Shuah Khan wrote:
>> [....]
>>
>> Memory cgroups will not support such a reservation. Hope this helps explain
>> the use-case I am trying to address with this patch.
>
> OK, that is more clear. I still believe that you either have to have very
> good control over memory allocations, or good luck, to not see unexpected
> kernel allocations in your reserved memory which might easily break the
> guarantees you would like to accomplish.

Thanks. Right, I am with you on the possibility that the root cgroup can
eat into the reserved memory. However, with the solution I proposed, there
is a guarantee that a cpuset cgroup configured for the non-critical
infotainment and misc. user-space applications will not be able to allocate
from the reserved memory node.

I am hoping the proposed patch will allow critical apps to reserve memory,
with the exception that the root cgroup and the kernel can still allocate
from it when needed. Perhaps the cpuset exclusive logic could be extended
to look for non-exclusive memory nodes first, if it doesn't already do
that. This is in line with the current cpuset approach, which is that
critical kernel allocations aren't starved in order to ensure memory
reservations.

If you don't think this solution is ideal/good, do you have other
suggestions for solving the problem? If not, would it be okay to start with
what I proposed and build on top of it as needed?

thanks,
-- Shuah
On Tue, Sep 11, 2018 at 09:27:49AM -0600, Shuah Khan wrote:
> On 09/11/2018 03:11 AM, Michal Hocko wrote:
>> On Mon 10-09-18 20:02:05, Shuah Khan wrote:
> [....]
>
> Thanks. Right, I am with you on the possibility that the root cgroup can
> eat into the reserved memory. However, with the solution I proposed, there
> is a guarantee that a cpuset cgroup configured for the non-critical
> infotainment and misc. user-space applications will not be able to allocate
> from the reserved memory node.
>
> I am hoping the proposed patch will allow critical apps to reserve memory,
> with the exception that the root cgroup and the kernel can still allocate
> from it when needed. Perhaps the cpuset exclusive logic could be extended
> to look for non-exclusive memory nodes first, if it doesn't already do
> that. This is in line with the current cpuset approach, which is that
> critical kernel allocations aren't starved in order to ensure memory
> reservations.
>
> If you don't think this solution is ideal/good, do you have other
> suggestions for solving the problem? If not, would it be okay to start with
> what I proposed and build on top of it as needed?

I still don't understand why this can't be achieved by faking up some NUMA
entries in the firmware table and just using the existing NUMA code that we
have.

Will
On 09/11/2018 10:50 AM, Will Deacon wrote:
> On Tue, Sep 11, 2018 at 09:27:49AM -0600, Shuah Khan wrote:
>> On 09/11/2018 03:11 AM, Michal Hocko wrote:
> [....]
>>
>> If you don't think this solution is ideal/good, do you have other
>> suggestions for solving the problem? If not, would it be okay to start with
>> what I proposed and build on top of it as needed?
>
> I still don't understand why this can't be achieved by faking up some NUMA
> entries in the firmware table and just using the existing NUMA code that we
> have.

That is what this patch is doing, in some ways. Instead of hacking the
firmware tables, it provides a command line option to split the flat
machine into the specified number of NUMA memory nodes. In addition to the
new config option and the new command line handling, I added one init
routine that handles the NUMA emulation, and after that the normal NUMA
code is leveraged.
The only change is the following, added to arm64_numa_init():

+	if (!numa_init(arm64_numa_emu_init))
+		return;

arm64_numa_emu_init() does nothing unless the kernel is booted with the new
"numa=fake=N" option.

Please note that I am not adding any new NUMA code other than just this
init routine. When the command line option is specified, instead of going
down the dummy_numa_init() path it will create NUMA emulation nodes. I was
very careful in identifying the minimal amount of code needed to add this
support. The change is limited to two existing routines,
numa_parse_early_param() and arm64_numa_init(); the new path goes through
numa_init(), which is common to all the variants including the fallback
dummy_numa_init().

This allows a cleaner way to split the memory and leverage all of the NUMA
code, and makes it easier to debug problems as opposed to hacked firmware
tables.

thanks,
-- Shuah
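[For reference, this is roughly what arm64_numa_init() looks like with the hunk applied. The sketch is reconstructed from the diff context in the patch below; the numa_off check and the arm64_acpi_numa_init() call come from the surrounding mainline code and are not touched by this patch.

void __init arm64_numa_init(void)
{
	if (!numa_off) {
		/* firmware-described NUMA topology, if any */
		if (!acpi_disabled && !numa_init(arm64_acpi_numa_init))
			return;
		if (acpi_disabled && !numa_init(of_numa_init))
			return;
		/* new: emulated nodes, only active when booted with numa=fake=N */
		if (!numa_init(arm64_numa_emu_init))
			return;
	}

	/* fall back to a single dummy node */
	numa_init(dummy_numa_init);
}
]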
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 29e75b47becd..6e74d9995d24 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -740,6 +740,15 @@ config NUMA
 	  local memory of the CPU and add some more
 	  NUMA awareness to the kernel.
 
+config NUMA_EMU
+	bool "NUMA emulation"
+	depends on NUMA
+	help
+	  Enable NUMA emulation. A flat machine will be split into virtual
+	  nodes when booted with "numa=fake=N", where N is the number of
+	  nodes, the system RAM will be split into N equal chunks, and
+	  assigned to each node.
+
 config NODES_SHIFT
 	int "Maximum NUMA Nodes (as a power of 2)"
 	range 1 10
diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01e83bf..16e8cc035872 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -29,6 +29,14 @@ static inline const struct cpumask *cpumask_of_node(int node)
 }
 #endif
 
+#ifdef CONFIG_NUMA_EMU
+void arm64_numa_emu_cmdline(char *str);
+extern int arm64_numa_emu_init(void);
+#else
+static inline void arm64_numa_emu_cmdline(char *str) {}
+static inline int arm64_numa_emu_init(void) { return -1; }
+#endif /* CONFIG_NUMA_EMU */
+
 void __init arm64_numa_init(void);
 int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..2c8634daeffa 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_ARM64_PTDUMP_CORE)	+= dump.o
 obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
 obj-$(CONFIG_NUMA)		+= numa.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
+obj-$(CONFIG_NUMA_EMU)		+= numa_emu.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 146c04ceaa51..9232f18e3992 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -43,6 +43,8 @@ static __init int numa_parse_early_param(char *opt)
 		return -EINVAL;
 	if (!strncmp(opt, "off", 3))
 		numa_off = true;
+	if (!strncmp(opt, "fake=", 5))
+		arm64_numa_emu_cmdline(opt + 5);
 
 	return 0;
 }
@@ -460,6 +462,8 @@ void __init arm64_numa_init(void)
 			return;
 		if (acpi_disabled && !numa_init(of_numa_init))
 			return;
+		if (!numa_init(arm64_numa_emu_init))
+			return;
 	}
 
 	numa_init(dummy_numa_init);
diff --git a/arch/arm64/mm/numa_emu.c b/arch/arm64/mm/numa_emu.c
new file mode 100644
index 000000000000..97217adb029e
--- /dev/null
+++ b/arch/arm64/mm/numa_emu.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NUMA Emulation for non-NUMA platforms.
+ */
+
+#include <linux/numa.h>
+#include <linux/nodemask.h>
+#include <linux/pfn.h>
+#include <linux/bootmem.h>
+#include <linux/memblock.h>
+
+#include <asm/numa.h>
+
+static char *emu_cmdline __initdata;
+
+/*
+ * arm64_numa_emu_cmdline - parse early NUMA Emulation params.
+ */
+void __init arm64_numa_emu_cmdline(char *str)
+{
+	emu_cmdline = str;
+}
+
+/*
+ * arm64_numa_emu_init - Initialize NUMA Emulation
+ *
+ * Used when NUMA Emulation is enabled on a platform without underlying
+ * NUMA architecture.
+ */
+int __init arm64_numa_emu_init(void)
+{
+	u64 node_size;
+	int node_cnt = 0;
+	int mblk_cnt = 0;
+	int node = 0;
+	struct memblock_region *mblk;
+	bool split = false;
+	int ret;
+
+	pr_info("NUMA emulation init begin\n");
+
+	if (!emu_cmdline)
+		return -EINVAL;
+	/*
+	 * Split the system RAM into N equal chunks.
+	 */
+	ret = kstrtoint(emu_cmdline, 0, &node_cnt);
+	if (ret || node_cnt <= 0)
+		return -EINVAL;
+
+	if (node_cnt > MAX_NUMNODES)
+		node_cnt = MAX_NUMNODES;
+
+	node_size = PFN_PHYS(max_pfn) / node_cnt;
+	pr_info("NUMA emu: Node Size = %#018Lx Node = %d\n",
+		node_size, node_cnt);
+
+	for_each_memblock(memory, mblk)
+		mblk_cnt++;
+
+	/*
+	 * Size the node count to match the memory block count to avoid
+	 * splitting memory blocks across nodes. If there is only one
+	 * memory block split it.
+	 */
+	if (mblk_cnt <= node_cnt) {
+		pr_info("NUMA emu: Nodes (%d) >= Memblocks (%d)\n",
+			node_cnt, mblk_cnt);
+		if (mblk_cnt == 1) {
+			split = true;
+			pr_info("NUMA emu: Splitting single Memory Block\n");
+		} else {
+			node_cnt = mblk_cnt;
+			pr_info("NUMA emu: Adjust Nodes = Memory Blocks\n");
+		}
+	}
+
+	for_each_memblock(memory, mblk) {
+
+		if (split) {
+			for (node = 0; node < node_cnt; node++) {
+				u64 start, end;
+
+				start = mblk->base + node * node_size;
+				end = start + node_size;
+				pr_info("Adding an emulation node %d for [mem %#018Lx-%#018Lx]\n",
+					node, start, end);
+				ret = numa_add_memblk(node, start, end);
+				if (!ret)
+					continue;
+				pr_err("NUMA emulation init failed\n");
+				return ret;
+			}
+			break;
+		}
+		pr_info("Adding a emulation node %d for [mem %#018Lx-%#018Lx]\n",
+			node, mblk->base, mblk->base + mblk->size);
+		ret = numa_add_memblk(node, mblk->base,
+				      mblk->base + mblk->size);
+		if (!ret)
+			continue;
+		pr_err("NUMA emulation init failed\n");
+		return ret;
+	}
+	pr_info("NUMA: added %d emulation nodes of %#018Lx size each\n",
+		node_cnt, node_size);
+
+	return 0;
+}
Add NUMA emulation support to emulate NUMA on non-NUMA platforms. A new
CONFIG_NUMA_EMU option enables NUMA emulation and a new kernel command
line option "numa=fake=N" allows users to specify the configuration for
emulation.

When NUMA emulation is enabled, a flat (non-NUMA) machine will be split
into virtual NUMA nodes when booted with "numa=fake=N", where N is the
number of nodes. The system RAM will be split into N equal chunks and
assigned to each node.

Emulated nodes are bounded by MAX_NUMNODES and the memory block count
to avoid splitting memory blocks across NUMA nodes.

If NUMA emulation init fails, it will fall back to dummy NUMA init.

This is tested on Raspberry Pi3b+ with the ltp NUMA test suite, numactl,
and numastat tools. In addition, it is tested in conjunction with the
cpuset cgroup to verify cpuset.cpus and cpuset.mems assignments.

Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
---
 arch/arm64/Kconfig            |   9 +++
 arch/arm64/include/asm/numa.h |   8 +++
 arch/arm64/mm/Makefile        |   1 +
 arch/arm64/mm/numa.c          |   4 ++
 arch/arm64/mm/numa_emu.c      | 109 ++++++++++++++++++++++++++++++++++
 5 files changed, 131 insertions(+)
 create mode 100644 arch/arm64/mm/numa_emu.c