arm64: add NUMA emulation support

Message ID 20180824230559.32336-1-shuah@kernel.org (mailing list archive)
State New, archived
Series arm64: add NUMA emulation support

Commit Message

Shuah Aug. 24, 2018, 11:05 p.m. UTC
Add NUMA emulation support for non-NUMA platforms. A new CONFIG_NUMA_EMU
option enables NUMA emulation, and a new kernel command line option
"numa=fake=N" allows users to specify the configuration for emulation.

When NUMA emulation is enabled, a flat (non-NUMA) machine is split into
virtual NUMA nodes when booted with "numa=fake=N", where N is the number of
nodes; the system RAM is split into N equal chunks and assigned to the nodes.

Emulated nodes are bounded by MAX_NUMNODES and by the memory block count,
to avoid splitting memory blocks across NUMA nodes.

If NUMA emulation init fails, it will fall back to dummy NUMA init.

This was tested on a Raspberry Pi 3B+ with the LTP NUMA test suite and the
numactl and numastat tools. In addition, it was tested in conjunction with the
cpuset cgroup to verify cpuset.cpus and cpuset.mems assignments.

Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
---
 arch/arm64/Kconfig            |   9 +++
 arch/arm64/include/asm/numa.h |   8 +++
 arch/arm64/mm/Makefile        |   1 +
 arch/arm64/mm/numa.c          |   4 ++
 arch/arm64/mm/numa_emu.c      | 109 ++++++++++++++++++++++++++++++++++
 5 files changed, 131 insertions(+)
 create mode 100644 arch/arm64/mm/numa_emu.c

Comments

Will Deacon Aug. 28, 2018, 5:40 p.m. UTC | #1
On Fri, Aug 24, 2018 at 05:05:59PM -0600, Shuah Khan (Samsung OSG) wrote:
> Add NUMA emulation support to emulate NUMA on non-NUMA platforms. A new
> CONFIG_NUMA_EMU option enables NUMA emulation and a new kernel command
> line option "numa=fake=N" allows users to specify the configuration for
> emulation.
> 
> When NUMA emulation is enabled, a flat (non-NUMA) machine will be split
> into virtual NUMA nodes when booted with "numa=fake=N", where N is the
> number of nodes, the system RAM will be split into N equal chunks and
> assigned to each node.
> 
> Emulated nodes are bounded by MAX_NUMNODES and the number of memory block
> count to avoid splitting memory blocks across NUMA nodes.
> 
> If NUMA emulation init fails, it will fall back to dummy NUMA init.
> 
> This is tested on Raspberry Pi3b+ with ltp NUMA test suite, numactl, and
> numastat tools. In addition, tested in conjunction with cpuset cgroup to
> verify cpuset.cpus and cpuset.mems assignments.
> 
> Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
> ---
>  arch/arm64/Kconfig            |   9 +++
>  arch/arm64/include/asm/numa.h |   8 +++
>  arch/arm64/mm/Makefile        |   1 +
>  arch/arm64/mm/numa.c          |   4 ++
>  arch/arm64/mm/numa_emu.c      | 109 ++++++++++++++++++++++++++++++++++
>  5 files changed, 131 insertions(+)
>  create mode 100644 arch/arm64/mm/numa_emu.c

Hmm, is this just for debugging and kernel development? If so, it's quite a
lot of code just for that. Can't you achieve the same thing by faking up the
firmware tables?

Will
Shuah Aug. 28, 2018, 6:09 p.m. UTC | #2
On 08/28/2018 11:40 AM, Will Deacon wrote:
> On Fri, Aug 24, 2018 at 05:05:59PM -0600, Shuah Khan (Samsung OSG) wrote:
>> Add NUMA emulation support to emulate NUMA on non-NUMA platforms. A new
>> CONFIG_NUMA_EMU option enables NUMA emulation and a new kernel command
>> line option "numa=fake=N" allows users to specify the configuration for
>> emulation.
>>
>> When NUMA emulation is enabled, a flat (non-NUMA) machine will be split
>> into virtual NUMA nodes when booted with "numa=fake=N", where N is the
>> number of nodes, the system RAM will be split into N equal chunks and
>> assigned to each node.
>>
>> Emulated nodes are bounded by MAX_NUMNODES and the number of memory block
>> count to avoid splitting memory blocks across NUMA nodes.
>>
>> If NUMA emulation init fails, it will fall back to dummy NUMA init.
>>
>> This is tested on Raspberry Pi3b+ with ltp NUMA test suite, numactl, and
>> numastat tools. In addition, tested in conjunction with cpuset cgroup to
>> verify cpuset.cpus and cpuset.mems assignments.
>>
>> Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
>> ---
>>  arch/arm64/Kconfig            |   9 +++
>>  arch/arm64/include/asm/numa.h |   8 +++
>>  arch/arm64/mm/Makefile        |   1 +
>>  arch/arm64/mm/numa.c          |   4 ++
>>  arch/arm64/mm/numa_emu.c      | 109 ++++++++++++++++++++++++++++++++++
>>  5 files changed, 131 insertions(+)
>>  create mode 100644 arch/arm64/mm/numa_emu.c
> 
> Hmm, is this just for debugging and kernel development? If so, it's quite a
> lot of code just for that. Can't you achieve the same thing by faking up the
> firmware tables?
> 
> Will
> 

The main intent is to use NUMA emulation in conjunction with cpusets for coarse
memory management, similar to the x86_64 use-case for the same feature.

I verified the restricted and unrestricted cases using the cpuset cgroup, checking the
cpuset.cpus and cpuset.mems assignments, and could see the difference in memory usage
between the two. Using this, it will be possible to restrict memory usage by a class of
processes or a workload, or to set aside memory for a workload.

This adds the same feature supported by x86_64 as described in

x86/x86_64/fake-numa-for-cpusets
Using numa=fake and CPUSets for Resource Management

I could see the restricted/unrestricted memory usage differences with this
patch on a Raspberry Pi 3B+.

This can also be used to regression test higher-level NUMA changes on non-NUMA
platforms without firmware changes. It will also provide a way to expand NUMA
regression testing in kernel rings.

I was able to run the LTP NUMA tests with this and verify the NUMA policy code on a
non-NUMA platform. Results below.

numa01 1 TINFO: The system contains 4 nodes: 
numa01 1 TPASS: NUMA local node and memory affinity
numa01 2 TPASS: NUMA preferred node policy
numa01 3 TPASS: NUMA share memory allocated in preferred node
numa01 4 TPASS: NUMA interleave policy
numa01 5 TPASS: NUMA interleave policy on shared memory
numa01 6 TPASS: NUMA phycpubind policy
numa01 7 TPASS: NUMA local node allocation
numa01 8 TPASS: NUMA MEMHOG policy
numa01 9 TPASS: NUMA policy on lib NUMA_NODE_SIZE API
numa01 10 TPASS: NUMA MIGRATEPAGES policy
numa01 11 TCONF: hugepage is not supported
grep: /sys/kernel/mm/transparent_hugepage/enabled: No such file or directory
numa01 12 TCONF: THP is not supported/enabled

Summary:
passed   10
failed   0
skipped  2
warnings 0
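
For anyone wanting to sanity-check the emulated topology before running the suite,
a minimal probe along these lines could be used (a sketch only, assuming libnuma is
installed on the target; it is not part of this patch):

/* Sketch: confirm the emulated nodes are visible (assumes libnuma, build with -lnuma). */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	int node, max;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		return 1;
	}

	max = numa_max_node();
	printf("%d node(s) visible\n", max + 1);

	for (node = 0; node <= max; node++) {
		long long free_bytes;
		long long total = numa_node_size64(node, &free_bytes);

		printf("node %d: %lld bytes total, %lld bytes free\n",
		       node, total, free_bytes);
	}
	return 0;
}

On a flat machine booted with "numa=fake=4" this should report four nodes of roughly
equal size.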

thanks,
-- Shuah
Michal Hocko Aug. 29, 2018, 11:08 a.m. UTC | #3
On Tue 28-08-18 12:09:53, Shuah Khan wrote:
[...]
> The main intent is to use numa emulation in conjunction with cpusets for coarse
> memory management similar to x86_64 use-case for the same.

Could you be more specific please? Why would you want a hack like this
when you have a full featured memory cgroup controller to limit the
amount of memory?
Shuah Sept. 4, 2018, 9:59 p.m. UTC | #4
Hi Michal,

Sorry for the delay in responding. I was traveling last week.

On 08/29/2018 05:08 AM, Michal Hocko wrote:
> On Tue 28-08-18 12:09:53, Shuah Khan wrote:
> [...]
>> The main intent is to use numa emulation in conjunction with cpusets for coarse
>> memory management similar to x86_64 use-case for the same.
> 
> Could you be more specific please? Why would you want a hack like this
> when you have a full featured memory cgroup controller to limit the
> amount of memory?
> 

I should have given more details about the nature of the memory management use-case
this patch addresses.

The memory cgroup allows specifying memory limits and controls the memory footprint
of tasks in a cgroup.

However, there are some limitations:

- Memory isn't reserved for the cgroup, and there is no guarantee that the memory will
  be available when the cgroup needs it.

- cgroups allocate from the same system memory pool, which is shared with other cgroups.
  Since the root cgroup doesn't have limits, it could potentially impact performance of
  other cgroups in high memory pressure situations.

- Allocating entire memory blocks to a cgroup to ensure reservation and isolation isn't
  possible. Pages can be re-allocated to other processes.

With NUMA emulation, memory blocks can be split and assigned to emulated nodes, so both
reservation and isolation can be supported.

This will support the following workload requirements:

- reserving one or more NUMA memory nodes for a class of critical tasks that require
  guaranteed memory availability.
- isolating memory blocks with guaranteed exclusive access.

Using NUMA emulation to split the flat machine into "x" nodes, combined with the cpuset
cgroup and the following example configuration, will make it possible to support the
above workloads on non-NUMA platforms.

numa=fake=4

cpuset.mems=2
cpuset.cpus=2
cpuset.mem_exclusive=1 (enabling exclusive use of the memory nodes by a CPU set)
cpuset.mem_hardwall=1  (separate the memory nodes that are allocated to different cgroups)
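
As a rough illustration, the cpuset side of that configuration could be set up from
user space along these lines (a sketch only, assuming a cgroup v1 cpuset hierarchy
mounted at /sys/fs/cgroup/cpuset and a group named "critical"; the paths, group name
and cpu/node numbers are examples, not part of the patch):

/* Sketch: create a v1 cpuset group bound to emulated node 2 and join it. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	if (fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	const char *grp = "/sys/fs/cgroup/cpuset/critical";	/* assumed mount point */
	char path[256], pid[32];

	if (mkdir(grp, 0755) && errno != EEXIST) {
		perror(grp);
		return 1;
	}

	snprintf(path, sizeof(path), "%s/cpuset.mems", grp);
	write_str(path, "2");		/* emulated memory node 2 */
	snprintf(path, sizeof(path), "%s/cpuset.cpus", grp);
	write_str(path, "2");
	snprintf(path, sizeof(path), "%s/cpuset.mem_exclusive", grp);
	write_str(path, "1");
	snprintf(path, sizeof(path), "%s/cpuset.mem_hardwall", grp);
	write_str(path, "1");

	snprintf(path, sizeof(path), "%s/tasks", grp);
	snprintf(pid, sizeof(pid), "%d", getpid());
	write_str(path, pid);		/* move this task into the cpuset */

	return 0;
}

Any task moved into that group is then restricted to emulated node 2 for both CPU
placement and memory allocation.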

thanks,
-- Shuah
Michal Hocko Sept. 5, 2018, 6:42 a.m. UTC | #5
On Tue 04-09-18 15:59:34, Shuah Khan wrote:
[...]
> This will support the following workload requirements:
> 
> - reserving one or more NUMA memory nodes for class of critical tasks that require
>   guaranteed memory availability.
> - isolate memory blocks with a guaranteed exclusive access.

How do you enforce kernel doesn't allocate from those reserved nodes?
They will be in a fallback zonelists so once the memory gets used on all
other ones then the kernel happily spills over to your reserved node.

> NUMA emulation to split the flat machine into "x" number of nodes, combined with
> cpuset cgroup with the following example configuration will make it possible to
> support the above workloads on non-NUMA platforms.
> 
> numa=fake=4
> 
> cpuset.mems=2
> cpuset.cpus=2
> cpuset.mem_exclusive=1 (enabling exclusive use of the memory nodes by a CPU set)
> cpuset.mem_hardwall=1  (separate the memory nodes that are allocated to different cgroups)

This will only enforce userspace to follow and I strongly suspect that
tasks in the root cgroup will be allowed to allocate as well.
Shuah Sept. 6, 2018, 9:53 p.m. UTC | #6
On 09/05/2018 12:42 AM, Michal Hocko wrote:
> On Tue 04-09-18 15:59:34, Shuah Khan wrote:
> [...]
>> This will support the following workload requirements:
>>
>> - reserving one or more NUMA memory nodes for class of critical tasks that require
>>   guaranteed memory availability.
>> - isolate memory blocks with a guaranteed exclusive access.
> 
> How do you enforce kernel doesn't allocate from those reserved nodes?
> They will be in a fallback zonelists so once the memory gets used on all
> other ones then the kernel happily spills over to your reserved node.

I should have clarified the scope of "isolate memory blocks with a guaranteed
exclusive access".

The kernel does satisfy GFP_ATOMIC allocations at the expense of the cpuset
exclusive/hardwall policies, so as not to stress the kernel.

It is not the intent to make sure the kernel doesn't allocate from these reserved
nodes. The intent is to work within the constraints of the cpuset.mem_exclusive and
cpuset.mem_hardwall policies.

> 
>> NUMA emulation to split the flat machine into "x" number of nodes, combined with
>> cpuset cgroup with the following example configuration will make it possible to
>> support the above workloads on non-NUMA platforms.
>>
>> numa=fake=4
>>
>> cpuset.mems=2
>> cpuset.cpus=2
>> cpuset.mem_exclusive=1 (enabling exclusive use of the memory nodes by a CPU set)
>> cpuset.mem_hardwall=1  (separate the memory nodes that are allocated to different cgroups)
> 
> This will only enforce userspace to follow and I strongly suspect that
> tasks in the root cgroup will be allowed to allocate as well.
> 

A few critical allocations could be satisfied and the root cgroup prevails. It is not the
intent to have exclusivity at the expense of the kernel.

This feature will allow a way to configure cpusets on non-NUMA platforms for workloads
that can benefit from the reservation and isolation that is available within the
constraints of exclusive cpuset policies.

thanks,
-- Shuah
Michal Hocko Sept. 7, 2018, 8:34 a.m. UTC | #7
On Thu 06-09-18 15:53:34, Shuah Khan wrote:
[...]
> A few critical allocations could be satisfied and root cgroup prevails. It is not the
> intent to have exclusivity at the expense of the kernel.

Well, it is not "few critical allocations". It can be a lot of
memory. Basically any GFP_KERNEL allocation. So how exactly you expect
this to work when you cannot estimate how much
memory will kernel eat?

> 
> This feature will allow a way to configure cpusets on non-NUMA for workloads that can
> benefit from the reservation and isolation that is available within the constraints of
> exclusive cpuset policies.

AFAIR this was the first approach Google took for memory isolation, and they
moved over to memory cgroups. I would recommend talking to those guys before
you introduce potentially a lot of code that will not really work for the
workload you intend it for.
Shuah Sept. 7, 2018, 10:30 p.m. UTC | #8
On 09/07/2018 02:34 AM, Michal Hocko wrote:
> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
> [...]
>> A few critical allocations could be satisfied and root cgroup prevails. It is not the
>> intent to have exclusivity at the expense of the kernel.
> 
> Well, it is not "few critical allocations". It can be a lot of
> memory. Basically any GFP_KERNEL allocation. So how exactly you expect
> this to work when you cannot estimate how much
> memory will kernel eat?
> 
>>
>> This feature will allow a way to configure cpusets on non-NUMA for workloads that can
>> benefit from the reservation and isolation that is available within the constraints of
>> exclusive cpuset policies.
> 
> AFAIR this was the first approach Google took for the memory isolation
> and they moved over to memory cgroups. 

In addition to isolation, being able to reserve a memory block is one of the
issues I am looking to address. Unfortunately memory cgroups won't address that
issue.

> I would recommend talking to
> those guys before you introduce potentially a lot of code that will not
> really work for the workload you intend it for.
> 

Would you be able to point me to a good contact at Google and/or some pointers
to the discussion on memory isolation? My searches on lkml came up short.

thanks,
-- Shuah
Michal Hocko Sept. 10, 2018, 1:48 p.m. UTC | #9
On Fri 07-09-18 16:30:59, Shuah Khan wrote:
> On 09/07/2018 02:34 AM, Michal Hocko wrote:
> > On Thu 06-09-18 15:53:34, Shuah Khan wrote:
> > [...]
> >> A few critical allocations could be satisfied and root cgroup prevails. It is not the
> >> intent to have exclusivity at the expense of the kernel.
> > 
> > Well, it is not "few critical allocations". It can be a lot of
> > memory. Basically any GFP_KERNEL allocation. So how exactly you expect
> > this to work when you cannot estimate how much
> > memory will kernel eat?
> > 
> >>
> >> This feature will allow a way to configure cpusets on non-NUMA for workloads that can
> >> benefit from the reservation and isolation that is available within the constraints of
> >> exclusive cpuset policies.
> > 
> > AFAIR this was the first approach Google took for the memory isolation
> > and they moved over to memory cgroups. 
> 
> In addition to isolation, being able to reserve a block instead is one of the
> issues I am looking to address. Unfortunately memory cgroups won't address that
> issue.

Could you be more specific why you need reservations other than
isolation.

> I would recommend to talk to
> > those guys bebfore you introduce potentially a lot of code that will not
> > really work for the workload you indend it for.
> > 
> 
> Would you be able to point me to a good contact at Google and/or some pointers
> to the discussion on memory isolation? My searches on lkml came up short.

Well, Ying Han, who worked on memcg in its early days, is now working on a
different project, so I am not really sure.
https://lwn.net/Articles/459585/ might tell you more.
Shuah Sept. 11, 2018, 2:02 a.m. UTC | #10
Hi Michal,

On 09/10/2018 07:48 AM, Michal Hocko wrote:
> On Fri 07-09-18 16:30:59, Shuah Khan wrote:
>> On 09/07/2018 02:34 AM, Michal Hocko wrote:
>>> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
[....]
>>
>> In addition to isolation, being able to reserve a block instead is one of the
>> issues I am looking to address. Unfortunately memory cgroups won't address that
>> issue.
> 
> Could you be more specific why you need reservations other than
> isolation.
> 

Taking automotive as a specific example, there are two classes of applications:
1. critical applications that must run
2. Infotainment and misc. user-space.

In this case, being able to reserve a block of memory for critical applications
will ensure the memory is available for them. If a critical application has to
restart, or when an on-demand critical application starts, it might not be able
to allocate memory if memory is not reserved.

When a flat system has multiple memory blocks, NUMA emulation in conjunction with
cpusets allows one or more blocks to be reserved for critical applications by
configuring a set of cpus and one or more memory nodes for them.

Memory cgroups will not support such reservation. Hope this helps explain the use-case
I am trying to address with this patch.
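
To make that use-case concrete, a critical application could pin its allocations to the
reserved emulated node roughly like this (a sketch only, assuming libnuma and node 2 as
the reserved node; the node number and API choice are examples, not part of this patch):

/* Sketch: run on and allocate from the reserved emulated node (build with -lnuma). */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define RESERVED_NODE	2	/* assumed: node set aside via cpuset.mems */

int main(void)
{
	size_t len = 64 << 20;	/* example 64 MB working buffer */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support (booted with numa=fake=N?)\n");
		return 1;
	}

	/* Keep this task's CPUs on the reserved node, mirroring cpuset.cpus. */
	if (numa_run_on_node(RESERVED_NODE) < 0) {
		perror("numa_run_on_node");
		return 1;
	}

	/* Place the buffer explicitly on the reserved node. */
	buf = numa_alloc_onnode(len, RESERVED_NODE);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}
	memset(buf, 0, len);	/* touch the pages so they are actually faulted in */

	/* ... critical work ... */

	numa_free(buf, len);
	return 0;
}

Combined with the exclusive cpuset configuration above, non-critical cgroups cannot
allocate from that node, while the critical application gets its reserved block.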

thanks,
-- Shuah
Michal Hocko Sept. 11, 2018, 9:11 a.m. UTC | #11
On Mon 10-09-18 20:02:05, Shuah Khan wrote:
> Hi Michal,
> 
> On 09/10/2018 07:48 AM, Michal Hocko wrote:
> > On Fri 07-09-18 16:30:59, Shuah Khan wrote:
> >> On 09/07/2018 02:34 AM, Michal Hocko wrote:
> >>> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
> [....]
> >>
> >> In addition to isolation, being able to reserve a block instead is one of the
> >> issues I am looking to address. Unfortunately memory cgroups won't address that
> >> issue.
> > 
> > Could you be more specific why you need reservations other than
> > isolation.
> > 
> 
> Taking automotive as a specific example, there are two classes of applications:
> 1. critical applications that must run
> 2. Infotainment and misc. user-space.
> 
> In this case, being able to reserve a block of memory for critical applications
> will ensure the memory is available for them. If a critical application has to
> restart and/or when an on-demand critical application starts, it might not be able
> to allocate memory if it is not reserved.
> 
> When a flat system has multiple memory blocks, with NUMA emulation in conjunction with
> cpusets, one or more block can be reserved for critical applications configuring a set
> of cpus and one of more memory nodes for them.
> 
> Memory cgroups will not support such reservation. Hope this helps explain the use-case
> I am trying to address with this patch.

OK, that is clearer. I still believe that you either have to have very good
control over memory allocations, or good luck, to not see unexpected kernel
allocations in your reserved memory, which might easily break the guarantees
you would like to accomplish.
Shuah Sept. 11, 2018, 3:27 p.m. UTC | #12
On 09/11/2018 03:11 AM, Michal Hocko wrote:
> On Mon 10-09-18 20:02:05, Shuah Khan wrote:
>> Hi Michal,
>>
>> On 09/10/2018 07:48 AM, Michal Hocko wrote:
>>> On Fri 07-09-18 16:30:59, Shuah Khan wrote:
>>>> On 09/07/2018 02:34 AM, Michal Hocko wrote:
>>>>> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
>> [....]
>>>>
>>>> In addition to isolation, being able to reserve a block instead is one of the
>>>> issues I am looking to address. Unfortunately memory cgroups won't address that
>>>> issue.
>>>
>>> Could you be more specific why you need reservations other than
>>> isolation.
>>>
>>
>> Taking automotive as a specific example, there are two classes of applications:
>> 1. critical applications that must run
>> 2. Infotainment and misc. user-space.
>>
>> In this case, being able to reserve a block of memory for critical applications
>> will ensure the memory is available for them. If a critical application has to
>> restart and/or when an on-demand critical application starts, it might not be able
>> to allocate memory if it is not reserved.
>>
>> When a flat system has multiple memory blocks, with NUMA emulation in conjunction with
>> cpusets, one or more block can be reserved for critical applications configuring a set
>> of cpus and one of more memory nodes for them.
>>
>> Memory cgroups will not support such reservation. Hope this helps explain the use-case
>> I am trying to address with this patch.
> 
> OK, that is more clear. I still believe that you either have to have a
> very good control over memory allocations or a good luck to not see
> unexpected kernel allocations in your reserved memory which might easily
> break guarantees you would like to accomplish.
> 

Thanks. Right. I am with you on the possibility that the root cgroup can eat into
the reserved memory. However, with the solution I proposed, there is a guarantee
that the cpuset cgroup that is configured for non-critical Infotainment and misc.
user-space applications will not be able to allocate from the reserved memory node.

I am hoping the proposed patch will allow critical apps to reserve memory, with the
exception that the root cgroup and the kernel can still allocate from it when needed.
Perhaps the cpuset exclusive logic could be extended to look for non-exclusive memory
nodes first, if it doesn't already do that. This is in line with the current cpuset
approach, in which critical kernel allocations aren't starved in order to ensure
memory reservations.

If you don't think this solution is ideal/good, do you have other suggestions
for solving the problem? If not, would it be okay to start with what I proposed and
build on top of it as needed?

thanks,
-- Shuah
Will Deacon Sept. 11, 2018, 4:50 p.m. UTC | #13
On Tue, Sep 11, 2018 at 09:27:49AM -0600, Shuah Khan wrote:
> On 09/11/2018 03:11 AM, Michal Hocko wrote:
> > On Mon 10-09-18 20:02:05, Shuah Khan wrote:
> >> Hi Michal,
> >>
> >> On 09/10/2018 07:48 AM, Michal Hocko wrote:
> >>> On Fri 07-09-18 16:30:59, Shuah Khan wrote:
> >>>> On 09/07/2018 02:34 AM, Michal Hocko wrote:
> >>>>> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
> >> [....]
> >>>>
> >>>> In addition to isolation, being able to reserve a block instead is one of the
> >>>> issues I am looking to address. Unfortunately memory cgroups won't address that
> >>>> issue.
> >>>
> >>> Could you be more specific why you need reservations other than
> >>> isolation.
> >>>
> >>
> >> Taking automotive as a specific example, there are two classes of applications:
> >> 1. critical applications that must run
> >> 2. Infotainment and misc. user-space.
> >>
> >> In this case, being able to reserve a block of memory for critical applications
> >> will ensure the memory is available for them. If a critical application has to
> >> restart and/or when an on-demand critical application starts, it might not be able
> >> to allocate memory if it is not reserved.
> >>
> >> When a flat system has multiple memory blocks, with NUMA emulation in conjunction with
> >> cpusets, one or more block can be reserved for critical applications configuring a set
> >> of cpus and one of more memory nodes for them.
> >>
> >> Memory cgroups will not support such reservation. Hope this helps explain the use-case
> >> I am trying to address with this patch.
> > 
> > OK, that is more clear. I still believe that you either have to have a
> > very good control over memory allocations or a good luck to not see
> > unexpected kernel allocations in your reserved memory which might easily
> > break guarantees you would like to accomplish.
> > 
> 
> Thanks. Right. I am with you on the possibility that root cgroup can eat into
> the reserved memory. However, with this solution I proposed, there is a guarantee
> that the cpuset cgroup that is configured for non-critical Infotainment and misc.
> user-space application will not be able to allocate from the reserved memory node.
> 
> I am hoping the proposed patch will allow critical apps. reserving memory with the
> exception that root cgroup and kernel can still allocate from it when needed. Perhaps
> cpuset exclusive logic could be extended to look for non-exclusive memory nodes first
> if it doesn't already do that. This is inline with the current cpuset approach is that
> the critical kernel allocations aren't starved to ensure memory reservations.
> 
> If you don't think this solution isn't ideal/good, do you have other suggestions
> for solving the problem? If not would it be okay to start with what I proposed and
> build on top of as needed?

I still don't understand why this can't be achieved by faking up some NUMA
entries in the firmware table and just using the existing NUMA code that we
have.

Will
Shuah Sept. 11, 2018, 7:53 p.m. UTC | #14
On 09/11/2018 10:50 AM, Will Deacon wrote:
> On Tue, Sep 11, 2018 at 09:27:49AM -0600, Shuah Khan wrote:
>> On 09/11/2018 03:11 AM, Michal Hocko wrote:
>>> On Mon 10-09-18 20:02:05, Shuah Khan wrote:
>>>> Hi Michal,
>>>>
>>>> On 09/10/2018 07:48 AM, Michal Hocko wrote:
>>>>> On Fri 07-09-18 16:30:59, Shuah Khan wrote:
>>>>>> On 09/07/2018 02:34 AM, Michal Hocko wrote:
>>>>>>> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
>>>> [....]
>>>>>>
>>>>>> In addition to isolation, being able to reserve a block instead is one of the
>>>>>> issues I am looking to address. Unfortunately memory cgroups won't address that
>>>>>> issue.
>>>>>
>>>>> Could you be more specific why you need reservations other than
>>>>> isolation.
>>>>>
>>>>
>>>> Taking automotive as a specific example, there are two classes of applications:
>>>> 1. critical applications that must run
>>>> 2. Infotainment and misc. user-space.
>>>>
>>>> In this case, being able to reserve a block of memory for critical applications
>>>> will ensure the memory is available for them. If a critical application has to
>>>> restart and/or when an on-demand critical application starts, it might not be able
>>>> to allocate memory if it is not reserved.
>>>>
>>>> When a flat system has multiple memory blocks, with NUMA emulation in conjunction with
>>>> cpusets, one or more block can be reserved for critical applications configuring a set
>>>> of cpus and one of more memory nodes for them.
>>>>
>>>> Memory cgroups will not support such reservation. Hope this helps explain the use-case
>>>> I am trying to address with this patch.
>>>
>>> OK, that is more clear. I still believe that you either have to have a
>>> very good control over memory allocations or a good luck to not see
>>> unexpected kernel allocations in your reserved memory which might easily
>>> break guarantees you would like to accomplish.
>>>
>>
>> Thanks. Right. I am with you on the possibility that root cgroup can eat into
>> the reserved memory. However, with this solution I proposed, there is a guarantee
>> that the cpuset cgroup that is configured for non-critical Infotainment and misc.
>> user-space application will not be able to allocate from the reserved memory node.
>>
>> I am hoping the proposed patch will allow critical apps. reserving memory with the
>> exception that root cgroup and kernel can still allocate from it when needed. Perhaps
>> cpuset exclusive logic could be extended to look for non-exclusive memory nodes first
>> if it doesn't already do that. This is inline with the current cpuset approach is that
>> the critical kernel allocations aren't starved to ensure memory reservations.
>>
>> If you don't think this solution isn't ideal/good, do you have other suggestions
>> for solving the problem? If not would it be okay to start with what I proposed and
>> build on top of as needed?
> 
> I still don't understand why this can't be achieved by faking up some NUMA
> entries in the firmware table and just using the existing NUMA code that we
> have.
> 

That is what this patch is doing, in some ways. Instead of hacking the firmware
tables, it provides a command line option to split the flat machine into the specified
number of NUMA memory nodes.

In addition to the new config option and the new command line handling, I added one
init routine that handles the NUMA emulation; after that, the normal NUMA code is
leveraged.

The only change to arm64_numa_init() is the following addition:

	if (!numa_init(arm64_numa_emu_init))
		return;


arm64_numa_emu_init() does nothing unless the kernel is booted with the new
"numa=fake=N" option.

Please note that I am not adding any new NUMA code other than this one init
routine. When the command line option is specified, instead of going down the
dummy_numa_init() path, it will create NUMA emulation nodes.

I was very careful in identifying the minimal amount of code needed to add
this support. The change is limited to two existing routines:
numa_parse_early_param() and arm64_numa_init().

numa_init() is common to all the variants, including the fallback
dummy_numa_init().

This allows a cleaner way to split the memory and leverage all of the NUMA
code. It also makes it easier to debug problems, as opposed to hacking the
firmware tables.

thanks,
-- Shuah

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 29e75b47becd..6e74d9995d24 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -740,6 +740,15 @@  config NUMA
 	  local memory of the CPU and add some more
 	  NUMA awareness to the kernel.
 
+config NUMA_EMU
+	bool "NUMA emulation"
+	depends on NUMA
+	help
+	  Enable NUMA emulation. A flat machine will be split into virtual
+	  nodes when booted with "numa=fake=N", where N is the number of
+	  nodes, the system RAM will be split into N equal chunks, and
+	  assigned to each node.
+
 config NODES_SHIFT
 	int "Maximum NUMA Nodes (as a power of 2)"
 	range 1 10
diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
index 626ad01e83bf..16e8cc035872 100644
--- a/arch/arm64/include/asm/numa.h
+++ b/arch/arm64/include/asm/numa.h
@@ -29,6 +29,14 @@  static inline const struct cpumask *cpumask_of_node(int node)
 }
 #endif
 
+#ifdef CONFIG_NUMA_EMU
+void arm64_numa_emu_cmdline(char *str);
+extern int arm64_numa_emu_init(void);
+#else
+static inline void arm64_numa_emu_cmdline(char *str) {}
+static inline int arm64_numa_emu_init(void) { return -1; }
+#endif /* CONFIG_NUMA_EMU */
+
 void __init arm64_numa_init(void);
 int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..2c8634daeffa 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -8,6 +8,7 @@  obj-$(CONFIG_ARM64_PTDUMP_CORE)	+= dump.o
 obj-$(CONFIG_ARM64_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
 obj-$(CONFIG_NUMA)		+= numa.o
 obj-$(CONFIG_DEBUG_VIRTUAL)	+= physaddr.o
+obj-$(CONFIG_NUMA_EMU)		+= numa_emu.o
 KASAN_SANITIZE_physaddr.o	+= n
 
 obj-$(CONFIG_KASAN)		+= kasan_init.o
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 146c04ceaa51..9232f18e3992 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -43,6 +43,8 @@  static __init int numa_parse_early_param(char *opt)
 		return -EINVAL;
 	if (!strncmp(opt, "off", 3))
 		numa_off = true;
+	if (!strncmp(opt, "fake=", 5))
+		arm64_numa_emu_cmdline(opt + 5);
 
 	return 0;
 }
@@ -460,6 +462,8 @@  void __init arm64_numa_init(void)
 			return;
 		if (acpi_disabled && !numa_init(of_numa_init))
 			return;
+		if (!numa_init(arm64_numa_emu_init))
+			return;
 	}
 
 	numa_init(dummy_numa_init);
diff --git a/arch/arm64/mm/numa_emu.c b/arch/arm64/mm/numa_emu.c
new file mode 100644
index 000000000000..97217adb029e
--- /dev/null
+++ b/arch/arm64/mm/numa_emu.c
@@ -0,0 +1,109 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NUMA Emulation for non-NUMA platforms.
+ */
+
+#include <linux/numa.h>
+#include <linux/nodemask.h>
+#include <linux/pfn.h>
+#include <linux/bootmem.h>
+#include <linux/memblock.h>
+
+#include <asm/numa.h>
+
+static char *emu_cmdline __initdata;
+
+/*
+ * arm64_numa_emu_cmdline - parse early NUMA Emulation params.
+ */
+void __init arm64_numa_emu_cmdline(char *str)
+{
+	emu_cmdline = str;
+}
+
+/*
+ * arm64_numa_emu_init - Initialize NUMA Emulation
+ *
+ * Used when NUMA Emulation is enabled on a platform without underlying
+ * NUMA architecture.
+ */
+int __init arm64_numa_emu_init(void)
+{
+	u64 node_size;
+	int node_cnt = 0;
+	int mblk_cnt = 0;
+	int node = 0;
+	struct memblock_region *mblk;
+	bool split = false;
+	int ret;
+
+	pr_info("NUMA emulation init begin\n");
+
+	if (!emu_cmdline)
+		return -EINVAL;
+	/*
+	 * Split the system RAM into N equal chunks.
+	 */
+	ret = kstrtoint(emu_cmdline, 0, &node_cnt);
+	if (ret || node_cnt <= 0)
+		return -EINVAL;
+
+	if (node_cnt > MAX_NUMNODES)
+		node_cnt = MAX_NUMNODES;
+
+	node_size = PFN_PHYS(max_pfn) / node_cnt;
+	pr_info("NUMA emu: Node Size = %#018Lx Node = %d\n",
+		node_size, node_cnt);
+
+	for_each_memblock(memory, mblk)
+		mblk_cnt++;
+
+	/*
+	 * Size the node count to match the memory block count to avoid
+	 * splitting memory blocks across nodes. If there is only one
+	 * memory block split it.
+	 */
+	if (mblk_cnt <= node_cnt) {
+		pr_info("NUMA emu: Nodes (%d) >= Memblocks (%d)\n",
+			node_cnt, mblk_cnt);
+		if (mblk_cnt == 1) {
+			split = true;
+			pr_info("NUMA emu: Splitting single Memory Block\n");
+		} else {
+			node_cnt = mblk_cnt;
+			pr_info("NUMA emu: Adjust Nodes = Memory Blocks\n");
+		}
+	}
+
+	for_each_memblock(memory, mblk) {
+
+		if (split) {
+			for (node = 0; node < node_cnt; node++) {
+				u64 start, end;
+
+				start = mblk->base + node * node_size;
+				end = start + node_size;
+				pr_info("Adding an emulation node %d for [mem %#018Lx-%#018Lx]\n",
+					node, start, end);
+				ret = numa_add_memblk(node, start, end);
+				if (!ret)
+					continue;
+				pr_err("NUMA emulation init failed\n");
+				return ret;
+			}
+			break;
+		}
+		pr_info("Adding an emulation node %d for [mem %#018Lx-%#018Lx]\n",
+			node, mblk->base, mblk->base + mblk->size);
+		ret = numa_add_memblk(node, mblk->base,
+				      mblk->base + mblk->size);
+		if (!ret)
+			continue;
+		pr_err("NUMA emulation init failed\n");
+		return ret;
+	}
+	pr_info("NUMA: added %d emulation nodes of %#018Lx size each\n",
+		node_cnt, node_size);
+
+	return 0;
+}