[RFC,v0,0/3] CPU hotplug awareness in percpu allocator


Message ID

20210601065147.53735-1-bharata@linux.ibm.com (mailing list archive)

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E35A661159
From: Bharata B Rao <bharata@linux.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, aneesh.kumar@linux.ibm.com, dennis@kernel.org,
        tj@kernel.org, cl@linux.com, akpm@linux-foundation.org,
        amakhalov@vmware.com, guro@fb.com, vbabka@suse.cz,
        srikar@linux.vnet.ibm.com, psampat@linux.ibm.com,
        ego@linux.vnet.ibm.com, Bharata B Rao <bharata@linux.ibm.com>
Subject: [RFC PATCH v0 0/3] CPU hotplug awareness in percpu allocator
Date: Tue,  1 Jun 2021 12:21:44 +0530
Message-Id: <20210601065147.53735-1-bharata@linux.ibm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

CPU hotplug awareness in percpu allocator

Message

Bharata B Rao June 1, 2021, 6:51 a.m. UTC

Hi,

This is an attempt to make the percpu allocator CPU hotplug aware.
Currently the percpu allocator allocates memory for all possible
CPUs. This wastes memory when the number of possible CPUs is
significantly higher than the number of online CPUs. The waste can
be avoided if the percpu allocator allocates only for the online
CPUs and extends the allocation to other CPUs as they come online.
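
As a rough user-space illustration of the idea above (not code from the
patchset), the sketch below models a per-CPU object that is backed by
memory only for online CPUs and is extended when another CPU comes
online. All names (toy_percpu, toy_alloc, toy_cpu_online, toy_free,
toy_backed_bytes, TOY_NR_POSSIBLE) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_POSSIBLE 1024	/* hypothetical possible-CPU count */

/* Toy per-CPU object: a slot per possible CPU, memory only for online ones. */
struct toy_percpu {
	size_t unit_size;
	void *slot[TOY_NR_POSSIBLE];
};

/* Release every backed slot. */
void toy_free(struct toy_percpu *p)
{
	for (int cpu = 0; cpu < TOY_NR_POSSIBLE; cpu++) {
		free(p->slot[cpu]);
		p->slot[cpu] = NULL;
	}
}

/* Extend the allocation to a CPU that is coming online. Returns 0 or -1. */
int toy_cpu_online(struct toy_percpu *p, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_POSSIBLE);
	if (p->slot[cpu])
		return 0;	/* already backed */
	p->slot[cpu] = calloc(1, p->unit_size);
	return p->slot[cpu] ? 0 : -1;
}

/* Allocate zeroed memory only for the CPUs marked online. Returns 0 or -1. */
int toy_alloc(struct toy_percpu *p, size_t unit_size, const bool *online)
{
	assert(unit_size > 0);
	p->unit_size = unit_size;
	for (int cpu = 0; cpu < TOY_NR_POSSIBLE; cpu++)
		p->slot[cpu] = NULL;
	for (int cpu = 0; cpu < TOY_NR_POSSIBLE; cpu++) {
		if (online[cpu] && toy_cpu_online(p, cpu) != 0) {
			toy_free(p);	/* undo the partial allocation */
			return -1;
		}
	}
	return 0;
}

/* Bytes currently backed by memory. */
size_t toy_backed_bytes(const struct toy_percpu *p)
{
	size_t n = 0;

	for (int cpu = 0; cpu < TOY_NR_POSSIBLE; cpu++)
		if (p->slot[cpu])
			n += p->unit_size;
	return n;
}
```

With 16 CPUs marked online, toy_alloc() backs 16 units rather than
1024, and each later toy_cpu_online() call adds one more unit.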

This early RFC work shows some good memory savings for a powerpc
KVM guest that is booted with 16 online and 1024 possible CPUs.
Here is the comparison of Percpu memory consumption from
/proc/meminfo before and after creating 1000 memcgs.

			W/o patch		W/ patch
Before			1441792 kB		22528 kB
After 1000 memcgs	4390912 kB		68608 kB

Note that the Percpu reporting in meminfo has been changed in
the patchset to reflect the allocation for online CPUs only.
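
As a quick sanity check on the table above: both rows shrink by the
same whole factor, which matches the ratio of possible to online CPUs
(1024 / 16 = 64). The helper name below is hypothetical and the numbers
are taken from the table.

```c
#include <assert.h>

/* Whole-number factor by which Percpu usage drops with the patch. */
unsigned long percpu_savings_factor(unsigned long without_kb,
				    unsigned long with_kb)
{
	assert(with_kb > 0);
	return without_kb / with_kb;
}
```

For example, percpu_savings_factor(1441792, 22528) and
percpu_savings_factor(4390912, 68608) both evaluate to 64, and both
divisions are exact.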

More details about the approach are present in the patch
descriptions.

Bharata B Rao (3):
  percpu: CPU hotplug support for alloc_percpu()
  percpu: Limit percpu allocator to online cpus
  percpu: Avoid using percpu ptrs of non-existing cpus

 fs/namespace.c             |   4 +-
 include/linux/cpuhotplug.h |   2 +
 include/linux/percpu.h     |  15 +++
 kernel/cgroup/rstat.c      |  20 +++-
 kernel/sched/cpuacct.c     |  10 +-
 kernel/sched/psi.c         |  14 ++-
 lib/percpu-refcount.c      |   4 +-
 lib/percpu_counter.c       |   2 +-
 mm/percpu-internal.h       |   9 ++
 mm/percpu-vm.c             | 211 +++++++++++++++++++++++++++++++++-
 mm/percpu.c                | 229 +++++++++++++++++++++++++++++++++++--
 net/ipv4/fib_semantics.c   |   2 +-
 net/ipv6/route.c           |   6 +-
 13 files changed, 490 insertions(+), 38 deletions(-)

Comments

Dennis Zhou June 2, 2021, 3:01 p.m. UTC | #1

Hello,

On Tue, Jun 01, 2021 at 12:21:44PM +0530, Bharata B Rao wrote:
> Hi,
> 
> This is an attempt to make the percpu allocator CPU hotplug aware.
> Currently the percpu allocator allocates memory for all possible
> CPUs. This wastes memory when the number of possible CPUs is
> significantly higher than the number of online CPUs. The waste can
> be avoided if the percpu allocator allocates only for the online
> CPUs and extends the allocation to other CPUs as they come online.
> 
> This early RFC work shows some good memory savings for a powerpc
> KVM guest that is booted with 16 online and 1024 possible CPUs.
> Here is the comparison of Percpu memory consumption from
> /proc/meminfo before and after creating 1000 memcgs.
> 
> 			W/o patch		W/ patch
> Before			1441792 kB		22528 kB
> After 1000 memcgs	4390912 kB		68608 kB
> 
> Note that the Percpu reporting in meminfo has been changed in
> the patchset to reflect the allocation for online CPUs only.
> 
> More details about the approach are present in the patch
> descriptions.
> 
> Bharata B Rao (3):
>   percpu: CPU hotplug support for alloc_percpu()
>   percpu: Limit percpu allocator to online cpus
>   percpu: Avoid using percpu ptrs of non-existing cpus
> 
>  fs/namespace.c             |   4 +-
>  include/linux/cpuhotplug.h |   2 +
>  include/linux/percpu.h     |  15 +++
>  kernel/cgroup/rstat.c      |  20 +++-
>  kernel/sched/cpuacct.c     |  10 +-
>  kernel/sched/psi.c         |  14 ++-
>  lib/percpu-refcount.c      |   4 +-
>  lib/percpu_counter.c       |   2 +-
>  mm/percpu-internal.h       |   9 ++
>  mm/percpu-vm.c             | 211 +++++++++++++++++++++++++++++++++-
>  mm/percpu.c                | 229 +++++++++++++++++++++++++++++++++++--
>  net/ipv4/fib_semantics.c   |   2 +-
>  net/ipv6/route.c           |   6 +-
>  13 files changed, 490 insertions(+), 38 deletions(-)
> 
> -- 
> 2.31.1
> 

I have thought about this for a day now and to be honest my thoughts
haven't really changed since the last discussion in [1].

I struggle here for a few reasons:
1. We're intertwining cpu and memory for hotplug.
  - What does it mean if we don't have enough memory?
  - How hard do we try to reclaim memory?
  - Partially allocated cpus? Do we free it all and try again?
2. We're now blocking the whole system on the percpu mutex which can
   cause terrible side effects. If there is a large amount of percpu
   memory already in use, this means we've accumulated a substantial
   number of callbacks.
3. While I did mention a callback approach would work, I'm not thrilled
   by the additional complexity of it as it can be error prone.

Beyond the above, I still don't believe it's the most well motivated
problem. I struggle to see a world where it makes sense to let someone
scale from 16 cpus to 1024. In my mind you would also need to scale
memory to some degree too (not necessarily linearly, but a 1024 core
machine with, say, 16 gigs of ram would be pretty funny).

Would it be that bad to use cold migration points and eat a little bit
of overhead for what I understand to be a relatively uncommon use case?

[1] https://lore.kernel.org/linux-mm/8E7F3D98-CB68-4418-8E0E-7287E8273DA9@vmware.com/

Thanks,
Dennis

Bharata B Rao June 4, 2021, 5:01 a.m. UTC | #2

On Wed, Jun 02, 2021 at 03:01:04PM +0000, Dennis Zhou wrote:
> Hello,
> 
> On Tue, Jun 01, 2021 at 12:21:44PM +0530, Bharata B Rao wrote:
> > Hi,
> > 
> > This is an attempt to make the percpu allocator CPU hotplug aware.
> > Currently the percpu allocator allocates memory for all possible
> > CPUs. This wastes memory when the number of possible CPUs is
> > significantly higher than the number of online CPUs. The waste can
> > be avoided if the percpu allocator allocates only for the online
> > CPUs and extends the allocation to other CPUs as they come online.
> > 
> > This early RFC work shows some good memory savings for a powerpc
> > KVM guest that is booted with 16 online and 1024 possible CPUs.
> > Here is the comparison of Percpu memory consumption from
> > /proc/meminfo before and after creating 1000 memcgs.
> > 
> > 			W/o patch		W/ patch
> > Before			1441792 kB		22528 kB
> > After 1000 memcgs	4390912 kB		68608 kB
> > 
> 
> I have thought about this for a day now and to be honest my thoughts
> haven't really changed since the last discussion in [1].
> 
> I struggle here for a few reasons:
> 1. We're intertwining cpu and memory for hotplug.
>   - What does it mean if we don't have enough memory?

That means CPU hotplug will fail, but...

>   - How hard do we try to reclaim memory?
>   - Partially allocated cpus? Do we free it all and try again?

... yes, these are some difficult questions. We should check if
rollback can be done cleanly and efficiently. You can see that
I am registering separate hotplug callbacks for the hotplug core
and for init routines of alloc_percpu() callers. Rolling back the
former should be fairly straightforward, but we have to see how
desirable and feasible it is to undo the entire CPU hotplug when
one of the alloc_percpu() callbacks fails, especially if there are
hundreds of registered alloc_percpu() callbacks.
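
To make the rollback question concrete, here is a minimal user-space
sketch, assuming each alloc_percpu() caller supplied an init hook and a
matching undo hook. This is not the interface from the patches; the
struct and function names are hypothetical. If one init hook fails, the
hooks that already succeeded are undone in reverse order and the error
is returned, which would then fail the CPU online operation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-caller hotplug hooks; not the patchset's interface. */
struct toy_cb {
	int  (*init)(unsigned int cpu, void *data);	/* 0 on success */
	void (*undo)(unsigned int cpu, void *data);
	void *data;
};

/*
 * Run the init hooks in order. On the first failure, undo the hooks
 * that already succeeded, in reverse order, and return the error.
 */
int toy_run_callbacks(struct toy_cb *cbs, size_t n, unsigned int cpu)
{
	for (size_t i = 0; i < n; i++) {
		int err = cbs[i].init(cpu, cbs[i].data);

		if (err) {
			while (i-- > 0)
				cbs[i].undo(cpu, cbs[i].data);
			return err;
		}
	}
	return 0;
}

/* Example hooks: count live initialisations, or always fail. */
int toy_ok_init(unsigned int cpu, void *data)
{
	(void)cpu;
	++*(int *)data;
	return 0;
}

int toy_fail_init(unsigned int cpu, void *data)
{
	(void)cpu;
	(void)data;
	return -ENOMEM;
}

void toy_undo(unsigned int cpu, void *data)
{
	(void)cpu;
	--*(int *)data;
}
```

The cost of such a rollback grows with the number of hooks that ran
before the failure, which is the concern with hundreds of registered
callbacks.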

> 2. We're now blocking the whole system on the percpu mutex which can
>    cause terrible side effects. If there is a large amount of percpu
>    memory already in use, this means we've accumulated a substantial
>    number of callbacks.

I am yet to look at each caller in detail and see which of them
really need init/free callbacks and which can do without them. After
this we will have to measure the overhead all this puts on the
hotplug path. Given that hotplug is a slow path, I wonder if some
overhead is tolerable here.

CPU hotplug already happens with cpu_hotplug_lock held, so when you
mention that this callback holding the percpu mutex can have terrible
effects, are you specifically worried about blocking all the
percpu allocation requests during hotplug? Or is it something else?

> 3. While I did mention a callback approach would work, I'm not thrilled
>    by the additional complexity of it as it can be error prone.

Fair enough. The callback for the percpu allocator core seems fine
to me, but since I haven't yet looked at all callers in detail, I
don't know if we would run into issues/dependencies in any
specific callback handlers that increase the overall complexity.

Other than the callbacks, I am also a bit worried about the complexity
and the overhead involved in memcg charging and uncharging at CPU
hotplug time. In my environment (powerpc kvm guest), I see that each
chunk can have a maximum of 180224 obj_cgroups. Checking for the
valid/used ones among those, determining the allocation size and
charging/uncharging to the right memcg could be an expensive task.

> 
> Beyond the above, I still don't believe it's the most well motivated
> problem. I struggle to see a world where it makes sense to let someone
> scale from 16 cpus to 1024. In my mind you would also need to scale
> memory to some degree too (not necessarily linearly, but a 1024 core
> machine with, say, 16 gigs of ram would be pretty funny).

Well, the platform here provides the capability of scaling, and the
motivation is to avoid consuming memory for not-present CPUs until
that scaling happens. But as you note, it definitely is a question of
whether any real application is making use of this scaling now
and of the associated complexity.

Even if we consider the scaling from 16 to 1024 CPUs as unrealistic
for now, the usecase and the numbers from the production scenario that
Alexey mentioned in [1] (2 to 128 CPUs) are certainly a good motivator?

Alexey - You did mention creating a huge number of memcgs and
observing VMs consuming 16-20 GB percpu memory in production. So
how many memcgs are we talking about here?

> 
> Would it be that bad to use cold migration points and eat a little bit
> of overhead for what I understand to be a relatively uncommon use case?

[1] https://lore.kernel.org/linux-mm/8E7F3D98-CB68-4418-8E0E-7287E8273DA9@vmware.com/

Regards,
Bharata.