
[v2,0/5] mm: memcg accounting of percpu memory

Message ID 20200608230819.832349-1-guro@fb.com (mailing list archive)

Message

Roman Gushchin June 8, 2020, 11:08 p.m. UTC
This patchset adds percpu memory accounting to memory cgroups.
It's based on the rework of the slab controller and reuses concepts
and features introduced for the per-object slab accounting.

Percpu memory is becoming more and more widely used by various
subsystems, and the total amount of memory controlled by the percpu
allocator can make up a significant part of total memory.

As an example, bpf maps can consume a lot of percpu memory,
and they are created by users. Also, some cgroup internals
(e.g. memory controller statistics) can be quite large.
On a machine with many CPUs and a large number of cgroups they
can consume hundreds of megabytes.

So the lack of memcg accounting creates a gap in memory
isolation. As with slab memory, percpu memory should be
accounted by default.

Percpu allocations are by their nature scattered over multiple pages,
so they can't be tracked on a per-page basis. Instead, the per-object
tracking introduced by the new slab controller is reused.

The patchset implements charging of percpu allocations, adds
memcg-level statistics, enables accounting for percpu allocations made
by memory cgroup internals and provides some basic tests.

To implement the accounting of percpu memory without significant
memory and performance overhead, the following approach is used:
all accounted allocations are placed into a separate percpu chunk
(or chunks). These chunks are similar to default chunks, except
that they have an attached vector of pointers to obj_cgroup objects,
big enough to hold a pointer for each allocated object.
At allocation time, if the allocation has to be accounted
(__GFP_ACCOUNT is passed, the allocating process belongs to a non-root
memory cgroup, etc.), the memory cgroup is charged, and if the
limit is not exceeded the allocation is performed from a memcg-aware
chunk. Otherwise -ENOMEM is returned or the allocation is forced over
the limit, depending on the gfp flags (as with any other kernel
memory allocation). The memory cgroup information is saved in the
obj_cgroup vector at the corresponding offset. At release time the
memcg information is restored from the vector and the cgroup is
uncharged.
Unaccounted allocations (currently the absolute majority
of all percpu allocations) are performed the old way, so no
additional overhead is expected.

To avoid pinning dying memory cgroups with outstanding allocations,
the obj_cgroup API is used instead of saving memory cgroup pointers
directly. An obj_cgroup is basically a pointer to a memory cgroup with
a standalone reference counter. The trick is that it can be atomically
swapped to point at the parent cgroup, so that the original memory
cgroup can be released before all objects charged to it.
Because all charges and statistics are fully recursive, it's perfectly
correct to uncharge the parent cgroup instead. This scheme is used
by the slab memory accounting, and percpu memory can simply follow
it.

This version is based on top of v6 of the new slab controller
patchset. The following patches are actually required by this series:
  mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
  mm: memcg: prepare for byte-sized vmstat items
  mm: memcg: convert vmstat slab counters to bytes
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcontrol: decouple reference counting from page accounting
  mm: memcg/slab: obj_cgroup API

The whole series can be found here:
https://github.com/rgushchin/linux/pull/new/percpu_acc.1

v2:
  1) minor cosmetic fixes (Dennis)
  2) rebased on top of v6 of the slab controller patchset

v1:
  1) fixed a bug with gfp flags handling (Dennis)
  2) added some comments (Tejun and Dennis)
  3) rebased on top of v5 of the slab controller patchset

RFC:
  https://lore.kernel.org/linux-mm/20200519201806.2308480-1-guro@fb.com/T/#t


Roman Gushchin (5):
  percpu: return number of released bytes from pcpu_free_area()
  mm: memcg/percpu: account percpu memory to memory cgroups
  mm: memcg/percpu: per-memcg percpu memory statistics
  mm: memcg: charge memcg percpu memory to the parent cgroup
  kselftests: cgroup: add percpu memory accounting test

 Documentation/admin-guide/cgroup-v2.rst    |   4 +
 include/linux/memcontrol.h                 |   8 +
 mm/memcontrol.c                            |  18 +-
 mm/percpu-internal.h                       |  55 +++++-
 mm/percpu-km.c                             |   5 +-
 mm/percpu-stats.c                          |  36 ++--
 mm/percpu-vm.c                             |   5 +-
 mm/percpu.c                                | 206 ++++++++++++++++++---
 tools/testing/selftests/cgroup/test_kmem.c |  70 ++++++-
 9 files changed, 358 insertions(+), 49 deletions(-)

Comments

Roman Gushchin June 16, 2020, 9:19 p.m. UTC | #1
On Mon, Jun 08, 2020 at 04:08:14PM -0700, Roman Gushchin wrote:
> This patchset adds percpu memory accounting to memory cgroups.
> It's based on the rework of the slab controller and reuses concepts
> and features introduced for the per-object slab accounting.
> 
> Percpu memory is becoming more and more widely used by various
> subsystems, and the total amount of memory controlled by the percpu
> allocator can make up a significant part of total memory.
> 
> As an example, bpf maps can consume a lot of percpu memory,
> and they are created by users. Also, some cgroup internals
> (e.g. memory controller statistics) can be quite large.
> On a machine with many CPUs and a large number of cgroups they
> can consume hundreds of megabytes.
> 
> So the lack of memcg accounting creates a gap in memory
> isolation. As with slab memory, percpu memory should be
> accounted by default.
> 
> Percpu allocations are by their nature scattered over multiple pages,
> so they can't be tracked on a per-page basis. Instead, the per-object
> tracking introduced by the new slab controller is reused.
> 
> The patchset implements charging of percpu allocations, adds
> memcg-level statistics, enables accounting for percpu allocations made
> by memory cgroup internals and provides some basic tests.
> 
> To implement the accounting of percpu memory without significant
> memory and performance overhead, the following approach is used:
> all accounted allocations are placed into a separate percpu chunk
> (or chunks). These chunks are similar to default chunks, except
> that they have an attached vector of pointers to obj_cgroup objects,
> big enough to hold a pointer for each allocated object.
> At allocation time, if the allocation has to be accounted
> (__GFP_ACCOUNT is passed, the allocating process belongs to a non-root
> memory cgroup, etc.), the memory cgroup is charged, and if the
> limit is not exceeded the allocation is performed from a memcg-aware
> chunk. Otherwise -ENOMEM is returned or the allocation is forced over
> the limit, depending on the gfp flags (as with any other kernel
> memory allocation). The memory cgroup information is saved in the
> obj_cgroup vector at the corresponding offset. At release time the
> memcg information is restored from the vector and the cgroup is
> uncharged.
> Unaccounted allocations (currently the absolute majority
> of all percpu allocations) are performed the old way, so no
> additional overhead is expected.
> 
> To avoid pinning dying memory cgroups with outstanding allocations,
> the obj_cgroup API is used instead of saving memory cgroup pointers
> directly. An obj_cgroup is basically a pointer to a memory cgroup with
> a standalone reference counter. The trick is that it can be atomically
> swapped to point at the parent cgroup, so that the original memory
> cgroup can be released before all objects charged to it.
> Because all charges and statistics are fully recursive, it's perfectly
> correct to uncharge the parent cgroup instead. This scheme is used
> by the slab memory accounting, and percpu memory can simply follow
> it.
> 
> This version is based on top of v6 of the new slab controller
> patchset. The following patches are actually required by this series:
>   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
>   mm: memcg: prepare for byte-sized vmstat items
>   mm: memcg: convert vmstat slab counters to bytes
>   mm: slub: implement SLUB version of obj_to_index()
>   mm: memcontrol: decouple reference counting from page accounting
>   mm: memcg/slab: obj_cgroup API

Hello, Andrew!

How should this patchset be routed: through the mm or percpu tree?

It has been acked by Dennis (the percpu maintainer), but it does depend
on the first several patches from the slab controller rework patchset.

The slab controller rework is ready to be merged: as of v6, most patches
in the series have been acked by Johannes and/or Vlastimil, and no
questions or concerns were raised after v6.

Please let me know if you want me to resend both patchsets.

Thank you!

Roman
Andrew Morton June 17, 2020, 8:39 p.m. UTC | #2
On Tue, 16 Jun 2020 14:19:01 -0700 Roman Gushchin <guro@fb.com> wrote:

> > 
> > This version is based on top of v6 of the new slab controller
> > patchset. The following patches are actually required by this series:
> >   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
> >   mm: memcg: prepare for byte-sized vmstat items
> >   mm: memcg: convert vmstat slab counters to bytes
> >   mm: slub: implement SLUB version of obj_to_index()
> >   mm: memcontrol: decouple reference counting from page accounting
> >   mm: memcg/slab: obj_cgroup API
> 
> Hello, Andrew!
> 
> How should this patchset be routed: through the mm or percpu tree?
> 
> It has been acked by Dennis (the percpu maintainer), but it does depend
> on the first several patches from the slab controller rework patchset.

I can grab both.

> The slab controller rework is ready to be merged: as of v6, most patches
> in the series have been acked by Johannes and/or Vlastimil, and no
> questions or concerns were raised after v6.
> 
> Please let me know if you want me to resend both patchsets.

There was quite a bit of valuable discussion in response to [0/n] which
really should have been in the changelog[s] from day one:
slab-vs-slub, performance testing, etc.

So, umm, I'll take a look at both series now but I do think an enhanced
[0/n] description is warranted?
Roman Gushchin June 17, 2020, 8:47 p.m. UTC | #3
On Wed, Jun 17, 2020 at 01:39:49PM -0700, Andrew Morton wrote:
> On Tue, 16 Jun 2020 14:19:01 -0700 Roman Gushchin <guro@fb.com> wrote:
> 
> > > 
> > > This version is based on top of v6 of the new slab controller
> > > patchset. The following patches are actually required by this series:
> > >   mm: memcg: factor out memcg- and lruvec-level changes out of __mod_lruvec_state()
> > >   mm: memcg: prepare for byte-sized vmstat items
> > >   mm: memcg: convert vmstat slab counters to bytes
> > >   mm: slub: implement SLUB version of obj_to_index()
> > >   mm: memcontrol: decouple reference counting from page accounting
> > >   mm: memcg/slab: obj_cgroup API
> > 
> > Hello, Andrew!
> > 
> > How should this patchset be routed: through the mm or percpu tree?
> > 
> > It has been acked by Dennis (the percpu maintainer), but it does depend
> > on the first several patches from the slab controller rework patchset.
> 
> I can grab both.

Perfect, thanks!

> 
> > The slab controller rework is ready to be merged: as of v6, most patches
> > in the series have been acked by Johannes and/or Vlastimil, and no
> > questions or concerns were raised after v6.
> > 
> > Please let me know if you want me to resend both patchsets.
> 
> There was quite a bit of valuable discussion in response to [0/n] which
> really should have been in the changelog[s] from day one:
> slab-vs-slub, performance testing, etc.
> 
> So, umm, I'll take a look at both series now but I do think an enhanced
> [0/n] description is warranted?
> 

Yes, I'm running suggested tests right now, and will update on the results.

Thanks!