
[00/16] The new slab memory controller

Message ID 20191018002820.307763-1-guro@fb.com (mailing list archive)

Message

Roman Gushchin Oct. 18, 2019, 12:28 a.m. UTC
The existing slab memory controller is based on the idea of replicating
slab allocator internals for each memory cgroup. This approach promises
a low memory overhead (one pointer per page) and doesn't add too much
code on hot allocation and release paths. But it has a very serious flaw:
it leads to low slab utilization.

Using a drgn* script I've estimated slab utilization on a number of
machines running different production workloads. In most cases it was
between 45% and 65%, and the best number I've seen was around 85%.
Turning kmem accounting off brings it into the high 90s and reclaims
30-50% of slab memory. This means the real price of the existing slab
memory controller is far bigger than a pointer per page.

The real reason the existing design leads to low slab utilization is
simple: slab pages are used exclusively by one memory cgroup. If a
cgroup makes only a few allocations of a certain size, or some active
objects (e.g. dentries) are left behind after the cgroup is deleted, or
the cgroup contains a single-threaded application which barely allocates
any kernel objects, but does so each time on a new CPU: in all these
cases the resulting slab utilization is very low. If kmem accounting is
off, the kernel can use the free space on slab pages for other
allocations.

Arguably this wasn't an issue back in the days when the kmem controller
was introduced as an opt-in feature which had to be turned on
individually for each memory cgroup. But now it's on by default on both
cgroup v1 and v2, and modern systemd-based systems tend to create a
large number of cgroups.

This patchset provides a new implementation of the slab memory controller,
which aims to reach much better slab utilization by sharing slab pages
between multiple memory cgroups. Below is a short description of the new
design (more details in the commit messages).

Accounting is performed per-object instead of per-page. Slab-related
vmstat counters are converted to bytes. Charging is performed on a
per-page basis, rounding up and remembering leftovers.

Memcg ownership data is stored in a per-slab-page vector: for each slab
page a vector of corresponding size is allocated. To keep slab memory
reparenting working, an intermediate object is used instead of a direct
pointer to the memory cgroup. It's simply a pointer to a memcg (which
can easily be switched to the parent) with a built-in reference counter.
This scheme allows reparenting all allocated objects without walking
them and updating each memcg pointer.

Instead of creating an individual set of kmem_caches for each memory
cgroup, two global sets are used: the root set for non-accounted and
root-cgroup allocations, and a second set for all other allocations.
This simplifies the lifetime management of individual kmem_caches: they
are destroyed together with their root counterparts. It also removes a
good amount of code and makes things generally simpler.

The patchset contains a couple of semi-independent parts which could
find use outside of the slab memory controller too:
1) a subpage charging API, which can be used in the future for
   accounting other non-page-sized objects, e.g. percpu allocations;
2) a mem_cgroup_ptr API (refcounted pointers to a memcg), which can be
   reused for the efficient reparenting of other objects, e.g. pagecache.

The patchset has been tested on a number of different workloads in our
production. In all cases it saved hefty amounts of memory:
1) web frontend, 650-700 MB, ~42% of slab memory
2) database cache, 750-800 MB, ~35% of slab memory
3) dns server, 700 MB, ~36% of slab memory

(These numbers were obtained using a backport of this patchset to the
kernel version used in fb production, but similar numbers can be
obtained on a vanilla kernel. On a modern systemd-based distribution,
e.g. Fedora 30, the patched kernel shows the same order of slab memory
savings right after system start.)

So far I haven't found any regressions on the tested workloads, but a
potential CPU regression caused by the more precise accounting is a concern.

Obviously the amount of saved memory depends on the number of memory
cgroups, uptime and the specific workloads, but overall the new
controller seems to save 30-40% of slab memory, sometimes more.
Additionally, it should lead to lower memory fragmentation, simply
because of the smaller number of non-movable pages, and also because
there is no longer a need to move all slab objects to a new set of
pages when a workload is restarted in a new memory cgroup.

* https://github.com/osandov/drgn


v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Roman Gushchin (16):
  mm: memcg: introduce mem_cgroup_ptr
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: allocate space for memcg ownership data for non-root
    slabs
  mm: slub: implement SLUB version of obj_to_index()
  mm: memcg/slab: save memcg ownership data for non-root slab objects
  mm: memcg: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg: introduce __mod_lruvec_memcg_state()
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg: move get_mem_cgroup_from_current() to memcontrol.h
  mm: memcg/slab: replace memcg_from_slab_page() with
    memcg_from_slab_obj()
  tools/cgroup: add slabinfo.py tool
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: use one set of kmem_caches for all memory cgroups
  tools/cgroup: make slabinfo.py compatible with new slab controller
  mm: slab: remove redundant check in memcg_accumulate_slabinfo()

 drivers/base/node.c        |  14 +-
 fs/proc/meminfo.c          |   4 +-
 include/linux/memcontrol.h |  98 +++++++-
 include/linux/mm_types.h   |   5 +-
 include/linux/mmzone.h     |  12 +-
 include/linux/slab.h       |   3 +-
 include/linux/slub_def.h   |   9 +
 include/linux/vmstat.h     |   8 +
 kernel/power/snapshot.c    |   2 +-
 mm/list_lru.c              |  12 +-
 mm/memcontrol.c            | 302 ++++++++++++-------------
 mm/oom_kill.c              |   2 +-
 mm/page_alloc.c            |   8 +-
 mm/slab.c                  |  37 ++-
 mm/slab.h                  | 300 ++++++++++++------------
 mm/slab_common.c           | 452 ++++---------------------------------
 mm/slob.c                  |  12 +-
 mm/slub.c                  |  63 ++----
 mm/vmscan.c                |   3 +-
 mm/vmstat.c                |  37 ++-
 mm/workingset.c            |   6 +-
 tools/cgroup/slabinfo.py   | 250 ++++++++++++++++++++
 22 files changed, 816 insertions(+), 823 deletions(-)
 create mode 100755 tools/cgroup/slabinfo.py

Comments

Waiman Long Oct. 18, 2019, 5:03 p.m. UTC | #1
On 10/17/19 8:28 PM, Roman Gushchin wrote:
> The existing slab memory controller is based on the idea of replicating
> slab allocator internals for each memory cgroup. This approach promises
> a low memory overhead (one pointer per page), and isn't adding too much
> code on hot allocation and release paths. But is has a very serious flaw:
                                               ^it^
> it leads to a low slab utilization.
>
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.
>
> The real reason why the existing design leads to a low slab utilization
> is simple: slab pages are used exclusively by one memory cgroup.
> If there are only few allocations of certain size made by a cgroup,
> or if some active objects (e.g. dentries) are left after the cgroup is
> deleted, or the cgroup contains a single-threaded application which is
> barely allocating any kernel objects, but does it every time on a new CPU:
> in all these cases the resulting slab utilization is very low.
> If kmem accounting is off, the kernel is able to use free space
> on slab pages for other allocations.

In the case of the slub memory allocator, it is not just unused space
within a slab. The per-cpu slabs can also hold up a lot of memory,
especially if the tasks jump around to different cpus. The problem is
compounded if a lot of memcgs are being used. Memory utilization can
improve quite significantly if per-cpu slabs are disabled. Of course, it
comes with a performance cost.

Cheers,
Longman
Roman Gushchin Oct. 18, 2019, 5:12 p.m. UTC | #2
On Fri, Oct 18, 2019 at 01:03:54PM -0400, Waiman Long wrote:
> On 10/17/19 8:28 PM, Roman Gushchin wrote:
> > The existing slab memory controller is based on the idea of replicating
> > slab allocator internals for each memory cgroup. This approach promises
> > a low memory overhead (one pointer per page), and isn't adding too much
> > code on hot allocation and release paths. But is has a very serious flaw:
>                                                ^it^
> > it leads to a low slab utilization.
> >
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> >
> > The real reason why the existing design leads to a low slab utilization
> > is simple: slab pages are used exclusively by one memory cgroup.
> > If there are only few allocations of certain size made by a cgroup,
> > or if some active objects (e.g. dentries) are left after the cgroup is
> > deleted, or the cgroup contains a single-threaded application which is
> > barely allocating any kernel objects, but does it every time on a new CPU:
> > in all these cases the resulting slab utilization is very low.
> > If kmem accounting is off, the kernel is able to use free space
> > on slab pages for other allocations.
> 
> In the case of slub memory allocator, it is not just unused space within
> a slab. It is also the use of per-cpu slabs that can hold up a lot of
> memory, especially if the tasks jump around to different cpus. The
> problem is compounded if a lot of memcgs are being used. Memory
> utilization can improve quite significantly if per-cpu slabs are
> disabled. Of course, it comes with a performance cost.

Right, but it's basically the same problem: if slabs can be used
exclusively by a single memory cgroup, slab utilization is low. Per-cpu
slabs just make the problem worse by increasing the number of mostly
empty slabs proportionally to the number of CPUs.

With memory cgroup accounting disabled, slab utilization is quite high
even with per-cpu slabs. So the problem isn't per-cpu slabs by
themselves; they just were not designed to exist in so many copies.

Thanks!
Michal Hocko Oct. 22, 2019, 1:22 p.m. UTC | #3
On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
[...]
> Using a drgn* script I've got an estimation of slab utilization on
> a number of machines running different production workloads. In most
> cases it was between 45% and 65%, and the best number I've seen was
> around 85%. Turning kmem accounting off brings it to high 90s. Also
> it brings back 30-50% of slab memory. It means that the real price
> of the existing slab memory controller is way bigger than a pointer
> per page.

How much of the memory are we talking about here? Also is there any
pattern for specific caches that tend to utilize much worse than others?
Michal Hocko Oct. 22, 2019, 1:28 p.m. UTC | #4
On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> [...]
> > Using a drgn* script I've got an estimation of slab utilization on
> > a number of machines running different production workloads. In most
> > cases it was between 45% and 65%, and the best number I've seen was
> > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > it brings back 30-50% of slab memory. It means that the real price
> > of the existing slab memory controller is way bigger than a pointer
> > per page.
> 
> How much of the memory are we talking about here?

Just to be more specific: your cover mentions several hundreds of MBs,
but there is no scale against the overall charged memory. How much of
that is actual kmem accounted memory?

> Also is there any pattern for specific caches that tend to utilize
> much worse than others?
Michal Hocko Oct. 22, 2019, 1:31 p.m. UTC | #5
On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> This patchset provides a new implementation of the slab memory controller,
> which aims to reach a much better slab utilization by sharing slab pages
> between multiple memory cgroups. Below is the short description of the new
> design (more details in commit messages).
> 
> Accounting is performed per-object instead of per-page. Slab-related
> vmstat counters are converted to bytes. Charging is performed on page-basis,
> with rounding up and remembering leftovers.
> 
> Memcg ownership data is stored in a per-slab-page vector: for each slab page
> a vector of corresponding size is allocated. To keep slab memory reparenting
> working, instead of saving a pointer to the memory cgroup directly an
> intermediate object is used. It's simply a pointer to a memcg (which can be
> easily changed to the parent) with a built-in reference counter. This scheme
> allows to reparent all allocated objects without walking them over and changing
> memcg pointer to the parent.
> 
> Instead of creating an individual set of kmem_caches for each memory cgroup,
> two global sets are used: the root set for non-accounted and root-cgroup
> allocations and the second set for all other allocations. This allows to
> simplify the lifetime management of individual kmem_caches: they are destroyed
> with root counterparts. It allows to remove a good amount of code and make
> things generally simpler.

What is the performance impact? Also what is the effect on the memory
reclaim side and the isolation. I would expect that mixing objects from
different cgroups would have a negative/unpredictable impact on the
memcg slab shrinking.
Roman Gushchin Oct. 22, 2019, 3:48 p.m. UTC | #6
On Tue, Oct 22, 2019 at 03:28:00PM +0200, Michal Hocko wrote:
> On Tue 22-10-19 15:22:06, Michal Hocko wrote:
> > On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > [...]
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > 
> > How much of the memory are we talking about here?
> 
> Just to be more specific. Your cover mentions several hundreds of MBs
> but there is no scale to the overal charged memory. How much of that is
> the actual kmem accounted memory.

As I wrote, on average it saves 30-45% of slab memory.
The smallest number I've seen was about 15%, the largest over 60%.

The amount of slab memory isn't a very stable metric in general: it
heavily depends on the workload pattern, memory pressure, uptime etc.
In absolute numbers I've seen savings from ~60 MB for an empty vm to
more than 2 GB for some production workloads.

Btw, please note that after a recent change from Vlastimil,
6a486c0ad4dc ("mm, sl[ou]b: improve memory accounting"),
slab counters include large allocations which are passed directly to
the page allocator. This will make the memory savings smaller in
percentage terms, but of course not in absolute numbers.

> 
> > Also is there any pattern for specific caches that tend to utilize
> > much worse than others?

Caches which usually have many objects (e.g. inodes) initially have
better utilization, but as some of the objects get reclaimed the
utilization drops. And if the cgroup is already dead, no one can reuse
these mostly empty slab pages, so it's pretty wasteful.

So I don't think the problem is specific to any cache; it's pretty general.
Roman Gushchin Oct. 22, 2019, 3:59 p.m. UTC | #7
On Tue, Oct 22, 2019 at 03:31:48PM +0200, Michal Hocko wrote:
> On Thu 17-10-19 17:28:04, Roman Gushchin wrote:
> > This patchset provides a new implementation of the slab memory controller,
> > which aims to reach a much better slab utilization by sharing slab pages
> > between multiple memory cgroups. Below is the short description of the new
> > design (more details in commit messages).
> > 
> > Accounting is performed per-object instead of per-page. Slab-related
> > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > with rounding up and remembering leftovers.
> > 
> > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > a vector of corresponding size is allocated. To keep slab memory reparenting
> > working, instead of saving a pointer to the memory cgroup directly an
> > intermediate object is used. It's simply a pointer to a memcg (which can be
> > easily changed to the parent) with a built-in reference counter. This scheme
> > allows to reparent all allocated objects without walking them over and changing
> > memcg pointer to the parent.
> > 
> > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > two global sets are used: the root set for non-accounted and root-cgroup
> > allocations and the second set for all other allocations. This allows to
> > simplify the lifetime management of individual kmem_caches: they are destroyed
> > with root counterparts. It allows to remove a good amount of code and make
> > things generally simpler.
> 
> What is the performance impact?

As I wrote, so far we haven't found any regression on any real-world
workload. Of course, it's pretty easy to come up with a synthetic test
which will show some performance hit: e.g. allocating and freeing a
large number of objects from a single cache in a single cgroup. The
reason is simple: stats and accounting are more precise, so they require
more work. But I don't think it's a real problem.

On the other hand I expect to see some positive effects from the significantly
reduced number of unmovable pages: memory fragmentation should become lower.
And all kernel objects will reside on a smaller number of pages, so we can
expect a better cache utilization.

> Also what is the effect on the memory
> reclaim side and the isolation. I would expect that mixing objects from
> different cgroups would have a negative/unpredictable impact on the
> memcg slab shrinking.

Slab shrinking already works on a per-object basis, so no changes here.

Quite the opposite: now the freed space can be reused by other cgroups,
whereas previously shrinking was often a useless operation, as nobody
could reuse the space until all objects were freed and the page could
be returned to the page allocator.

Thanks!