[v2,00/28] The new cgroup slab memory controller

Message ID 20200127173453.2089565-1-guro@fb.com (mailing list archive)
Series The new cgroup slab memory controller

Message

Roman Gushchin Jan. 27, 2020, 5:34 p.m. UTC
The existing cgroup slab memory controller is based on the idea of
replicating slab allocator internals for each memory cgroup.
This approach promises a low memory overhead (one pointer per page)
and adds little code on the hot allocation and release paths. But it
has a serious flaw: it leads to low slab utilization.

Using a drgn* script, I estimated slab utilization on a number of
machines running different production workloads. In most cases it was
between 45% and 65%, and the best number I've seen was around 85%.
Turning kmem accounting off brings utilization into the high 90s and
gives back 30-50% of slab memory. It means that the real price of the
existing slab memory controller is much higher than a pointer per page;
a rough way to reproduce the estimate is sketched below.
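(The drgn script itself isn't included in this cover letter. As a rough,
hypothetical approximation, a similar estimate can be computed in
userspace from /proc/slabinfo, which on these kernels folds per-memcg
caches into their root caches. Note that SLUB reports approximate active
object counts, so treat the result as an estimate only:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char line[512], name[64];
	unsigned long active_objs, num_objs, objsize, objperslab;
	unsigned long pagesperslab, limit, batch, shared;
	unsigned long active_slabs, num_slabs, avail;
	double used = 0, total = 0;
	long page_size = sysconf(_SC_PAGESIZE);
	FILE *f = fopen("/proc/slabinfo", "r");	/* needs root */

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* skip the version and header lines */
		if (sscanf(line,
			   "%63s %lu %lu %lu %lu %lu : tunables %lu %lu %lu"
			   " : slabdata %lu %lu %lu",
			   name, &active_objs, &num_objs, &objsize,
			   &objperslab, &pagesperslab, &limit, &batch,
			   &shared, &active_slabs, &num_slabs, &avail) != 12)
			continue;
		used  += (double)active_objs * objsize;	/* bytes in use */
		total += (double)num_slabs * pagesperslab * page_size;
	}
	fclose(f);
	if (total > 0)
		printf("estimated slab utilization: %.1f%%\n",
		       100.0 * used / total);
	return 0;
}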

The real reason the existing design leads to low slab utilization is
simple: slab pages are used exclusively by one memory cgroup. Utilization
ends up very low whenever a cgroup makes only a few allocations of a
certain size, leaves active objects (e.g. dentries) behind after it is
deleted, or contains a single-threaded application that rarely allocates
kernel objects but does so on a different CPU every time. As an extreme
illustration, a single 64-byte object keeps a whole 4 KB slab page
allocated, so under 2% of that page is actually used. If kmem accounting
is off, the kernel is able to use the free space on slab pages for other
allocations.

Arguably it wasn't an issue back in the days when the kmem controller
was introduced as an opt-in feature, which had to be turned on
individually for each memory cgroup. But now it's turned on by default
on both cgroup v1 and v2, and modern systemd-based systems tend to
create a large number of cgroups.

This patchset provides a new implementation of the slab memory
controller, which aims at much better slab utilization by sharing slab
pages between multiple memory cgroups. Below is a short description of
the new design (more details in the commit messages).

Accounting is performed per object instead of per page. Slab-related
vmstat counters are converted to bytes. Charging is still performed on
a page basis, rounding up and remembering the leftover bytes, as in the
sketch below.
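A minimal sketch of the charging path (simplified; function names
approximate the patches rather than quote them):

static int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp,
			     size_t size)
{
	unsigned int nr_pages, nr_bytes;
	int ret;

	/* can the request be served from the remembered leftover? */
	if (consume_obj_stock(objcg, size))
		return 0;

	/* round the charge up to whole pages */
	nr_pages = size >> PAGE_SHIFT;
	nr_bytes = size & (PAGE_SIZE - 1);
	if (nr_bytes)
		nr_pages += 1;

	ret = __memcg_kmem_charge(obj_cgroup_memcg(objcg), gfp, nr_pages);
	if (!ret && nr_bytes)
		/* remember the unused tail of the last charged page */
		refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);

	return ret;
}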

Memcg ownership data is stored in a per-slab-page vector: for each slab
page a vector of the corresponding size is allocated. To keep slab memory
reparenting working, an intermediate object is used instead of a direct
pointer to the memory cgroup: it's simply a pointer to a memcg (which can
easily be switched to the parent) with a built-in reference counter. This
scheme makes it possible to reparent all allocated objects without
walking them and changing each memcg pointer to the parent; see the
sketch below.
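In a simplified form (locking, lifetime and list management omitted):

struct obj_cgroup {
	struct percpu_ref refcnt;	/* pins this object while slab
					 * objects still reference it */
	struct mem_cgroup *memcg;	/* owning memcg; switched to the
					 * parent on reparenting */
};

/* reparenting updates one pointer instead of touching every object */
static void obj_cgroup_reparent(struct obj_cgroup *objcg,
				struct mem_cgroup *parent)
{
	xchg(&objcg->memcg, parent);
}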

Instead of creating an individual set of kmem_caches for each memory
cgroup, two global sets are used: the root set for non-accounted and
root-cgroup allocations, and a second set for all other allocations.
This simplifies the lifetime management of individual kmem_caches: they
are destroyed together with their root counterparts. It removes a good
amount of code and makes things generally simpler. A sketch of the
cache selection follows.
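For illustration, the cache selection at allocation time becomes roughly
the following (helper and field names here are approximate, not the
exact identifiers from the patches):

static inline struct kmem_cache *memcg_slab_cache(struct kmem_cache *s,
						  gfp_t flags)
{
	/* accounted allocations share one memcg twin of the root cache */
	if (memcg_kmem_enabled() &&
	    ((flags & __GFP_ACCOUNT) || (s->flags & SLAB_ACCOUNT)))
		return s->memcg_cache;

	/* non-accounted and root-cgroup allocations use the root cache */
	return s;
}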

The patchset* has been tested on a number of different workloads in our
production. In all cases it saved a significant amount of memory, from
high hundreds of MBs to single GBs per host. On average, the size of
slab memory has been reduced by 35-45%.

(* These numbers were obtained using a backport of this patchset to the
kernel version used in fb production. But similar numbers can be obtained
on a vanilla kernel: on my personal desktop with an 8-core CPU and 16 GB
of RAM running Fedora 31, the new slab controller saves ~45-50% of slab
memory, measured just after booting the system.)

Additionally, it should lead to lower memory fragmentation, both because
there are fewer non-movable pages and because there is no longer a need
to move all slab objects to a new set of pages when a workload is
restarted in a new memory cgroup.

The patchset consists of several blocks:
patches (1)-(6) clean up the existing kmem accounting API,
patches (7)-(13) prepare vmstat to count individual slab objects,
patches (14)-(21) implement the main idea of the patchset,
patches (22)-(25) are following clean-ups of the memcg/slab code,
patches (26)-(27) implement a drgn-based replacement for per-memcg slabinfo,
patch (28) adds kselftests covering kernel memory accounting functionality.


* https://github.com/osandov/drgn

v2:
  1) implemented re-layering and renaming suggested by Johannes,
    added his patch to the set. Thanks!
  2) fixed the issue discovered by Bharata B Rao. Thanks!
  3) added kmem API clean up part
  4) added slab/memcg follow-up clean up part
  5) fixed a couple of issues discovered by internal testing on FB fleet.
  6) added kselftests
  7) included metadata into the charge calculation
  8) refreshed commit logs, regrouped patches, rebased onto the mm tree, etc.

v1:
  1) fixed a bug in zoneinfo_show_print()
  2) added some comments to the subpage charging API, a minor fix
  3) separated memory.kmem.slabinfo deprecation into a separate patch,
     provided a drgn-based replacement
  4) rebased on top of the current mm tree

RFC:
  https://lwn.net/Articles/798605/


Johannes Weiner (1):
  mm: memcontrol: decouple reference counting from page accounting

Roman Gushchin (27):
  mm: kmem: cleanup (__)memcg_kmem_charge_memcg() arguments
  mm: kmem: cleanup memcg_kmem_uncharge_memcg() arguments
  mm: kmem: rename memcg_kmem_(un)charge() into
    memcg_kmem_(un)charge_page()
  mm: kmem: switch to nr_pages in (__)memcg_kmem_charge_memcg()
  mm: memcg/slab: cache page number in memcg_(un)charge_slab()
  mm: kmem: rename (__)memcg_kmem_(un)charge_memcg() to
    __memcg_kmem_(un)charge()
  mm: memcg/slab: introduce mem_cgroup_from_obj()
  mm: fork: fix kernel_stack memcg stats for various stack
    implementations
  mm: memcg/slab: rename __mod_lruvec_slab_state() into
    __mod_lruvec_obj_state()
  mm: memcg: introduce mod_lruvec_memcg_state()
  mm: slub: implement SLUB version of obj_to_index()
  mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
  mm: vmstat: convert slab vmstat counter to bytes
  mm: memcg/slab: obj_cgroup API
  mm: memcg/slab: allocate obj_cgroups for non-root slab pages
  mm: memcg/slab: save obj_cgroup for non-root slab objects
  mm: memcg/slab: charge individual slab objects instead of pages
  mm: memcg/slab: deprecate memory.kmem.slabinfo
  mm: memcg/slab: move memcg_kmem_bypass() to memcontrol.h
  mm: memcg/slab: use a single set of kmem_caches for all memory cgroups
  mm: memcg/slab: simplify memcg cache creation
  mm: memcg/slab: deprecate memcg_kmem_get_cache()
  mm: memcg/slab: deprecate slab_root_caches
  mm: memcg/slab: remove redundant check in memcg_accumulate_slabinfo()
  tools/cgroup: add slabinfo.py tool
  tools/cgroup: make slabinfo.py compatible with new slab controller
  kselftests: cgroup: add kernel memory accounting tests

 drivers/base/node.c                        |  14 +-
 fs/pipe.c                                  |   2 +-
 fs/proc/meminfo.c                          |   4 +-
 include/linux/memcontrol.h                 | 147 ++++-
 include/linux/mm.h                         |  25 +-
 include/linux/mm_types.h                   |   5 +-
 include/linux/mmzone.h                     |  12 +-
 include/linux/slab.h                       |   5 +-
 include/linux/slub_def.h                   |   9 +
 include/linux/vmstat.h                     |   8 +
 kernel/fork.c                              |  13 +-
 kernel/power/snapshot.c                    |   2 +-
 mm/list_lru.c                              |  12 +-
 mm/memcontrol.c                            | 638 +++++++++++++--------
 mm/oom_kill.c                              |   2 +-
 mm/page_alloc.c                            |  12 +-
 mm/slab.c                                  |  36 +-
 mm/slab.h                                  | 346 +++++------
 mm/slab_common.c                           | 513 ++---------------
 mm/slob.c                                  |  12 +-
 mm/slub.c                                  |  62 +-
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |  37 +-
 mm/workingset.c                            |   6 +-
 tools/cgroup/slabinfo.py                   | 220 +++++++
 tools/testing/selftests/cgroup/.gitignore  |   1 +
 tools/testing/selftests/cgroup/Makefile    |   2 +
 tools/testing/selftests/cgroup/test_kmem.c | 380 ++++++++++++
 28 files changed, 1505 insertions(+), 1023 deletions(-)
 create mode 100755 tools/cgroup/slabinfo.py
 create mode 100644 tools/testing/selftests/cgroup/test_kmem.c

Comments

Bharata B Rao Jan. 30, 2020, 2:06 a.m. UTC | #1
On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> [...]
> 
> The patchset* has been tested on a number of different workloads in our
> production. In all cases it saved a significant amount of memory, from
> high hundreds of MBs to single GBs per host. On average, the size of
> slab memory has been reduced by 35-45%.

Here are some numbers from multiple runs of sysbench and kernel compilation
with this patchset on a 10-core POWER8 host:

==========================================================================
Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
of a mem cgroup (Sampling every 5s)
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes	15859712	4456448		72
memory.usage_in_bytes		337510400	335806464	0.5
Slab: (kB)			814336		607296		25

memory.kmem.usage_in_bytes	16187392	4653056		71
memory.usage_in_bytes		318832640	300154880	5
Slab: (kB)			789888		559744		29
--------------------------------------------------------------------------


Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
meminfo:Slab for kernel compilation (make -s -j64) Compilation was
done from bash that is in a memory cgroup. (Sampling every 5s)
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
memory.kmem.usage_in_bytes	338493440	231931904	31
memory.usage_in_bytes		7368015872	6275923968	15
Slab: (kB)			1139072		785408		31

memory.kmem.usage_in_bytes	341835776	236453888	30
memory.usage_in_bytes		6540427264	6072893440	7
Slab: (kB)			1074304		761280		29

memory.kmem.usage_in_bytes	340525056	233570304	31
memory.usage_in_bytes		6406209536	6177357824	3
Slab: (kB)			1244288		739712		40
--------------------------------------------------------------------------

Slab consumption right after boot
--------------------------------------------------------------------------
				5.5.0-rc7-mm1	+slab patch	%reduction
--------------------------------------------------------------------------
Slab: (kB)			821888		583424		29
==========================================================================
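
(For reference, the peak-usage sampling described above can be
approximated with a loop like the one below. This is a hypothetical
reconstruction, the actual script wasn't posted, and the cgroup path is
made up:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical cgroup v1 path; adjust to the cgroup under test */
	const char *path =
		"/sys/fs/cgroup/memory/test/memory.kmem.usage_in_bytes";
	unsigned long long val, peak = 0;

	for (;;) {
		FILE *f = fopen(path, "r");

		if (!f)
			return 1;
		if (fscanf(f, "%llu", &val) == 1 && val > peak) {
			peak = val;
			printf("new peak: %llu\n", peak);
			fflush(stdout);
		}
		fclose(f);
		sleep(5);	/* sampling every 5s, as above */
	}
}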

Summary:

With sysbench and kernel compilation, memory.kmem.usage_in_bytes
consistently shows around 70% and 30% reduction, respectively.

I didn't see a consistent reduction of memory.usage_in_bytes with
sysbench and kernel compilation.

Slab usage (from /proc/meminfo) shows a consistent 30% reduction, and
the same is seen right after boot too.

Regards,
Bharata.
Roman Gushchin Jan. 30, 2020, 2:41 a.m. UTC | #2
On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > [...]
> 
> Here are some numbers from multiple runs of sysbench and kernel compilation
> with this patchset on a 10-core POWER8 host:
> 
> [...]
> 
> Summary:
> 
> With sysbench and kernel compilation, memory.kmem.usage_in_bytes
> consistently shows around 70% and 30% reduction, respectively.
> 
> I didn't see a consistent reduction of memory.usage_in_bytes with
> sysbench and kernel compilation.
> 
> Slab usage (from /proc/meminfo) shows a consistent 30% reduction, and
> the same is seen right after boot too.

That's just perfect!

memory.usage_in_bytes was most likely the same because the freed space
was taken by pagecache.

Thank you very much for testing!

Roman
Pasha Tatashin Aug. 12, 2020, 11:16 p.m. UTC | #3
Guys,

There is a convoluted deadlock that I just root caused, and that is
fixed by this work (at least based on my code inspection it appears to
be fixed); but the deadlock exists in older and stable kernels, and I
am not sure whether to create a separate patch for it, or backport
this whole thing.

Thread #1: Hot-removes memory
device_offline
  memory_subsys_offline
    offline_pages
      __offline_pages
        mem_hotplug_lock <- write access
      waits for Thread #3's refcount on pfn 9e5113 to drop to 1 so it
can migrate the page.

Thread #2: css killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits on Thread #1's write lock to get read access

Thread #3: crashing userland program
do_coredump
  elf_core_dump
      get_dump_page() -> get page with pfn#9e5113, and increment refcnt
      dump_emit
        __kernel_write
          __vfs_write
            new_sync_write
              pipe_write
                pipe_wait   -> waits for Thread #4 systemd-coredump to
read the pipe

Thread #4: systemd-coredump
ksys_read
  vfs_read
    __vfs_read
      seq_read
        proc_single_show
          proc_cgroup_show
            cgroup_mutex -> waits from Thread #2 for this lock.

In summary:
Thread #1 waits for Thread #3 on the page refcount; Thread #3 waits for
Thread #4 to read the pipe; Thread #4 waits for Thread #2 for the
cgroup_mutex lock; and Thread #2 waits for Thread #1 for the
mem_hotplug_lock rwlock.

This work appears to fix this deadlock because cgroup_mutex is no longer
taken before mem_hotplug_lock (unless I am missing something), as it
removes memcg_deactivate_kmem_caches.

Thank you,
Pasha

On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > The existing cgroup slab memory controller is based on the idea of
> > > replicating slab allocator internals for each memory cgroup.
> > > This approach promises a low memory overhead (one pointer per page),
> > > and isn't adding too much code on hot allocation and release paths.
> > > But is has a very serious flaw: it leads to a low slab utilization.
> > >
> > > Using a drgn* script I've got an estimation of slab utilization on
> > > a number of machines running different production workloads. In most
> > > cases it was between 45% and 65%, and the best number I've seen was
> > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > it brings back 30-50% of slab memory. It means that the real price
> > > of the existing slab memory controller is way bigger than a pointer
> > > per page.
> > >
> > > The real reason why the existing design leads to a low slab utilization
> > > is simple: slab pages are used exclusively by one memory cgroup.
> > > If there are only few allocations of certain size made by a cgroup,
> > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > deleted, or the cgroup contains a single-threaded application which is
> > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > in all these cases the resulting slab utilization is very low.
> > > If kmem accounting is off, the kernel is able to use free space
> > > on slab pages for other allocations.
> > >
> > > Arguably it wasn't an issue back to days when the kmem controller was
> > > introduced and was an opt-in feature, which had to be turned on
> > > individually for each memory cgroup. But now it's turned on by default
> > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > create a large number of cgroups.
> > >
> > > This patchset provides a new implementation of the slab memory controller,
> > > which aims to reach a much better slab utilization by sharing slab pages
> > > between multiple memory cgroups. Below is the short description of the new
> > > design (more details in commit messages).
> > >
> > > Accounting is performed per-object instead of per-page. Slab-related
> > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > with rounding up and remembering leftovers.
> > >
> > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > working, instead of saving a pointer to the memory cgroup directly an
> > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > easily changed to the parent) with a built-in reference counter. This scheme
> > > allows to reparent all allocated objects without walking them over and
> > > changing memcg pointer to the parent.
> > >
> > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > two global sets are used: the root set for non-accounted and root-cgroup
> > > allocations and the second set for all other allocations. This allows to
> > > simplify the lifetime management of individual kmem_caches: they are
> > > destroyed with root counterparts. It allows to remove a good amount of code
> > > and make things generally simpler.
> > >
> > > The patchset* has been tested on a number of different workloads in our
> > > production. In all cases it saved significant amount of memory, measured
> > > from high hundreds of MBs to single GBs per host. On average, the size
> > > of slab memory has been reduced by 35-45%.
> >
> > Here are some numbers from multiple runs of sysbench and kernel compilation
> > with this patchset on a 10 core POWER8 host:
> >
> > ==========================================================================
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > of a mem cgroup (Sampling every 5s)
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes    15859712        4456448         72
> > memory.usage_in_bytes         337510400       335806464       .5
> > Slab: (kB)                    814336          607296          25
> >
> > memory.kmem.usage_in_bytes    16187392        4653056         71
> > memory.usage_in_bytes         318832640       300154880       5
> > Slab: (kB)                    789888          559744          29
> > --------------------------------------------------------------------------
> >
> >
> > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > done from bash that is in a memory cgroup. (Sampling every 5s)
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > memory.kmem.usage_in_bytes    338493440       231931904       31
> > memory.usage_in_bytes         7368015872      6275923968      15
> > Slab: (kB)                    1139072         785408          31
> >
> > memory.kmem.usage_in_bytes    341835776       236453888       30
> > memory.usage_in_bytes         6540427264      6072893440      7
> > Slab: (kB)                    1074304         761280          29
> >
> > memory.kmem.usage_in_bytes    340525056       233570304       31
> > memory.usage_in_bytes         6406209536      6177357824      3
> > Slab: (kB)                    1244288         739712          40
> > --------------------------------------------------------------------------
> >
> > Slab consumption right after boot
> > --------------------------------------------------------------------------
> >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > --------------------------------------------------------------------------
> > Slab: (kB)                    821888          583424          29
> > ==========================================================================
> >
> > Summary:
> >
> > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > around 70% and 30% reduction consistently.
> >
> > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > kernel compilation.
> >
> > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > same is seen right after boot too.
>
> That's just perfect!
>
> memory.usage_in_bytes was most likely the same because the freed space
> was taken by pagecache.
>
> Thank you very much for testing!
>
> Roman
Pasha Tatashin Aug. 12, 2020, 11:18 p.m. UTC | #4
BTW, I replied to the wrong version of this work. I intended to reply to
version 7:
https://lore.kernel.org/lkml/20200623174037.3951353-1-guro@fb.com/

Nevertheless, the problem is the same.

Thank you,
Pasha

On Wed, Aug 12, 2020 at 7:16 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
> [...]
Roman Gushchin Aug. 13, 2020, 12:04 a.m. UTC | #5
On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> Guys,
> 
> There is a convoluted deadlock that I just root caused, and that is
> fixed by this work (at least based on my code inspection it appears to
> be fixed); but the deadlock exists in older and stable kernels, and I
> am not sure whether to create a separate patch for it, or backport
> this whole thing.

Hi Pavel,

wow, it's quite a complicated deadlock. Thank you for providing
a perfect analysis!

Unfortunately, backporting the whole new slab controller isn't an option:
it's way too big and invasive.
Do you already have a standalone fix?

Thanks!


> [...]
Pasha Tatashin Aug. 13, 2020, 12:31 a.m. UTC | #6
On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
>
> On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > Guys,
> >
> > There is a convoluted deadlock that I just root caused, and that is
> > fixed by this work (at least based on my code inspection it appears to
> > be fixed); but the deadlock exists in older and stable kernels, and I
> > am not sure whether to create a separate patch for it, or backport
> > this whole thing.
>

Hi Roman,

> Hi Pavel,
>
> wow, it's a quite complicated deadlock. Thank you for providing
> a perfect analysis!

Thank you, it indeed took me a while to fully grasp the deadlock.

>
> Unfortunately, backporting the whole new slab controller isn't an option:
> it's way too big and invasive.

This is what I thought as well; that's why I want to figure out the
best way forward.

> Do you already have a standalone fix?

Not yet, I do not have a standalone fix. I suspect the best fix would
be to address the css_killed_work_fn() stack so we never have
cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reversing
the order would work. Since you worked on this code recently, please
let me know if you have suggestions.

Thank you,
Pasha

> [...]
Pasha Tatashin Aug. 28, 2020, 4:47 p.m. UTC | #7
There appears to be another problem that is related to the
cgroup_mutex -> mem_hotplug_lock deadlock described above.

In the original deadlock that I described, the workaround is to
switch the crash dump from piping to the traditional save-to-file
method. However, after trying this workaround, I still observed
hardware watchdog resets during machine shutdown.

The new problem occurs for the following reason: upon shutdown, systemd
calls a service that hot-removes memory, and if hot-removing fails for
some reason, systemd kills that service after a timeout. However, systemd
is never able to kill the service, and we get a hardware reset caused by
the watchdog, or a hang during shutdown:

Thread #1: memory hot-remove systemd service
Loops indefinitely, because as long as something still has to be
migrated this loop never terminates. It can, however, be terminated by a
signal from systemd after the timeout.
__offline_pages()
      do {
          pfn = scan_movable_pages(pfn, end_pfn);
                  # Returns 0, meaning there is nothing available to
                  # migrate, no page is PageLRU(page)
          ...
          ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                            NULL, check_pages_isolated_cb);
                  # Returns -EBUSY, meaning there is at least one PFN that
                  # still has to be migrated.
      } while (ret);

Thread #2: css killer kthread
   css_killed_work_fn
     cgroup_mutex  <- Grab this Mutex
     mem_cgroup_css_offline
       memcg_offline_kmem.part
          memcg_deactivate_kmem_caches
            get_online_mems
              mem_hotplug_lock <- waits for Thread#1 to get read access

Thread #3: systemd
ksys_read
 vfs_read
   __vfs_read
     seq_read
       proc_single_show
         proc_cgroup_show
           mutex_lock -> wait for cgroup_mutex that is owned by Thread #2

Thus, thread #3 (systemd) is stuck and unable to deliver the timeout
signal to thread #1.
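
To make the circular wait concrete, below is a minimal userspace sketch
of the same pattern (hypothetical: plain pthread mutexes stand in for
cgroup_mutex and the mem_hotplug_lock rwsem, and an atomic flag stands
in for the timeout signal; compile with -pthread, the program is
expected to hang):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "cgroup_mutex" */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "mem_hotplug_lock" */
static atomic_int signalled;

static void *offliner(void *arg)	/* Thread #1: holds lock_b, loops */
{
	(void)arg;
	pthread_mutex_lock(&lock_b);
	while (!atomic_load(&signalled))	/* never set: thread 3 is stuck */
		usleep(1000);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

static void *css_killer(void *arg)	/* Thread #2: lock_a, then lock_b */
{
	(void)arg;
	pthread_mutex_lock(&lock_a);
	pthread_mutex_lock(&lock_b);	/* blocks behind thread 1 forever */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

static void *timeout_killer(void *arg)	/* Thread #3: needs lock_a first */
{
	(void)arg;
	pthread_mutex_lock(&lock_a);	/* blocks behind thread 2 */
	atomic_store(&signalled, 1);	/* the "timeout" is never delivered */
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2, t3;

	pthread_create(&t1, NULL, offliner, NULL);
	sleep(1);			/* crude ordering to force the hang */
	pthread_create(&t2, NULL, css_killer, NULL);
	sleep(1);
	pthread_create(&t3, NULL, timeout_killer, NULL);
	pthread_join(t1, NULL);		/* never returns: circular wait */
	puts("unreachable");
	return 0;
}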

The proper fix for both problems is to avoid the cgroup_mutex ->
mem_hotplug_lock ordering, which was recently eliminated in mainline but
is still present in all stable branches. Unfortunately, I do not see a
simple way to remove mem_hotplug_lock from
memcg_deactivate_kmem_caches without Roman's series, which is too
big for stable.

Thanks,
Pasha

On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
> > > Guys,
> > >
> > > There is a convoluted deadlock that I just root caused, and that is
> > > fixed by this work (at least based on my code inspection it appears to
> > > be fixed); but the deadlock exists in older and stable kernels, and I
> > > am not sure whether to create a separate patch for it, or backport
> > > this whole thing.
> >
>
> Hi Roman,
>
> > Hi Pavel,
> >
> > wow, it's a quite complicated deadlock. Thank you for providing
> > a perfect analysis!
>
> Thank you, it indeed took me a while to fully grasp the deadlock.
>
> >
> > Unfortunately, backporting the whole new slab controller isn't an option:
> > it's way too big and invasive.
>
> This is what I thought as well, this is why I want to figure out what
> is the best way forward.
>
> > Do you already have a standalone fix?
>
> Not yet, I do not have a standalone fix. I suspect the best fix would
> be to address fix css_killed_work_fn() stack so we never have:
> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
> the order would work. If you have suggestions since you worked on this
> code recently, please let me know.
>
> Thank you,
> Pasha
>
> >
> > Thanks!
> >
> >
> > >
> > > Thread #1: Hot-removes memory
> > > device_offline
> > >   memory_subsys_offline
> > >     offline_pages
> > >       __offline_pages
> > >         mem_hotplug_lock <- write access
> > >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
> > > migrate it.
> > >
> > > Thread #2: css killer kthread
> > >    css_killed_work_fn
> > >      cgroup_mutex  <- Grab this Mutex
> > >      mem_cgroup_css_offline
> > >        memcg_offline_kmem.part
> > >           memcg_deactivate_kmem_caches
> > >             get_online_mems
> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: crashing userland program
> > > do_coredump
> > >   elf_core_dump
> > >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
> > >       dump_emit
> > >         __kernel_write
> > >           __vfs_write
> > >             new_sync_write
> > >               pipe_write
> > >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
> > > read the pipe
> > >
> > > Thread #4: systemd-coredump
> > > ksys_read
> > >   vfs_read
> > >     __vfs_read
> > >       seq_read
> > >         proc_single_show
> > >           proc_cgroup_show
> > >             cgroup_mutex -> waits from Thread #2 for this lock.
> >
> > >
> > > In Summary:
> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
> > > waits for Thread#1 for mem_hotplug_lock rwlock.
> > >
> > > This work appears to fix this deadlock because cgroup_mutex is not
> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
> > > removes memcg_deactivate_kmem_caches.
> > >
> > > Thank you,
> > > Pasha
> > >
> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
> > > > > > The existing cgroup slab memory controller is based on the idea of
> > > > > > replicating slab allocator internals for each memory cgroup.
> > > > > > This approach promises a low memory overhead (one pointer per page),
> > > > > > and isn't adding too much code on hot allocation and release paths.
> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
> > > > > >
> > > > > > Using a drgn* script I've got an estimation of slab utilization on
> > > > > > a number of machines running different production workloads. In most
> > > > > > cases it was between 45% and 65%, and the best number I've seen was
> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
> > > > > > it brings back 30-50% of slab memory. It means that the real price
> > > > > > of the existing slab memory controller is way bigger than a pointer
> > > > > > per page.
> > > > > >
> > > > > > The real reason why the existing design leads to a low slab utilization
> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
> > > > > > If there are only few allocations of certain size made by a cgroup,
> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
> > > > > > deleted, or the cgroup contains a single-threaded application which is
> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
> > > > > > in all these cases the resulting slab utilization is very low.
> > > > > > If kmem accounting is off, the kernel is able to use free space
> > > > > > on slab pages for other allocations.
> > > > > >
> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
> > > > > > introduced and was an opt-in feature, which had to be turned on
> > > > > > individually for each memory cgroup. But now it's turned on by default
> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
> > > > > > create a large number of cgroups.
> > > > > >
> > > > > > This patchset provides a new implementation of the slab memory controller,
> > > > > > which aims to reach a much better slab utilization by sharing slab pages
> > > > > > between multiple memory cgroups. Below is the short description of the new
> > > > > > design (more details in commit messages).
> > > > > >
> > > > > > Accounting is performed per-object instead of per-page. Slab-related
> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
> > > > > > with rounding up and remembering leftovers.
> > > > > >
> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
> > > > > > working, instead of saving a pointer to the memory cgroup directly an
> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
> > > > > > allows to reparent all allocated objects without walking them over and
> > > > > > changing memcg pointer to the parent.
> > > > > >
> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
> > > > > > allocations and the second set for all other allocations. This allows to
> > > > > > simplify the lifetime management of individual kmem_caches: they are
> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
> > > > > > and make things generally simpler.
> > > > > >
> > > > > > The patchset* has been tested on a number of different workloads in our
> > > > > > production. In all cases it saved significant amount of memory, measured
> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
> > > > > > of slab memory has been reduced by 35-45%.
> > > > >
> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
> > > > > with this patchset on a 10 core POWER8 host:
> > > > >
> > > > > ==========================================================================
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
> > > > > of a mem cgroup (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes    15859712        4456448         72
> > > > > memory.usage_in_bytes         337510400       335806464       .5
> > > > > Slab: (kB)                    814336          607296          25
> > > > >
> > > > > memory.kmem.usage_in_bytes    16187392        4653056         71
> > > > > memory.usage_in_bytes         318832640       300154880       5
> > > > > Slab: (kB)                    789888          559744          29
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > >
> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > memory.kmem.usage_in_bytes    338493440       231931904       31
> > > > > memory.usage_in_bytes         7368015872      6275923968      15
> > > > > Slab: (kB)                    1139072         785408          31
> > > > >
> > > > > memory.kmem.usage_in_bytes    341835776       236453888       30
> > > > > memory.usage_in_bytes         6540427264      6072893440      7
> > > > > Slab: (kB)                    1074304         761280          29
> > > > >
> > > > > memory.kmem.usage_in_bytes    340525056       233570304       31
> > > > > memory.usage_in_bytes         6406209536      6177357824      3
> > > > > Slab: (kB)                    1244288         739712          40
> > > > > --------------------------------------------------------------------------
> > > > >
> > > > > Slab consumption right after boot
> > > > > --------------------------------------------------------------------------
> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
> > > > > --------------------------------------------------------------------------
> > > > > Slab: (kB)                    821888          583424          29
> > > > > ==========================================================================
> > > > >
> > > > > Summary:
> > > > >
> > > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
> > > > > around 70% and 30% reduction consistently.
> > > > >
> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
> > > > > kernel compilation.
> > > > >
> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
> > > > > same is seen right after boot too.
> > > >
> > > > That's just perfect!
> > > >
> > > > memory.usage_in_bytes was most likely the same because the freed space
> > > > was taken by pagecache.
> > > >
> > > > Thank you very much for testing!
> > > >
> > > > Roman
Bharata B Rao Sept. 1, 2020, 5:28 a.m. UTC | #8
On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> 
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine  shutdown.
> 
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for
> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
> 
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>       do {
>           pfn = scan_movable_pages(pfn, end_pfn);
>                   # Returns 0, meaning there is nothing available to
>                   # migrate, no page is PageLRU(page)
>           ...
>           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>                                             NULL, check_pages_isolated_cb);
>                   # Returns -EBUSY, meaning there is at least one PFN that
>                   # still has to be migrated.
>       } while (ret);
> 
> Thread #2: css killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: systemd
> ksys_read
>  vfs_read
>    __vfs_read
>      seq_read
>        proc_single_show
>          proc_cgroup_show
>            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> 
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
> 
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.

We too are seeing this on Power systems when stress-testing memory
hotplug, but with the following call trace (from hung task timer)
instead of Thread #2 above:

__switch_to
__schedule
schedule
percpu_rwsem_wait
__percpu_down_read
get_online_mems
memcg_create_kmem_cache
memcg_kmem_cache_create_func
process_one_work
worker_thread
kthread
ret_from_kernel_thread

While I understand that Roman's new slab controller patchset will fix
this, I also wonder if infinitely looping in the memory unplug path
with mem_hotplug_lock held is the right thing to do? Earlier we had
a few other exit possibilities in this path (like max retries etc)
but those were removed by commits:

72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory

Or, is the user-space test expected to induce a signal back-off when
unplug doesn't complete within a reasonable amount of time?
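
(To make the question concrete, the kind of back-off I have in mind is
the minimal sketch below; the memory block number and the timeout are
illustrative, and it assumes, as described above, that the kernel's
retry loop bails out once a signal is pending.)

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; }	/* just interrupt the write */

int main(void)
{
	struct sigaction sa = { .sa_handler = on_alarm };  /* no SA_RESTART */
	int fd;

	sigaction(SIGALRM, &sa, NULL);

	/* memory32 is an illustrative block; pick one in a movable zone */
	fd = open("/sys/devices/system/memory/memory32/state", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	alarm(30);	/* bounded wait instead of an unkillable service */
	if (write(fd, "offline", strlen("offline")) < 0)
		perror("offline");	/* EINTR on timeout, EBUSY on failure */
	alarm(0);
	close(fd);
	return 0;
}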

Regards,
Bharata.
Pasha Tatashin Sept. 1, 2020, 12:52 p.m. UTC | #9
On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@linux.ibm.com> wrote:
>
> On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine  shutdown.
> >
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> >
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> >       do {
> >           pfn = scan_movable_pages(pfn, end_pfn);
> >                   # Returns 0, meaning there is nothing available to
> >                   # migrate, no page is PageLRU(page)
> >           ...
> >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                             NULL, check_pages_isolated_cb);
> >                   # Returns -EBUSY, meaning there is at least one PFN that
> >                   # still has to be migrated.
> >       } while (ret);
> >
> > Thread #2: css killer kthread
> >    css_killed_work_fn
> >      cgroup_mutex  <- Grab this Mutex
> >      mem_cgroup_css_offline
> >        memcg_offline_kmem.part
> >           memcg_deactivate_kmem_caches
> >             get_online_mems
> >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > Thread #3: systemd
> > ksys_read
> >  vfs_read
> >    __vfs_read
> >      seq_read
> >        proc_single_show
> >          proc_cgroup_show
> >            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> >
> > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > to thread #1.
> >
> > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > still present in all stable branches. Unfortunately, I do not see a
> > simple fix in how to remove mem_hotplug_lock from
> > memcg_deactivate_kmem_caches without using Roman's series that is too
> > big for stable.
>
> We too are seeing this on Power systems when stress-testing memory
> hotplug, but with the following call trace (from hung task timer)
> instead of Thread #2 above:
>
> __switch_to
> __schedule
> schedule
> percpu_rwsem_wait
> __percpu_down_read
> get_online_mems
> memcg_create_kmem_cache
> memcg_kmem_cache_create_func
> process_one_work
> worker_thread
> kthread
> ret_from_kernel_thread
>
> While I understand that Roman's new slab controller patchset will fix
> this, I also wonder if infinitely looping in the memory unplug path
> with mem_hotplug_lock held is the right thing to do? Earlier we had
> a few other exit possibilities in this path (like max retries etc)
> but those were removed by commits:
>
> 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
>
> Or, is the user-space test expected to induce a signal back-off when
> unplug doesn't complete within a reasonable amount of time?

Hi Bharata,

Thank you for your input, it looks like you are experiencing the same
problems that I observed.

What I found is that our machines did not complete hot-remove
within the given time because of this bug:
https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com

Could you please try it and see if that helps for your case?

Thank you,
Pasha
Bharata B Rao Sept. 2, 2020, 6:23 a.m. UTC | #10
On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
> On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao <bharata@linux.ibm.com> wrote:
> >
> > On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > > There appears to be another problem that is related to the
> > > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >
> > > In the original deadlock that I described, the workaround is to
> > > replace crash dump from piping to Linux traditional save to files
> > > method. However, after trying this workaround, I still observed
> > > hardware watchdog resets during machine  shutdown.
> > >
> > > The new problem occurs for the following reason: upon shutdown systemd
> > > calls a service that hot-removes memory, and if hot-removing fails for
> > > some reason systemd kills that service after timeout. However, systemd
> > > is never able to kill the service, and we get hardware reset caused by
> > > watchdog or a hang during shutdown:
> > >
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >       do {
> > >           pfn = scan_movable_pages(pfn, end_pfn);
> > >                   # Returns 0, meaning there is nothing available to
> > >                   # migrate, no page is PageLRU(page)
> > >           ...
> > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                             NULL, check_pages_isolated_cb);
> > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > >                   # still has to be migrated.
> > >       } while (ret);
> > >
> > > Thread #2: css killer kthread
> > >    css_killed_work_fn
> > >      cgroup_mutex  <- Grab this Mutex
> > >      mem_cgroup_css_offline
> > >        memcg_offline_kmem.part
> > >           memcg_deactivate_kmem_caches
> > >             get_online_mems
> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> > >
> > > Thread #3: systemd
> > > ksys_read
> > >  vfs_read
> > >    __vfs_read
> > >      seq_read
> > >        proc_single_show
> > >          proc_cgroup_show
> > >            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> > >
> > > Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> > > to thread #1.
> > >
> > > The proper fix for both of the problems is to avoid cgroup_mutex ->
> > > mem_hotplug_lock ordering that was recently fixed in the mainline but
> > > still present in all stable branches. Unfortunately, I do not see a
> > > simple fix in how to remove mem_hotplug_lock from
> > > memcg_deactivate_kmem_caches without using Roman's series that is too
> > > big for stable.
> >
> > We too are seeing this on Power systems when stress-testing memory
> > hotplug, but with the following call trace (from hung task timer)
> > instead of Thread #2 above:
> >
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > get_online_mems
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> > ret_from_kernel_thread
> >
> > While I understand that Roman's new slab controller patchset will fix
> > this, I also wonder if infinitely looping in the memory unplug path
> > with mem_hotplug_lock held is the right thing to do? Earlier we had
> > a few other exit possibilities in this path (like max retries etc)
> > but those were removed by commits:
> >
> > 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> > ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
> >
> > Or, is the user-space test expected to induce a signal back-off when
> > unplug doesn't complete within a reasonable amount of time?
> 
> Hi Bharata,
> 
> Thank you for your input, it looks like you are experiencing the same
> problems that I observed.
> 
> What I found is that the reason why our machines did not complete
> hot-remove within the given time is because of this bug:
> https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com
> 
> Could you please try it and see if that helps for your case?

I am on an old codebase that already has the fix that you are proposing,
so I might be seeing some other issue, which I will debug further.

So it looks like the loop in __offline_pages() had a call to
drain_all_pages() before it was removed by

c52e75935f8d: mm: remove extra drain pages on pcp list

Regards,
Bharata.
Vlastimil Babka Sept. 2, 2020, 9:53 a.m. UTC | #11
On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> There appears to be another problem that is related to the
> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> 
> In the original deadlock that I described, the workaround is to
> replace crash dump from piping to Linux traditional save to files
> method. However, after trying this workaround, I still observed
> hardware watchdog resets during machine  shutdown.
> 
> The new problem occurs for the following reason: upon shutdown systemd
> calls a service that hot-removes memory, and if hot-removing fails for

Why is that hotremove even needed if we're shutting down? Are there any
(virtualization?) platforms where it makes some difference over plain
shutdown/restart?

> some reason systemd kills that service after timeout. However, systemd
> is never able to kill the service, and we get hardware reset caused by
> watchdog or a hang during shutdown:
> 
> Thread #1: memory hot-remove systemd service
> Loops indefinitely, because if there is something still to be migrated
> this loop never terminates. However, this loop can be terminated via
> signal from systemd after timeout.
> __offline_pages()
>       do {
>           pfn = scan_movable_pages(pfn, end_pfn);
>                   # Returns 0, meaning there is nothing available to
>                   # migrate, no page is PageLRU(page)
>           ...
>           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>                                             NULL, check_pages_isolated_cb);
>                   # Returns -EBUSY, meaning there is at least one PFN that
>                   # still has to be migrated.
>       } while (ret);
> 
> Thread #2: css killer kthread
>    css_killed_work_fn
>      cgroup_mutex  <- Grab this Mutex
>      mem_cgroup_css_offline
>        memcg_offline_kmem.part
>           memcg_deactivate_kmem_caches
>             get_online_mems
>               mem_hotplug_lock <- waits for Thread#1 to get read access
> 
> Thread #3: systemd
> ksys_read
>  vfs_read
>    __vfs_read
>      seq_read
>        proc_single_show
>          proc_cgroup_show
>            mutex_lock -> wait for cgroup_mutex that is owned by Thread #2
> 
> Thus, thread #3 systemd stuck, and unable to deliver timeout interrupt
> to thread #1.
> 
> The proper fix for both of the problems is to avoid cgroup_mutex ->
> mem_hotplug_lock ordering that was recently fixed in the mainline but
> still present in all stable branches. Unfortunately, I do not see a
> simple fix in how to remove mem_hotplug_lock from
> memcg_deactivate_kmem_caches without using Roman's series that is too
> big for stable.
> 
> Thanks,
> Pasha
> 
> On Wed, Aug 12, 2020 at 8:31 PM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
>> On Wed, Aug 12, 2020 at 8:04 PM Roman Gushchin <guro@fb.com> wrote:
>> >
>> > On Wed, Aug 12, 2020 at 07:16:08PM -0400, Pavel Tatashin wrote:
>> > > Guys,
>> > >
>> > > There is a convoluted deadlock that I just root caused, and that is
>> > > fixed by this work (at least based on my code inspection it appears to
>> > > be fixed); but the deadlock exists in older and stable kernels, and I
>> > > am not sure whether to create a separate patch for it, or backport
>> > > this whole thing.
>> >
>>
>> Hi Roman,
>>
>> > Hi Pavel,
>> >
>> > wow, it's a quite complicated deadlock. Thank you for providing
>> > a perfect analysis!
>>
>> Thank you, it indeed took me a while to fully grasp the deadlock.
>>
>> >
>> > Unfortunately, backporting the whole new slab controller isn't an option:
>> > it's way too big and invasive.
>>
>> This is what I thought as well, this is why I want to figure out what
>> is the best way forward.
>>
>> > Do you already have a standalone fix?
>>
>> Not yet, I do not have a standalone fix. I suspect the best fix would
>> be to address fix css_killed_work_fn() stack so we never have:
>> cgroup_mutex -> mem_hotplug_lock. Either decoupling them or reverse
>> the order would work. If you have suggestions since you worked on this
>> code recently, please let me know.
>>
>> Thank you,
>> Pasha
>>
>> >
>> > Thanks!
>> >
>> >
>> > >
>> > > Thread #1: Hot-removes memory
>> > > device_offline
>> > >   memory_subsys_offline
>> > >     offline_pages
>> > >       __offline_pages
>> > >         mem_hotplug_lock <- write access
>> > >       waits for Thread #3 refcnt for pfn 9e5113 to get to 1 so it can
>> > > migrate it.
>> > >
>> > > Thread #2: css killer kthread
>> > >    css_killed_work_fn
>> > >      cgroup_mutex  <- Grab this Mutex
>> > >      mem_cgroup_css_offline
>> > >        memcg_offline_kmem.part
>> > >           memcg_deactivate_kmem_caches
>> > >             get_online_mems
>> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
>> > >
>> > > Thread #3: crashing userland program
>> > > do_coredump
>> > >   elf_core_dump
>> > >       get_dump_page() -> get page with pfn#9e5113, and increment refcnt
>> > >       dump_emit
>> > >         __kernel_write
>> > >           __vfs_write
>> > >             new_sync_write
>> > >               pipe_write
>> > >                 pipe_wait   -> waits for Thread #4 systemd-coredump to
>> > > read the pipe
>> > >
>> > > Thread #4: systemd-coredump
>> > > ksys_read
>> > >   vfs_read
>> > >     __vfs_read
>> > >       seq_read
>> > >         proc_single_show
>> > >           proc_cgroup_show
>> > >             cgroup_mutex -> waits from Thread #2 for this lock.
>> >
>> > >
>> > > In Summary:
>> > > Thread#1 waits for Thread#3 for refcnt, Thread#3 waits for Thread#4 to
>> > > read pipe. Thread#4 waits for Thread#2 for cgroup_mutex lock; Thread#2
>> > > waits for Thread#1 for mem_hotplug_lock rwlock.
>> > >
>> > > This work appears to fix this deadlock because cgroup_mutex is not
>> > > called anymore before mem_hotplug_lock (unless I am missing it), as it
>> > > removes memcg_deactivate_kmem_caches.
>> > >
>> > > Thank you,
>> > > Pasha
>> > >
>> > > On Wed, Jan 29, 2020 at 9:42 PM Roman Gushchin <guro@fb.com> wrote:
>> > > >
>> > > > On Thu, Jan 30, 2020 at 07:36:26AM +0530, Bharata B Rao wrote:
>> > > > > On Mon, Jan 27, 2020 at 09:34:25AM -0800, Roman Gushchin wrote:
>> > > > > > The existing cgroup slab memory controller is based on the idea of
>> > > > > > replicating slab allocator internals for each memory cgroup.
>> > > > > > This approach promises a low memory overhead (one pointer per page),
>> > > > > > and isn't adding too much code on hot allocation and release paths.
>> > > > > > But is has a very serious flaw: it leads to a low slab utilization.
>> > > > > >
>> > > > > > Using a drgn* script I've got an estimation of slab utilization on
>> > > > > > a number of machines running different production workloads. In most
>> > > > > > cases it was between 45% and 65%, and the best number I've seen was
>> > > > > > around 85%. Turning kmem accounting off brings it to high 90s. Also
>> > > > > > it brings back 30-50% of slab memory. It means that the real price
>> > > > > > of the existing slab memory controller is way bigger than a pointer
>> > > > > > per page.
>> > > > > >
>> > > > > > The real reason why the existing design leads to a low slab utilization
>> > > > > > is simple: slab pages are used exclusively by one memory cgroup.
>> > > > > > If there are only few allocations of certain size made by a cgroup,
>> > > > > > or if some active objects (e.g. dentries) are left after the cgroup is
>> > > > > > deleted, or the cgroup contains a single-threaded application which is
>> > > > > > barely allocating any kernel objects, but does it every time on a new CPU:
>> > > > > > in all these cases the resulting slab utilization is very low.
>> > > > > > If kmem accounting is off, the kernel is able to use free space
>> > > > > > on slab pages for other allocations.
>> > > > > >
>> > > > > > Arguably it wasn't an issue back to days when the kmem controller was
>> > > > > > introduced and was an opt-in feature, which had to be turned on
>> > > > > > individually for each memory cgroup. But now it's turned on by default
>> > > > > > on both cgroup v1 and v2. And modern systemd-based systems tend to
>> > > > > > create a large number of cgroups.
>> > > > > >
>> > > > > > This patchset provides a new implementation of the slab memory controller,
>> > > > > > which aims to reach a much better slab utilization by sharing slab pages
>> > > > > > between multiple memory cgroups. Below is the short description of the new
>> > > > > > design (more details in commit messages).
>> > > > > >
>> > > > > > Accounting is performed per-object instead of per-page. Slab-related
>> > > > > > vmstat counters are converted to bytes. Charging is performed on page-basis,
>> > > > > > with rounding up and remembering leftovers.
>> > > > > >
>> > > > > > Memcg ownership data is stored in a per-slab-page vector: for each slab page
>> > > > > > a vector of corresponding size is allocated. To keep slab memory reparenting
>> > > > > > working, instead of saving a pointer to the memory cgroup directly an
>> > > > > > intermediate object is used. It's simply a pointer to a memcg (which can be
>> > > > > > easily changed to the parent) with a built-in reference counter. This scheme
>> > > > > > allows to reparent all allocated objects without walking them over and
>> > > > > > changing memcg pointer to the parent.
>> > > > > >
>> > > > > > Instead of creating an individual set of kmem_caches for each memory cgroup,
>> > > > > > two global sets are used: the root set for non-accounted and root-cgroup
>> > > > > > allocations and the second set for all other allocations. This allows to
>> > > > > > simplify the lifetime management of individual kmem_caches: they are
>> > > > > > destroyed with root counterparts. It allows to remove a good amount of code
>> > > > > > and make things generally simpler.
>> > > > > >
>> > > > > > The patchset* has been tested on a number of different workloads in our
>> > > > > > production. In all cases it saved significant amount of memory, measured
>> > > > > > from high hundreds of MBs to single GBs per host. On average, the size
>> > > > > > of slab memory has been reduced by 35-45%.
>> > > > >
>> > > > > Here are some numbers from multiple runs of sysbench and kernel compilation
>> > > > > with this patchset on a 10 core POWER8 host:
>> > > > >
>> > > > > ==========================================================================
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for Sysbench oltp_read_write with mysqld running as part
>> > > > > of a mem cgroup (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes    15859712        4456448         72
>> > > > > memory.usage_in_bytes         337510400       335806464       .5
>> > > > > Slab: (kB)                    814336          607296          25
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    16187392        4653056         71
>> > > > > memory.usage_in_bytes         318832640       300154880       5
>> > > > > Slab: (kB)                    789888          559744          29
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > >
>> > > > > Peak usage of memory.kmem.usage_in_bytes, memory.usage_in_bytes and
>> > > > > meminfo:Slab for kernel compilation (make -s -j64) Compilation was
>> > > > > done from bash that is in a memory cgroup. (Sampling every 5s)
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > memory.kmem.usage_in_bytes    338493440       231931904       31
>> > > > > memory.usage_in_bytes         7368015872      6275923968      15
>> > > > > Slab: (kB)                    1139072         785408          31
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    341835776       236453888       30
>> > > > > memory.usage_in_bytes         6540427264      6072893440      7
>> > > > > Slab: (kB)                    1074304         761280          29
>> > > > >
>> > > > > memory.kmem.usage_in_bytes    340525056       233570304       31
>> > > > > memory.usage_in_bytes         6406209536      6177357824      3
>> > > > > Slab: (kB)                    1244288         739712          40
>> > > > > --------------------------------------------------------------------------
>> > > > >
>> > > > > Slab consumption right after boot
>> > > > > --------------------------------------------------------------------------
>> > > > >                               5.5.0-rc7-mm1   +slab patch     %reduction
>> > > > > --------------------------------------------------------------------------
>> > > > > Slab: (kB)                    821888          583424          29
>> > > > > ==========================================================================
>> > > > >
>> > > > > Summary:
>> > > > >
>> > > > > With sysbench and kernel compilation,  memory.kmem.usage_in_bytes shows
>> > > > > around 70% and 30% reduction consistently.
>> > > > >
>> > > > > Didn't see consistent reduction of memory.usage_in_bytes with sysbench and
>> > > > > kernel compilation.
>> > > > >
>> > > > > Slab usage (from /proc/meminfo) shows consistent 30% reduction and the
>> > > > > same is seen right after boot too.
>> > > >
>> > > > That's just perfect!
>> > > >
>> > > > memory.usage_in_bytes was most likely the same because the freed space
>> > > > was taken by pagecache.
>> > > >
>> > > > Thank you very much for testing!
>> > > >
>> > > > Roman
>
David Hildenbrand Sept. 2, 2020, 10:39 a.m. UTC | #12
> On 02.09.2020 at 11:53, Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
>> There appears to be another problem that is related to the
>> cgroup_mutex -> mem_hotplug_lock deadlock described above.
>> 
>> In the original deadlock that I described, the workaround is to
>> replace crash dump from piping to Linux traditional save to files
>> method. However, after trying this workaround, I still observed
>> hardware watchdog resets during machine  shutdown.
>> 
>> The new problem occurs for the following reason: upon shutdown systemd
>> calls a service that hot-removes memory, and if hot-removing fails for
> 
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

If all it's doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it's doing and why? (Arch? Hypervisor?)
Michal Hocko Sept. 2, 2020, 11:26 a.m. UTC | #13
On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > There appears to be another problem that is related to the
> > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > 
> > In the original deadlock that I described, the workaround is to
> > replace crash dump from piping to Linux traditional save to files
> > method. However, after trying this workaround, I still observed
> > hardware watchdog resets during machine  shutdown.
> > 
> > The new problem occurs for the following reason: upon shutdown systemd
> > calls a service that hot-removes memory, and if hot-removing fails for
> 
> Why is that hotremove even needed if we're shutting down? Are there any
> (virtualization?) platforms where it makes some difference over plain
> shutdown/restart?

Yes, this sounds quite dubious.

> > some reason systemd kills that service after timeout. However, systemd
> > is never able to kill the service, and we get hardware reset caused by
> > watchdog or a hang during shutdown:
> > 
> > Thread #1: memory hot-remove systemd service
> > Loops indefinitely, because if there is something still to be migrated
> > this loop never terminates. However, this loop can be terminated via
> > signal from systemd after timeout.
> > __offline_pages()
> >       do {
> >           pfn = scan_movable_pages(pfn, end_pfn);
> >                   # Returns 0, meaning there is nothing available to
> >                   # migrate, no page is PageLRU(page)
> >           ...
> >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> >                                             NULL, check_pages_isolated_cb);
> >                   # Returns -EBUSY, meaning there is at least one PFN that
> >                   # still has to be migrated.
> >       } while (ret);

This shouldn't really happen. What prevents this from proceeding?
Did you manage to catch the specific pfn and what is it used for?
start_isolate_page_range and scan_movable_pages should fail if there is
any memory that cannot be migrated permanently. This is something that
we should focus on when debugging.
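
Something like the sketch below next to the failing check would tell us
what pins the page (purely illustrative instrumentation, and the helper
name is made up; not a proposed patch):

static int dump_busy_pages_cb(unsigned long start_pfn,
			      unsigned long nr_pages, void *arg)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/* anything still referenced cannot pass isolation */
		if (page_count(page))
			dump_page(page, "offline: busy page");
	}
	return 0;
}

/* called right after check_pages_isolated_cb reports -EBUSY:
 * walk_system_ram_range(start_pfn, end_pfn - start_pfn,
 *			 NULL, dump_busy_pages_cb);
 */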
Michal Hocko Sept. 2, 2020, 11:32 a.m. UTC | #14
On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> >> > > Thread #2: css killer kthread
> >> > >    css_killed_work_fn
> >> > >      cgroup_mutex  <- Grab this Mutex
> >> > >      mem_cgroup_css_offline
> >> > >        memcg_offline_kmem.part
> >> > >           memcg_deactivate_kmem_caches
> >> > >             get_online_mems
> >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access

And one more thing. This has been brought up several times already.
Maybe I have forgotten, but why do we take hotplug locks in this path in
the first place? Memory hotplug notifier takes slab_mutex so this
shouldn't be really needed.
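
For reference, condensed from mm/slub.c of that era (a sketch with most
notifier cases elided, not the exact code), the hotplug notifier already
serializes against cache creation and destruction via slab_mutex:

static int slab_memory_callback(struct notifier_block *self,
				unsigned long action, void *arg)
{
	int ret = 0;

	switch (action) {
	case MEM_GOING_ONLINE:
		/* runs under slab_mutex while adding per-node structures */
		ret = slab_mem_going_online_callback(arg);
		break;
	case MEM_GOING_OFFLINE:
		/* runs under slab_mutex while shrinking the caches */
		ret = slab_mem_going_offline_callback(arg);
		break;
	}
	return notifier_from_errno(ret);
}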
Pasha Tatashin Sept. 2, 2020, 12:34 p.m. UTC | #15
> I am on an old codebase that already has the fix that you are proposing,
> so I might be seeing some other issue which I will debug further.
>
> So it looks like the loop in __offline_pages() had a call to
> drain_all_pages() before it was removed by
>
> c52e75935f8d: mm: remove extra drain pages on pcp list

I see, thanks. There is a reason to have the second drain; my fix is a
little better, as the drain is performed only on the rare occasions when
it is needed, but I should add a Fixes: tag. I have not checked the
alloc_contig_range race.
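
For the archive, the shape of the fix under discussion is roughly the
sketch below (it mirrors the __offline_pages() loop quoted earlier in
the thread; simplified, not the literal patch):

__offline_pages()
      do {
          pfn = scan_movable_pages(pfn, end_pfn);
          ...
          ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
                                            NULL, check_pages_isolated_cb);
          # pages freed to the per-cpu lists after the earlier drain
          # would otherwise keep this check returning -EBUSY forever
          if (ret)
                  drain_all_pages(zone);
      } while (ret);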
Pasha Tatashin Sept. 2, 2020, 12:42 p.m. UTC | #16
> > On 02.09.2020 at 11:53, Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> >> There appears to be another problem that is related to the
> >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> >>
> >> In the original deadlock that I described, the workaround is to
> >> replace crash dump from piping to Linux traditional save to files
> >> method. However, after trying this workaround, I still observed
> >> hardware watchdog resets during machine  shutdown.
> >>
> >> The new problem occurs for the following reason: upon shutdown systemd
> >> calls a service that hot-removes memory, and if hot-removing fails for
> >
> > Why is that hotremove even needed if we're shutting down? Are there any
> > (virtualization?) platforms where it makes some difference over plain
> > shutdown/restart?
>
> If all it's doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it's doing and why? (Arch? Hypervisor?)

Hi David,

This is how we are using it at Microsoft: there is a very large
number of small memory machines (8G each) with low downtime
requirements (reboot must be under a second). There is also a large
state, ~2G of memory, that we need to transfer during reboot; otherwise
it is very expensive to recreate the state. We have 2G of system
memory reserved as a pmem in the device tree, and use it to
pass information across reboots. Once the information is not needed we
hot-add that memory and use it during runtime; before shutdown we
hot-remove the 2G, save the program state on it, and do the reboot.

Pasha
Pasha Tatashin Sept. 2, 2020, 12:51 p.m. UTC | #17
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated
> > > this loop never terminates. However, this loop can be terminated via
> > > signal from systemd after timeout.
> > > __offline_pages()
> > >       do {
> > >           pfn = scan_movable_pages(pfn, end_pfn);
> > >                   # Returns 0, meaning there is nothing available to
> > >                   # migrate, no page is PageLRU(page)
> > >           ...
> > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                             NULL, check_pages_isolated_cb);
> > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > >                   # still has to be migrated.
> > >       } while (ret);
>

Hi Michal,

> This shouldn't really happen. What prevents this from proceeding?
> Did you manage to catch the specific pfn and what is it used for?

I did.

> start_isolate_page_range and scan_movable_pages should fail if there is
> any memory that cannot be migrated permanently. This is something that
> we should focus on when debugging.

I was hitting this issue:
mm/memory_hotplug: drain per-cpu pages again during memory offline
https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com

Once the pcp drain race is fixed, this particular deadlock becomes irrelevant.

The cgroup_mutex -> mem_hotplug_lock ordering, however, is still bad,
and the first race condition that I was hitting and described above is
still present. For now I added a temporary workaround: saving the core
to a file instead of piping it during shutdown. I am glad mainline
is fixed, but the stable branches should also get some kind of fix for
this problem.
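
Concretely, the workaround amounts to switching core_pattern away from
the coredump pipe handler, so a crashing task never blocks on a reader
that is itself stuck behind cgroup_mutex; along the lines of this sketch
(the target path is illustrative):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/core_pattern", "w");

	if (!f)
		return 1;
	/* was something like "|/usr/lib/systemd/systemd-coredump ..." */
	fprintf(f, "/var/crash/core.%%e.%%p\n");
	return fclose(f) ? 1 : 0;
}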

Pasha
Pasha Tatashin Sept. 2, 2020, 12:53 p.m. UTC | #18
On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > >> > > Thread #2: css killer kthread
> > >> > >    css_killed_work_fn
> > >> > >      cgroup_mutex  <- Grab this Mutex
> > >> > >      mem_cgroup_css_offline
> > >> > >        memcg_offline_kmem.part
> > >> > >           memcg_deactivate_kmem_caches
> > >> > >             get_online_mems
> > >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
>
> And one more thing. This has been brought up several times already.
> Maybe I have forgotten but why do we take hotplug locks in this path in
> the first place? Memory hotplug notifier takes slab_mutex so this
> shouldn't be really needed.

Good point, it seems this lock can be completely removed from
memcg_deactivate_kmem_caches.

Pasha

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Sept. 2, 2020, 1:50 p.m. UTC | #19
On Wed 02-09-20 08:42:13, Pavel Tatashin wrote:
> > > On 02.09.2020 at 11:53, Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 8/28/20 6:47 PM, Pavel Tatashin wrote:
> > >> There appears to be another problem that is related to the
> > >> cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >>
> > >> In the original deadlock that I described, the workaround is to
> > >> replace crash dump from piping to Linux traditional save to files
> > >> method. However, after trying this workaround, I still observed
> > >> hardware watchdog resets during machine  shutdown.
> > >>
> > >> The new problem occurs for the following reason: upon shutdown systemd
> > >> calls a service that hot-removes memory, and if hot-removing fails for
> > >
> > > Why is that hotremove even needed if we're shutting down? Are there any
> > > (virtualization?) platforms where it makes some difference over plain
> > > shutdown/restart?
> >
> > If all it's doing is offlining random memory that sounds unnecessary and dangerous. Any pointers to this service so we can figure out what it's doing and why? (Arch? Hypervisor?)
> 
> Hi David,
> 
> This is how we are using it at Microsoft: there is  a very large
> number of small memory machines (8G each) with low downtime
> requirements (reboot must be under a second). There is also a large
> state ~2G of memory that we need to transfer during reboot, otherwise
> it is very expensive to recreate the state. We have 2G of system
> memory reserved as a pmem in the device tree, and use it to
> pass information across reboots. Once the information is not needed we
> hot-add that memory and use it during runtime, before shutdown we
> hot-remove the 2G, save the program state on it, and do the reboot.

I still do not get it. So what guarantees that the memory is
offlineable in the first place? Also, what is the difference between
offlining and simply shutting the system down so that the memory is not
used in the first place? In other words, what kind of difference does
hotremove make?
Michal Hocko Sept. 2, 2020, 1:51 p.m. UTC | #20
On Wed 02-09-20 08:51:06, Pavel Tatashin wrote:
> > > > Thread #1: memory hot-remove systemd service
> > > > Loops indefinitely, because if there is something still to be migrated
> > > > this loop never terminates. However, this loop can be terminated via
> > > > signal from systemd after timeout.
> > > > __offline_pages()
> > > >       do {
> > > >           pfn = scan_movable_pages(pfn, end_pfn);
> > > >                   # Returns 0, meaning there is nothing available to
> > > >                   # migrate, no page is PageLRU(page)
> > > >           ...
> > > >           ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > > >                                             NULL, check_pages_isolated_cb);
> > > >                   # Returns -EBUSY, meaning there is at least one PFN that
> > > >                   # still has to be migrated.
> > > >       } while (ret);
> >
> 
> Hi Micahl,
> 
> > This shouldn't really happen. What does prevent from this to proceed?
> > Did you manage to catch the specific pfn and what is it used for?
> 
> I did.
> 
> > start_isolate_page_range and scan_movable_pages should fail if there is
> > any memory that cannot be migrated permanently. This is something that
> > we should focus on when debugging.
> 
> I was hitting this issue:
> mm/memory_hotplug: drain per-cpu pages again during memory offline
> https://lore.kernel.org/lkml/20200901124615.137200-1-pasha.tatashin@soleen.com

I have noticed the patch but didn't have time to think it through (I
have been off for a few days and am catching up with emails). I will
give it a higher priority.
Michal Hocko Sept. 2, 2020, 1:52 p.m. UTC | #21
On Wed 02-09-20 08:53:49, Pavel Tatashin wrote:
> On Wed, Sep 2, 2020 at 7:32 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 02-09-20 11:53:00, Vlastimil Babka wrote:
> > > >> > > Thread #2: css killer kthread
> > > >> > >    css_killed_work_fn
> > > >> > >      cgroup_mutex  <- Grab this Mutex
> > > >> > >      mem_cgroup_css_offline
> > > >> > >        memcg_offline_kmem.part
> > > >> > >           memcg_deactivate_kmem_caches
> > > >> > >             get_online_mems
> > > >> > >               mem_hotplug_lock <- waits for Thread#1 to get read access
> >
> > And one more thing. This has been brought up several times already.
> > Maybe I have forgotten but why do we take hotplug locks in this path in
> > the first place? Memory hotplug notifier takes slab_mutex so this
> > shouldn't be really needed.
> 
> Good point, it seems this lock can be completely removed from
> memcg_deactivate_kmem_caches

I am pretty sure we have discussed that in the past, but I do not
remember the outcome. Either we concluded that this is indeed the
case but nobody came up with a patch, or we hit some obscure
issue... Maybe David/Roman remember more than I do.
Pasha Tatashin Sept. 2, 2020, 2:20 p.m. UTC | #22
> > This is how we are using it at Microsoft: there is  a very large
> > number of small memory machines (8G each) with low downtime
> > requirements (reboot must be under a second). There is also a large
> > state ~2G of memory that we need to transfer during reboot, otherwise
> > it is very expensive to recreate the state. We have 2G of system
> > memory reserved as a pmem in the device tree, and use it to
> > pass information across reboots. Once the information is not needed we
> > hot-add that memory and use it during runtime, before shutdown we
> > hot-remove the 2G, save the program state on it, and do the reboot.
>
> I still do not get it. So what guarantees that the memory is
> offlineable in the first place?

It is in a movable zone, and we have more than 2G of free memory for
successful migrations.

> Also, what is the difference between
> offlining and simply shutting the system down so that the memory is not
> used in the first place? In other words, what kind of difference does
> hotremove make?

For performance reasons during system updates/reboots we do not erase
memory content. The memory content is erased only on power cycle,
which we do not do in production.

Once we hot-remove the memory, we convert it back into a DAXFS PMEM
device, format it as EXT4, mount it as a DAX file system, and allow
programs to serialize their states to it so they can read it back
after the reboot.

During startup we mount pmem, programs read the state back, and after
that we hotplug the PMEM DAX as a movable zone. This way during normal
runtime we have 8G available to programs.
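
To illustrate the handoff, the save/restore step for a program can be as
simple as the sketch below (the mount point, file name, and state layout
are all illustrative):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct app_state {
	long generation;
	char blob[4096];	/* whatever is expensive to recreate */
};

int main(void)
{
	struct app_state *st;
	int fd;

	fd = open("/mnt/pmem/app.state", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, sizeof(*st)))
		return 1;

	st = mmap(NULL, sizeof(*st), PROT_READ | PROT_WRITE,
		  MAP_SHARED, fd, 0);
	if (st == MAP_FAILED)
		return 1;

	st->generation++;		  /* survives the warm reboot */
	memcpy(st->blob, "state", 6);
	msync(st, sizeof(*st), MS_SYNC);  /* flush before the reboot */
	munmap(st, sizeof(*st));
	close(fd);
	return 0;
}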

Pasha
David Hildenbrand Sept. 3, 2020, 6:09 p.m. UTC | #23
> For performance reasons during system updates/reboots we do not erase
> memory content. The memory content is erased only on power cycle,
> which we do not do in production.
> 
> Once we hot-remove the memory, we convert it back into a DAXFS PMEM
> device, format it as EXT4, mount it as a DAX file system, and allow
> programs to serialize their states to it so they can read it back
> after the reboot.
> 
> During startup we mount pmem, programs read the state back, and after
> that we hotplug the PMEM DAX as a movable zone. This way during normal
> runtime we have 8G available to programs.
> 

Thanks for sharing the workflow - while it sounds somewhat suboptimal,
I guess it gets the job done using existing tools/mechanisms.

(I remember the persistent tmpfs over kexec RFC, which tries to tackle
it by introducing something new.)