mbox series

[RFC,0/5] add option to restore swap account to cgroupv1 mode

Message ID 20250319064148.774406-1-jingxiangzeng.cas@gmail.com (mailing list archive)
Headers show
Series add option to restore swap account to cgroupv1 mode | expand

Message

jingxiang zeng March 19, 2025, 6:41 a.m. UTC
From: Zeng Jingxiang <linuszeng@tencent.com>

memsw account is a very useful knob for container memory
overcommitting: It's a great abstraction of the "expected total
memory usage" of a container, so containers can't allocate too
much memory using SWAP, but still be able to SWAP out.

For a simple example, with memsw.limit == memory.limit, containers
can't exceed their original memory limit, even with SWAP enabled, they
get OOM killed as how they used to, but the host is now able to
offload cold pages.

Similar ability seems absent with V2: With memory.swap.max == 0, the
host can't use SWAP to reclaim container memory at all. But with a
value larger than that, containers are able to overuse memory, causing
delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
out of balance, especially with compress SWAP backends.

This patch set adds two interfaces to control the behavior of the
memory.swap.max/current in cgroupv2:

CONFIG_MEMSW_ACCOUNT_ON_DFL
cgroup.memsw_account_on_dfl={0, 1}

When one of the interfaces is enabled: memory.swap.current and
memory.swap.max represents the usage/limit of swap.
When neither is enabled (default behavior),memory.swap.current and
memory.swap.max represents the usage/limit of memory+swap.

As discussed in [1], this patch set can change the semantics of
of memory.swap.max/current to the v1-like behavior.

Link:
https://lore.kernel.org/all/Zk-fQtFrj-2YDJOo@P9FQF9L96D.corp.robot.car/ [1]

linuszeng (5):
  Kconfig: add SWAP_CHARGE_V1_MODE config
  memcontrol: add boot option to enable memsw account on dfl
  mm/memcontrol: do not scan anon pages if memsw limit is hit
  mm/memcontrol: allow memsw account in cgroup v2
  Docs/cgroup-v2: add cgroup.memsw_account_on_dfl Documentation

 Documentation/admin-guide/cgroup-v2.rst       | 21 +++++--
 .../admin-guide/kernel-parameters.txt         |  7 +++
 include/linux/memcontrol.h                    |  8 +++
 init/Kconfig                                  | 16 ++++++
 mm/memcontrol-v1.c                            |  2 +-
 mm/memcontrol-v1.h                            |  4 +-
 mm/memcontrol.c                               | 55 ++++++++++++++-----
 7 files changed, 93 insertions(+), 20 deletions(-)

Comments

Shakeel Butt March 19, 2025, 7:27 p.m. UTC | #1
+Tejun

Hi Zeng,

On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> From: Zeng Jingxiang <linuszeng@tencent.com>
> 
> memsw account is a very useful knob for container memory
> overcommitting: It's a great abstraction of the "expected total
> memory usage" of a container, so containers can't allocate too
> much memory using SWAP, but still be able to SWAP out.
> 
> For a simple example, with memsw.limit == memory.limit, containers
> can't exceed their original memory limit, even with SWAP enabled, they
> get OOM killed as how they used to, but the host is now able to
> offload cold pages.
> 
> Similar ability seems absent with V2: With memory.swap.max == 0, the
> host can't use SWAP to reclaim container memory at all. But with a
> value larger than that, containers are able to overuse memory, causing
> delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> out of balance, especially with compress SWAP backends.
> 
> This patch set adds two interfaces to control the behavior of the
> memory.swap.max/current in cgroupv2:
> 
> CONFIG_MEMSW_ACCOUNT_ON_DFL
> cgroup.memsw_account_on_dfl={0, 1}
> 
> When one of the interfaces is enabled: memory.swap.current and
> memory.swap.max represents the usage/limit of swap.
> When neither is enabled (default behavior),memory.swap.current and
> memory.swap.max represents the usage/limit of memory+swap.
> 
> As discussed in [1], this patch set can change the semantics of
> of memory.swap.max/current to the v1-like behavior.
> 
> Link:
> https://lore.kernel.org/all/Zk-fQtFrj-2YDJOo@P9FQF9L96D.corp.robot.car/ [1]

Overall I don't have objection but I would like to keep the changes
separate from v2 code as much as possible.

More specifically:

1. Keep CONFIG_MEMSW_ACCOUNT_ON_DFL dependent on CONFIG_MEMCG_V1 and
   disabled by default (as you already did).

2. Keep the changes in memcontrol-v1.[h|c] as much as possible.

I will go over the patches but let's see what others have to say.
Johannes Weiner March 19, 2025, 7:38 p.m. UTC | #2
On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> From: Zeng Jingxiang <linuszeng@tencent.com>
> 
> memsw account is a very useful knob for container memory
> overcommitting: It's a great abstraction of the "expected total
> memory usage" of a container, so containers can't allocate too
> much memory using SWAP, but still be able to SWAP out.
> 
> For a simple example, with memsw.limit == memory.limit, containers
> can't exceed their original memory limit, even with SWAP enabled, they
> get OOM killed as how they used to, but the host is now able to
> offload cold pages.
> 
> Similar ability seems absent with V2: With memory.swap.max == 0, the
> host can't use SWAP to reclaim container memory at all. But with a
> value larger than that, containers are able to overuse memory, causing
> delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> out of balance, especially with compress SWAP backends.
> 
> This patch set adds two interfaces to control the behavior of the
> memory.swap.max/current in cgroupv2:
> 
> CONFIG_MEMSW_ACCOUNT_ON_DFL
> cgroup.memsw_account_on_dfl={0, 1}
> 
> When one of the interfaces is enabled: memory.swap.current and
> memory.swap.max represents the usage/limit of swap.
> When neither is enabled (default behavior),memory.swap.current and
> memory.swap.max represents the usage/limit of memory+swap.

This should be new knobs, e.g. memory.memsw.current, memory.memsw.max.

Overloading the existing swap knobs is confusing.

And there doesn't seem to be a good reason to make the behavior
either-or anyway. If memory.swap.max=max (default), it won't interfere
with the memsw operation. And it's at least conceivable somebody might
want to set both, memsw.max > swap.max, to get some flexibility while
excluding the craziest edge cases.
Shakeel Butt March 19, 2025, 7:51 p.m. UTC | #3
On Wed, Mar 19, 2025 at 03:38:38PM -0400, Johannes Weiner wrote:
> On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> > 
> > memsw account is a very useful knob for container memory
> > overcommitting: It's a great abstraction of the "expected total
> > memory usage" of a container, so containers can't allocate too
> > much memory using SWAP, but still be able to SWAP out.
> > 
> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> > 
> > Similar ability seems absent with V2: With memory.swap.max == 0, the
> > host can't use SWAP to reclaim container memory at all. But with a
> > value larger than that, containers are able to overuse memory, causing
> > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> > out of balance, especially with compress SWAP backends.
> > 
> > This patch set adds two interfaces to control the behavior of the
> > memory.swap.max/current in cgroupv2:
> > 
> > CONFIG_MEMSW_ACCOUNT_ON_DFL
> > cgroup.memsw_account_on_dfl={0, 1}
> > 
> > When one of the interfaces is enabled: memory.swap.current and
> > memory.swap.max represents the usage/limit of swap.
> > When neither is enabled (default behavior),memory.swap.current and
> > memory.swap.max represents the usage/limit of memory+swap.
> 
> This should be new knobs, e.g. memory.memsw.current, memory.memsw.max.
> 
> Overloading the existing swap knobs is confusing.
> 
> And there doesn't seem to be a good reason to make the behavior
> either-or anyway. If memory.swap.max=max (default), it won't interfere
> with the memsw operation. And it's at least conceivable somebody might
> want to set both, memsw.max > swap.max, to get some flexibility while
> excluding the craziest edge cases.

At the moment memsw and swap shares the underlying page_counter. This
would require having explicit page_counter for memsw.

What's your take on memsw interfaces still behind
CONFIG_MEMSW_ACCOUNT_ON_DFL?
jingxiang zeng March 20, 2025, 8:09 a.m. UTC | #4
On Thu, 20 Mar 2025 at 03:38, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> >
> > memsw account is a very useful knob for container memory
> > overcommitting: It's a great abstraction of the "expected total
> > memory usage" of a container, so containers can't allocate too
> > much memory using SWAP, but still be able to SWAP out.
> >
> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> >
> > Similar ability seems absent with V2: With memory.swap.max == 0, the
> > host can't use SWAP to reclaim container memory at all. But with a
> > value larger than that, containers are able to overuse memory, causing
> > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> > out of balance, especially with compress SWAP backends.
> >
> > This patch set adds two interfaces to control the behavior of the
> > memory.swap.max/current in cgroupv2:
> >
> > CONFIG_MEMSW_ACCOUNT_ON_DFL
> > cgroup.memsw_account_on_dfl={0, 1}
> >
> > When one of the interfaces is enabled: memory.swap.current and
> > memory.swap.max represents the usage/limit of swap.
> > When neither is enabled (default behavior),memory.swap.current and
> > memory.swap.max represents the usage/limit of memory+swap.
>
> This should be new knobs, e.g. memory.memsw.current, memory.memsw.max.
>
> Overloading the existing swap knobs is confusing.
>
> And there doesn't seem to be a good reason to make the behavior
> either-or anyway. If memory.swap.max=max (default), it won't interfere
> with the memsw operation. And it's at least conceivable somebody might
> want to set both, memsw.max > swap.max, to get some flexibility while
> excluding the craziest edge cases.

Hi Johannes,

If both memsw.max and swap.max are provided in cgroupv2, there will be some
issues as follows:
(1. As Shakeel Butt mentioned, currently memsw and swap share the page_counter,
and we need to provide a separate page_counter for memsw.
(2. Currently, the statistics for memsw and swap are mutually
exclusive. For example,
during uncharging, both memsw and swap call the __mem_cgroup_uncharge_swap
function together, and this function currently only selects a single
counter for statistics
based on the static do_memsw_account.

As mentioned above, this patch set considers the approach suggested by Roman
Gushchin[1], which involves switching to cgroupv1 behavior through a
configuration
option, making it easier to implement.

Link: https://lore.kernel.org/all/Zk-fQtFrj-2YDJOo@P9FQF9L96D.corp.robot.car/
[1]
Johannes Weiner March 20, 2025, 3:08 p.m. UTC | #5
On Thu, Mar 20, 2025 at 04:09:29PM +0800, jingxiang zeng wrote:
> If both memsw.max and swap.max are provided in cgroupv2, there will be some
> issues as follows:
> (1. As Shakeel Butt mentioned, currently memsw and swap share the page_counter,
> and we need to provide a separate page_counter for memsw.
> (2. Currently, the statistics for memsw and swap are mutually
> exclusive. For example,
> during uncharging, both memsw and swap call the __mem_cgroup_uncharge_swap
> function together, and this function currently only selects a single
> counter for statistics
> based on the static do_memsw_account.

My suggestion is to factor out from struct page_counter all the stuff
that is not necessary for all users, and then have separate counters
for swap and memsw.

The protection stuff is long overdue for this. It makes up nearly half
of the struct's members, but is only used by the memory counter. Even
before your patches this is unnecessary bloat in the swap/memsw, kmem
and tcpmem counters.

Fix that and having separate counters is a non-issue.

Overloading the memory.swap.* controls to mean "combined memory and
swap" is unacceptable to me. They have established meaning on v2. It
may well become a common setting, and there might well be usecases
where people want one setting for one cgroup and another setting for
another cgroup. Or maybe even both controls in one group. And why not?

This is a user-visible interface that we'll have to support forever,
and deployments will be forced to use forever. Tech debt in the
current implementation is not a convincing argument to forever trap us
all with a suboptimal choice.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>