Message ID: 20240509034138.2207186-1-roman.gushchin@linux.dev (mailing list archive)
Series: mm: memcg: separate legacy cgroup v1 code and put under config option
On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > Cgroups v2 have been around for a while and many users have fully adopted them, > so they never use cgroups v1 features and functionality. Yet they have to "pay" > for the cgroup v1 support anyway: > 1) the kernel binary contains useless cgroup v1 code, > 2) some common structures like task_struct and mem_cgroup have never used > cgroup v1-specific members, > 3) some code paths have additional checks which are not needed. > > Cgroup v1's memory controller has a number of features that are not supported > by cgroup v2 and their implementation is pretty much self contained. > Most notably, these features are: soft limit reclaim, oom handling in userspace, > complicated event notification system, charge migration. > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > intervened with generic and cgroup v2-specific code. It's a burden on > developers and maintainers. > > This patchset aims to solve these problems by: > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > mm/internal.h header > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > 5) putting unused struct memory_cgroup and task_struct members under > CONFIG_MEMCG_V1 as well. > > This is an RFC version, which is not 100% polished yet, so but it would be great > to discuss and agree on the overall approach. > > Some open questions, opinions are appreciated: > 1) I consider renaming non-static functions in memcontrol-v1.c to have > mem_cgroup_v1_ prefix. Is this a good idea? > 2) Do we want to extend it beyond the memory controller? Should > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > mm/internal.h? Or mm/memcontrol-v1.h. > Hi Roman, A very timely and important topic and we should definitely talk about it during LSFMM as well. I have been thinking about this problem for quite some time and I am getting more and more convinced that we should aim to completely deprecate memcg-v1. More specifically: 1. What are the memcg-v1 features which have no alternative in memcg-v2 and are blockers for memcg-v1 users? (setting aside the cgroup v2 structural restrictions) 2. What are unused memcg-v1 features which we should start deprecating? IMO we should systematically start deprecating memcg-v1 features and start unblocking the users stuck on memcg-v1. Now regarding the proposal in this series, I think it can be a first step but should not give an impression that we are done. The only concern I have is the potential of an "out of sight, out of mind" situation with this change but if we keep the momentum of deprecation of memcg-v1 it should be fine. I have CCed Greg and David from Google to get their opinion on what memcg-v1 features are blockers for their memcg-v2 migration and if they have concerns about the deprecation of memcg-v1 features. Anyone else still on memcg-v1, please do provide your input. thanks, Shakeel
On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > Cgroups v2 have been around for a while and many users have fully adopted them, > so they never use cgroups v1 features and functionality. Yet they have to "pay" > for the cgroup v1 support anyway: > 1) the kernel binary contains useless cgroup v1 code, > 2) some common structures like task_struct and mem_cgroup have never used > cgroup v1-specific members, > 3) some code paths have additional checks which are not needed. > > Cgroup v1's memory controller has a number of features that are not supported > by cgroup v2 and their implementation is pretty much self contained. > Most notably, these features are: soft limit reclaim, oom handling in userspace, > complicated event notification system, charge migration. > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > intervened with generic and cgroup v2-specific code. It's a burden on > developers and maintainers. Great patchset. The moves look clean and straight-forward to me on first glance. > This patchset aims to solve these problems by: > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, +1 > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > mm/internal.h header You proposed mm/memcontrol-v1.h below, IMO that's the best option. > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default +1 CONFIG_MEMCG1 should also work. > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set +1 > 5) putting unused struct memory_cgroup and task_struct members under > CONFIG_MEMCG_V1 as well. +1 > > This is an RFC version, which is not 100% polished yet, so but it would be great > to discuss and agree on the overall approach. > > Some open questions, opinions are appreciated: > 1) I consider renaming non-static functions in memcontrol-v1.c to have > mem_cgroup_v1_ prefix. Is this a good idea? I think this would be great, to make it more obvious in memcontrol.c. For core cgroup code, we used cgroup1_foo(). Maybe name them all things like memcg1_update_tree() etc.? That's short and sweet while sticking out visually pretty well. > 2) Do we want to extend it beyond the memory controller? Should Could you please elaborate? ^_^ > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > mm/internal.h? Or mm/memcontrol-v1.h. mm/memcontrol-v1.h sounds good to me. > mm/memcontrol.c | 4121 ++++++++++++++++++++++--------------------------------------------------------------------------------------------------------------------------------- Lol, awesome.
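As a rough illustration of points 3)-5) of the cover letter and the memcg1_ naming idea discussed above (the member and function names below are placeholders, not taken from the actual patches), v1-only state and entry points would simply compile out when CONFIG_MEMCG_V1 is disabled:

```c
/*
 * Sketch only -- names are illustrative, not from the patchset.
 * v1-only members and helpers disappear when CONFIG_MEMCG_V1 is not
 * set; callers in memcontrol.c keep compiling thanks to empty stubs.
 */
struct mem_cgroup {
	/* ... members shared by cgroup v1 and v2 ... */
#ifdef CONFIG_MEMCG_V1
	unsigned long soft_limit;	/* v1 soft limit reclaim state */
#endif
};

#ifdef CONFIG_MEMCG_V1
void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
#else
static inline void memcg1_update_tree(struct mem_cgroup *memcg, int nid) { }
#endif
```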
On Thu, May 09, 2024 at 10:22:10AM -0400, Johannes Weiner wrote: > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > mm/memcontrol-v1.h sounds good to me. Argh, there is a folio_memcg_lock() callsite in fs/buffer.c. I suppose include/linux/memcontrol-v1.h makes the most sense then.
On Thu, May 09, 2024 at 10:36:35AM -0400, Johannes Weiner wrote: > On Thu, May 09, 2024 at 10:22:10AM -0400, Johannes Weiner wrote: > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > mm/memcontrol-v1.h sounds good to me. > > Argh, there is a folio_memcg_lock() callsite in fs/buffer.c. I suppose > include/linux/memcontrol-v1.h makes the most sense then. You mean put everything into include/linux/memcontrol-v1.h? And functions from memcontrol.c used by memcontrol-v1.c into include/linux/memcontrol.h? It's an option I considered, but the downside is that we're "leaking" a lot of internal definitions into the outside world, because memcontrol.h is included everywhere. So maybe mm/memcontrol-v1.h for definitions shared between v1 and v2 and keep exported functions in include/linux/memcontrol.h? There are only a few of them. Thanks!
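A possible split along those lines might look like the sketch below (declarations are illustrative, not the actual patch): definitions shared only between memcontrol.c and memcontrol-v1.c stay in an mm/-local header, while the handful of symbols needed outside mm/, such as folio_memcg_lock() used from fs/buffer.c, remain in include/linux/memcontrol.h:

```c
/* mm/memcontrol-v1.h -- sketch: visible only to memcontrol.c and
 * memcontrol-v1.c, so v1 internals don't leak into the rest of the
 * kernel. The helper name is a placeholder. */
#ifndef __MM_MEMCONTROL_V1_H
#define __MM_MEMCONTROL_V1_H

struct mem_cgroup;

/* called from the generic charge path when v1 features are in use */
void memcg1_check_events(struct mem_cgroup *memcg, int nid);

#endif /* __MM_MEMCONTROL_V1_H */

/* include/linux/memcontrol.h keeps only what external callers need, e.g.: */
struct folio;
void folio_memcg_lock(struct folio *folio);
void folio_memcg_unlock(struct folio *folio);
```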
On Wed, May 08, 2024 at 11:33:07PM -0700, Shakeel Butt wrote: > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > Cgroups v2 have been around for a while and many users have fully adopted them, > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > for the cgroup v1 support anyway: > > 1) the kernel binary contains useless cgroup v1 code, > > 2) some common structures like task_struct and mem_cgroup have never used > > cgroup v1-specific members, > > 3) some code paths have additional checks which are not needed. > > > > Cgroup v1's memory controller has a number of features that are not supported > > by cgroup v2 and their implementation is pretty much self contained. > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > complicated event notification system, charge migration. > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > intervened with generic and cgroup v2-specific code. It's a burden on > > developers and maintainers. > > > > This patchset aims to solve these problems by: > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > mm/internal.h header > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > 5) putting unused struct memory_cgroup and task_struct members under > > CONFIG_MEMCG_V1 as well. > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > to discuss and agree on the overall approach. > > > > Some open questions, opinions are appreciated: > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > mem_cgroup_v1_ prefix. Is this a good idea? > > 2) Do we want to extend it beyond the memory controller? Should > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) I don't think there are any, except there might be a certain cost to migrate, so some companies might be resistant to put in resources, because they don't see any immediate benefits as well. > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. I'm not sure we want to deprecate them one-by-one - it's a lot of work and maybe we can deprecate it all together instead. I think the only feature which we might want to deprecate separately - it's the charge migration. It's the most annoying feature as it requires a lot more synchronization, which can be dropped otherwise, so it's complicating a lot of things. Other features are more or less self-contained. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. Yeah, it's really only a first step. 
> The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. My rough plan here: 1) move it out to a separate file and put under a config option, default on 2) clean up all remaining small bits here and there ... < wait a year > 3) flip the config option to be off by default ... < wait another year or two > 4) drop the code entirely > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. Thank you!
On Wed, 8 May 2024, Shakeel Butt wrote: > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > I think this would be a very worthwhile discussion at LSF/MM; I'm not sure if it would be too late for someone to make a formal proposal for it to be included in the schedule. Michal would know if there is an opportunity. I say that in light of https://lore.kernel.org/bpf/ZjL5b-zipMrV2JSg@archie.me/T/#mb6c21b09543c434dd85e718a8ecf5ca6485e6d07 as well for the whole cgroup v1 -> v2 transition. Chris, now cc'd, would know best about all of the dependencies that Google has for memcg specifically. > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. > > Anyone else still on memcg-v1, please do provide your input. > > thanks, > Shakeel >
On Thu, May 9, 2024 at 7:59 PM David Rientjes <rientjes@google.com> wrote: > > On Wed, 8 May 2024, Shakeel Butt wrote: > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > I think this would be a very worthwhile discussion at LSF/MM, I'm not sure > if it would be too late for someone to make a formal proposal for it to be > included in the schedule. Michal would know if there is a opportunity. > > I say that in light of > https://lore.kernel.org/bpf/ZjL5b-zipMrV2JSg@archie.me/T/#mb6c21b09543c434dd85e718a8ecf5ca6485e6d07 > as well for the whole cgroup v1 -> v2 transition. > > Chris, now cc'd, would know best about all of the dependencies that Google > has for memcg specifically. Thanks David, Yes, I am very interested in that cgroup v1 -> v2 transition discussion. > > > More specifically: > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > structual restrictions) In the list mentioned by Roman: "soft limit reclaim, oom handling in userspace, complicated event notification system, charge migration." The "oom.control" and the lack of user space oom control is a big one for Google. Some test frameworks also use "memory.force_empty". Soft limit reclaim and charge migration are also used. There is also the combined "memsw" limit enforcement. Google has some internal workaround for V2 but it would be good if upstream could support it directly. BTW, I know you are not looking for the "cgroup v2 structural restrictions". Two cgroup controllers that can't have different sets of processes is a bit too restrictive. That is what I recall right now, I might be missing some small odd items. Anyway, glad to join the discussion if there is a session. Chris > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > IMO we should systematically start deprecating memcg-v1 features and > > start unblocking the users stuck on memcg-v1. > > > > Now regarding the proposal in this series, I think it can be a first > > step but should not give an impression that we are done. The only > > concern I have is the potential of "out of sight, out of mind" situation > > with this change but if we keep the momentum of deprecation of memcg-v1 > > it should be fine. > > > > I have CCed Greg and David from Google to get their opinion on what > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > have concern in deprecation of memcg-v1 features. > > > > Anyone else still on memcg-v1, please do provide your input. > > > > thanks, > > Shakeel > >
On Thu 09-05-24 19:59:19, David Rientjes wrote: > On Wed, 8 May 2024, Shakeel Butt wrote: > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > I think this would be a very worthwhile discussion at LSF/MM, I'm not sure > if it would be too late for someone to make a formal proposal for it to be > included in the schedule. Michal would know if there is a opportunity. yes, I think we can and should have this discussion. I will put that on the schedule. I will reference this email thread as a topic proposal with Shakeel and Roman to lead the session.
On Wed 08-05-24 20:41:29, Roman Gushchin wrote: > Cgroups v2 have been around for a while and many users have fully adopted them, > so they never use cgroups v1 features and functionality. Yet they have to "pay" > for the cgroup v1 support anyway: > 1) the kernel binary contains useless cgroup v1 code, > 2) some common structures like task_struct and mem_cgroup have never used > cgroup v1-specific members, > 3) some code paths have additional checks which are not needed. > > Cgroup v1's memory controller has a number of features that are not supported > by cgroup v2 and their implementation is pretty much self contained. > Most notably, these features are: soft limit reclaim, oom handling in userspace, > complicated event notification system, charge migration. > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > intervened with generic and cgroup v2-specific code. It's a burden on > developers and maintainers. > > This patchset aims to solve these problems by: > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > mm/internal.h header > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > 5) putting unused struct memory_cgroup and task_struct members under > CONFIG_MEMCG_V1 as well. This makes sense and I have to admit I didn't realize there was so much code to move. It will make the code base much easier to follow. I do not think we can drop that code anytime soon as there is still quite a lot of use of v1 out there. From my experience there is no good reason for many of them other than inertia and those are just waiting for somebody to move them to v2. There are some workloads which depend on v1-only features and we should discuss what to do about those.
On Thu, May 09, 2024 at 07:57:30AM -0700, Roman Gushchin wrote: > So maybe mm/memcontrol-v1.h for definitions shared between v1 and v2 > and keep exported functions in include/linux/memcontrol.h? There are > only few of them. That sounds best to me.
On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > Cgroups v2 have been around for a while and many users have fully adopted them, > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > for the cgroup v1 support anyway: > > 1) the kernel binary contains useless cgroup v1 code, > > 2) some common structures like task_struct and mem_cgroup have never used > > cgroup v1-specific members, > > 3) some code paths have additional checks which are not needed. > > > > Cgroup v1's memory controller has a number of features that are not supported > > by cgroup v2 and their implementation is pretty much self contained. > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > complicated event notification system, charge migration. > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > intervened with generic and cgroup v2-specific code. It's a burden on > > developers and maintainers. > > > > This patchset aims to solve these problems by: > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > mm/internal.h header > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > 5) putting unused struct memory_cgroup and task_struct members under > > CONFIG_MEMCG_V1 as well. > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > to discuss and agree on the overall approach. > > > > Some open questions, opinions are appreciated: > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > mem_cgroup_v1_ prefix. Is this a good idea? > > 2) Do we want to extend it beyond the memory controller? Should > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. > > Anyone else still on memcg-v1, please do provide your input. Hi Shakeel, Hopefully I'm not too late. We are currently using memcg v1. One specific feature we rely on in v1 is skmem accounting. 
In v1, we account for TCP memory usage without charging it to memcg v1, which is useful for monitoring the TCP memory usage generated by tasks running in a container. However, in memcg v2, monitoring TCP memory requires charging it to the container, which can easily cause OOM issues. It would be better if we could monitor skmem usage without charging it in memcg v2, allowing us to account for it without the risk of triggering OOM conditions.
On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > for the cgroup v1 support anyway: > > > 1) the kernel binary contains useless cgroup v1 code, > > > 2) some common structures like task_struct and mem_cgroup have never used > > > cgroup v1-specific members, > > > 3) some code paths have additional checks which are not needed. > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > by cgroup v2 and their implementation is pretty much self contained. > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > complicated event notification system, charge migration. > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > developers and maintainers. > > > > > > This patchset aims to solve these problems by: > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > mm/internal.h header > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > 5) putting unused struct memory_cgroup and task_struct members under > > > CONFIG_MEMCG_V1 as well. > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > to discuss and agree on the overall approach. > > > > > > Some open questions, opinions are appreciated: > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > 2) Do we want to extend it beyond the memory controller? Should > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > More specifically: > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > structual restrictions) > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > IMO we should systematically start deprecating memcg-v1 features and > > start unblocking the users stuck on memcg-v1. > > > > Now regarding the proposal in this series, I think it can be a first > > step but should not give an impression that we are done. The only > > concern I have is the potential of "out of sight, out of mind" situation > > with this change but if we keep the momentum of deprecation of memcg-v1 > > it should be fine. > > > > I have CCed Greg and David from Google to get their opinion on what > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > have concern in deprecation of memcg-v1 features. > > > > Anyone else still on memcg-v1, please do provide your input. 
> > Hi Shakeel, > > Hopefully I'm not too late. We are currently using memcg v1. > > One specific feature we rely on in v1 is skmem accounting. In v1, we > account for TCP memory usage without charging it to memcg v1, which is > useful for monitoring the TCP memory usage generated by tasks running > in a container. However, in memcg v2, monitoring TCP memory requires > charging it to the container, which can easily cause OOM issues. It > would be better if we could monitor skmem usage without charging it in > the memcg v2, allowing us to account for it without the risk of > triggering OOM conditions. Hi Yafang, the data itself is available on cgroup v2 in memory.stat:sock, however you're right, it's charged on par with other types of memory. It was one of the main principles of cgroup v2's memory controller, so I don't think it can be changed. So the feature you need is not skmem accounting, but something quite the opposite :) The question I have: what makes socket memory different here? Is it something specific to your setup (e.g. you mostly use memory.max to protect against memory leaks in the userspace code, but socket memory spikes are always caused by external traffic and are legit) or do we have more fundamental problems with the socket memory handling, e.g. we can't effectively reclaim it under memory pressure? In the first case you can maintain a ~2-line non-upstream patch which will disable the charging while maintaining statistics - it's not perfect, but likely the best option here. In the second case we need to collectively fix it for cgroup v2. Thanks!
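For reference, the kind of minimal out-of-tree tweak being described might look roughly like the sketch below, assuming the v2 socket charge path still goes through mem_cgroup_charge_skmem() and the MEMCG_SOCK counter behind memory.stat:sock; the uncharge side would need the matching change, and this is illustrative, not upstream code:

```c
/* Sketch of a "keep the stat, drop the charge" variant; not upstream code. */
bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
			     gfp_t gfp_mask)
{
	/* keep memory.stat:sock accurate for monitoring */
	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
	/* never count socket memory against memory.max, so no OOM/throttling */
	return true;
}
```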
On Fri, May 17, 2024 at 1:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > > for the cgroup v1 support anyway: > > > > 1) the kernel binary contains useless cgroup v1 code, > > > > 2) some common structures like task_struct and mem_cgroup have never used > > > > cgroup v1-specific members, > > > > 3) some code paths have additional checks which are not needed. > > > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > > by cgroup v2 and their implementation is pretty much self contained. > > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > > complicated event notification system, charge migration. > > > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > > developers and maintainers. > > > > > > > > This patchset aims to solve these problems by: > > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > > mm/internal.h header > > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > > 5) putting unused struct memory_cgroup and task_struct members under > > > > CONFIG_MEMCG_V1 as well. > > > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > > to discuss and agree on the overall approach. > > > > > > > > Some open questions, opinions are appreciated: > > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > > 2) Do we want to extend it beyond the memory controller? Should > > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > > > > Hi Roman, > > > > > > A very timely and important topic and we should definitely talk about it > > > during LSFMM as well. I have been thinking about this problem for quite > > > sometime and I am getting more and more convinced that we should aim to > > > completely deprecate memcg-v1. > > > > > > More specifically: > > > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > > structual restrictions) > > > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > > > IMO we should systematically start deprecating memcg-v1 features and > > > start unblocking the users stuck on memcg-v1. > > > > > > Now regarding the proposal in this series, I think it can be a first > > > step but should not give an impression that we are done. The only > > > concern I have is the potential of "out of sight, out of mind" situation > > > with this change but if we keep the momentum of deprecation of memcg-v1 > > > it should be fine. 
> > > > > > I have CCed Greg and David from Google to get their opinion on what > > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > > have concern in deprecation of memcg-v1 features. > > > > > > Anyone else still on memcg-v1, please do provide your input. > > > > Hi Shakeel, > > > > Hopefully I'm not too late. We are currently using memcg v1. > > > > One specific feature we rely on in v1 is skmem accounting. In v1, we > > account for TCP memory usage without charging it to memcg v1, which is > > useful for monitoring the TCP memory usage generated by tasks running > > in a container. However, in memcg v2, monitoring TCP memory requires > > charging it to the container, which can easily cause OOM issues. It > > would be better if we could monitor skmem usage without charging it in > > the memcg v2, allowing us to account for it without the risk of > > triggering OOM conditions. > > Hi Yafang, > > the data itself is available on cgroup v2 in memory.stat:sock, however > you're right, it's charged on pair with other types of memory. It was > one of the main principles of cgroup v2's memory controller, so I don't > think it can be changed. > > So the feature you need is not skmem accounting, but something quite > opposite :) > > The question I have here: what makes socket memory different here? > > Is it something specific to your setup (e.g. you mostly use memory.max > to protect against memory leaks in the userspace code, but socket memory > spikes are always caused by external traffic and are legit) or we have > more fundamental problems with the socket memory handling, e.g. we can't > effectively reclaim it under the memory pressure? It is the first case. > > In the first case you can maintain a ~2-lines non-upstream patch which will > disable the charging while maintaining statistics - it's not a perfect, but > likely the best option here. In the second case we need collectively fix it > for cgroup v2. > Thank you for your advice. Currently, we do not have any immediate plans to migrate to cgroup v2. If we are required to use cgroup v2 in the future, we will need to maintain non-upstream patches. By the way, is there any reason we cannot keep this behavior consistent with memcg v1 in the upstream kernel? That would save us from having to maintain it locally.
On Fri, May 17, 2024 at 10:21:01AM +0800, Yafang Shao wrote: > On Fri, May 17, 2024 at 1:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > > > for the cgroup v1 support anyway: > > > > > 1) the kernel binary contains useless cgroup v1 code, > > > > > 2) some common structures like task_struct and mem_cgroup have never used > > > > > cgroup v1-specific members, > > > > > 3) some code paths have additional checks which are not needed. > > > > > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > > > by cgroup v2 and their implementation is pretty much self contained. > > > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > > > complicated event notification system, charge migration. > > > > > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > > > developers and maintainers. > > > > > > > > > > This patchset aims to solve these problems by: > > > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > > > mm/internal.h header > > > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > > > 5) putting unused struct memory_cgroup and task_struct members under > > > > > CONFIG_MEMCG_V1 as well. > > > > > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > > > to discuss and agree on the overall approach. > > > > > > > > > > Some open questions, opinions are appreciated: > > > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > > > 2) Do we want to extend it beyond the memory controller? Should > > > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > > > > > > > Hi Roman, > > > > > > > > A very timely and important topic and we should definitely talk about it > > > > during LSFMM as well. I have been thinking about this problem for quite > > > > sometime and I am getting more and more convinced that we should aim to > > > > completely deprecate memcg-v1. > > > > > > > > More specifically: > > > > > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > > > structual restrictions) > > > > > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > > > > > IMO we should systematically start deprecating memcg-v1 features and > > > > start unblocking the users stuck on memcg-v1. > > > > > > > > Now regarding the proposal in this series, I think it can be a first > > > > step but should not give an impression that we are done. 
The only > > > > concern I have is the potential of "out of sight, out of mind" situation > > > > with this change but if we keep the momentum of deprecation of memcg-v1 > > > > it should be fine. > > > > > > > > I have CCed Greg and David from Google to get their opinion on what > > > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > > > have concern in deprecation of memcg-v1 features. > > > > > > > > Anyone else still on memcg-v1, please do provide your input. > > > > > > Hi Shakeel, > > > > > > Hopefully I'm not too late. We are currently using memcg v1. > > > > > > One specific feature we rely on in v1 is skmem accounting. In v1, we > > > account for TCP memory usage without charging it to memcg v1, which is > > > useful for monitoring the TCP memory usage generated by tasks running > > > in a container. However, in memcg v2, monitoring TCP memory requires > > > charging it to the container, which can easily cause OOM issues. It > > > would be better if we could monitor skmem usage without charging it in > > > the memcg v2, allowing us to account for it without the risk of > > > triggering OOM conditions. > > > > Hi Yafang, > > > > the data itself is available on cgroup v2 in memory.stat:sock, however > > you're right, it's charged on pair with other types of memory. It was > > one of the main principles of cgroup v2's memory controller, so I don't > > think it can be changed. > > > > So the feature you need is not skmem accounting, but something quite > > opposite :) > > > > The question I have here: what makes socket memory different here? > > > > Is it something specific to your setup (e.g. you mostly use memory.max > > to protect against memory leaks in the userspace code, but socket memory > > spikes are always caused by external traffic and are legit) or we have > > more fundamental problems with the socket memory handling, e.g. we can't > > effectively reclaim it under the memory pressure? > > It is the first case. > > > > > In the first case you can maintain a ~2-lines non-upstream patch which will > > disable the charging while maintaining statistics - it's not a perfect, but > > likely the best option here. In the second case we need collectively fix it > > for cgroup v2. > > > > Thank you for your advice. Currently, we do not have any immediate > plans to migrate to cgroup v2. If we are required to use cgroup v2 in > the future, we will need to maintain non-upstream patches. > > By the way, is there any reason we cannot keep this behavior > consistent with memcg v1 in the upstream kernel? That would save us > from having to maintain it locally. The idea to handle various types of memory independently isn't working well for most users: it makes the configuration trickier and more fragile. It's also more expensive in terms of the accounting overhead. The tcpmem accounting is btw quite expensive by itself. So by switching to cgroup v2 you might see (depending on your traffic and cpu load) some nice performance benefits. Thanks!
On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > [...] > Hi Shakeel, > > Hopefully I'm not too late. We are currently using memcg v1. > > One specific feature we rely on in v1 is skmem accounting. In v1, we > account for TCP memory usage without charging it to memcg v1, which is > useful for monitoring the TCP memory usage generated by tasks running > in a container. However, in memcg v2, monitoring TCP memory requires > charging it to the container, which can easily cause OOM issues. It > would be better if we could monitor skmem usage without charging it in > the memcg v2, allowing us to account for it without the risk of > triggering OOM conditions. > Hi Yafang, No worries. From what I understand, you are not really using skmem charging of v1 but just the network memory usage stats and you are worried that charging network memory to cgroup memory may cause OOMs. Is that correct? Have you tried charging network memory to cgroup memory before and seen OOMs? If yes, then I would really like to see the OOM reports. I have two examples where v2's skmem charging is working fine in production, namely Google and Meta. Google is still on v1 but for skmem charging, they have moved to v2 semantics. Actually I have another report from Cloudflare [0] where the tcp throttling mechanism for v2's tcp memory accounting is too conservative for their production traffic. Anyways this just means that we need a more flexible way to provide and enforce semantics for tcp memory pressure with a decent default behavior. I will follow up on this separately. [0] https://lore.kernel.org/lkml/CABWYdi0G7cyNFbndM-ELTDAR3x4Ngm0AehEp5aP0tfNkXUE+Uw@mail.gmail.com/ thanks, Shakeel
On Sat, May 18, 2024 at 3:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > [...] > > Hi Shakeel, > > > > Hopefully I'm not too late. We are currently using memcg v1. > > > > One specific feature we rely on in v1 is skmem accounting. In v1, we > > account for TCP memory usage without charging it to memcg v1, which is > > useful for monitoring the TCP memory usage generated by tasks running > > in a container. However, in memcg v2, monitoring TCP memory requires > > charging it to the container, which can easily cause OOM issues. It > > would be better if we could monitor skmem usage without charging it in > > the memcg v2, allowing us to account for it without the risk of > > triggering OOM conditions. > > > > Hi Yafang, > > No worries. From what I understand, you are not really using skmem > charging of v1 but just the network memory usage stats and you are > worried that charging network memory to cgroup memory may cause OOMs. Is > that correct? Correct. > Have you tried charging network memory to cgroup memory > before and saw OOMs? If yes then I would really like to see OOM reports. No, we don't enable the charging for TCP memory in memcg v1 and we don't have a plan to add support for it currently. > > I have two examples where the v2's skmem charging is working fine in > production namely Google and Meta. Google is still on v1 but for skmem > charging, they have moved to v2 semantics. Actually I have another > report from Cloudflare [0] where the tcp throttling mechanism for v2's > tcp memory accounting is too much conservative for their production > traffic. > > Anyways this just means that we need a more flexible way to provide > and enforce semantics for tcp memory pressure with a decent default > behavior. I will followup on this separately. > > [0] https://lore.kernel.org/lkml/CABWYdi0G7cyNFbndM-ELTDAR3x4Ngm0AehEp5aP0tfNkXUE+Uw@mail.gmail.com/ Thanks for your explanation.
On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > Cgroups v2 have been around for a while and many users have fully adopted them, > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > for the cgroup v1 support anyway: > > 1) the kernel binary contains useless cgroup v1 code, > > 2) some common structures like task_struct and mem_cgroup have never used > > cgroup v1-specific members, > > 3) some code paths have additional checks which are not needed. > > > > Cgroup v1's memory controller has a number of features that are not supported > > by cgroup v2 and their implementation is pretty much self contained. > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > complicated event notification system, charge migration. > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > intervened with generic and cgroup v2-specific code. It's a burden on > > developers and maintainers. > > > > This patchset aims to solve these problems by: > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > mm/internal.h header > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > 5) putting unused struct memory_cgroup and task_struct members under > > CONFIG_MEMCG_V1 as well. > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > to discuss and agree on the overall approach. > > > > Some open questions, opinions are appreciated: > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > mem_cgroup_v1_ prefix. Is this a good idea? > > 2) Do we want to extend it beyond the memory controller? Should > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. > > Anyone else still on memcg-v1, please do provide your input. > Hi, Sorry for joining the discussion late, but I'd like to add some info here: We are using the "memsw" feature a lot. 
It's a very useful knob for container memory overcommitting: It's a great abstraction of the "expected total memory usage" of a container, so containers can't allocate too much memory using SWAP, but can still be swapped out. For a simple example, with memsw.limit == memory.limit, containers can't exceed their original memory limit, even with SWAP enabled, they get OOM killed as they used to, but the host is now able to offload cold pages. A similar ability seems absent with V2: with memory.swap.max == 0, the host can't use SWAP to reclaim container memory at all. But with a value larger than that, containers are able to overuse memory, causing delayed OOM kills, thrashing, and a CPU/memory usage ratio that can be heavily out of balance, especially with compressed SWAP backends. Cgroup accounting of ZSWAP/ZRAM doesn't really help; we want to account for the total raw usage, not the compressed usage. One example is that if a container uses tons of duplicated pages, then it can allocate much more memory than its limit, which could cause trouble. I saw Chris also mentioned Google has a workaround internally for it for Cgroup V2. This will be a blocker for us and a similar workaround might be needed. It would be great to see upstream support for this.
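To make the difference concrete, here is a conceptual (non-kernel) sketch of the two limit models being contrasted; with memsw.limit == memory.limit, the v1 check keeps the total RAM+swap footprint within the original memory budget, while v2 caps RAM and swap independently:

```c
#include <stdbool.h>

/* cgroup v1 memsw: RAM and swap draw from one budget, so swapping a page
 * out does not create headroom for new allocations. */
bool v1_memsw_can_charge(long mem_usage, long swap_usage, long memsw_limit)
{
	return mem_usage + swap_usage <= memsw_limit;
}

/* cgroup v2: memory.max and memory.swap.max are enforced separately, so a
 * workload's total footprint can grow up to memory.max + memory.swap.max. */
bool v2_can_charge(long usage, long limit)
{
	return usage <= limit;	/* applied to RAM and swap independently */
}
```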
On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote: > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > for the cgroup v1 support anyway: > > > 1) the kernel binary contains useless cgroup v1 code, > > > 2) some common structures like task_struct and mem_cgroup have never used > > > cgroup v1-specific members, > > > 3) some code paths have additional checks which are not needed. > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > by cgroup v2 and their implementation is pretty much self contained. > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > complicated event notification system, charge migration. > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > developers and maintainers. > > > > > > This patchset aims to solve these problems by: > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > mm/internal.h header > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > 5) putting unused struct memory_cgroup and task_struct members under > > > CONFIG_MEMCG_V1 as well. > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > to discuss and agree on the overall approach. > > > > > > Some open questions, opinions are appreciated: > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > 2) Do we want to extend it beyond the memory controller? Should > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > More specifically: > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > structual restrictions) > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > IMO we should systematically start deprecating memcg-v1 features and > > start unblocking the users stuck on memcg-v1. > > > > Now regarding the proposal in this series, I think it can be a first > > step but should not give an impression that we are done. The only > > concern I have is the potential of "out of sight, out of mind" situation > > with this change but if we keep the momentum of deprecation of memcg-v1 > > it should be fine. > > > > I have CCed Greg and David from Google to get their opinion on what > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > have concern in deprecation of memcg-v1 features. > > > > Anyone else still on memcg-v1, please do provide your input. 
> > > > Hi, > > Sorry for joining the discussion late, but I'd like to add some info > here: We are using the "memsw" feature a lot. It's a very useful knob > for container memory overcommitting: It's a great abstraction of the > "expected total memory usage" of a container, so containers can't > allocate too much memory using SWAP, but still be able to SWAP out. > > For a simple example, with memsw.limit == memory.limit, containers > can't exceed their original memory limit, even with SWAP enabled, they > get OOM killed as how they used to, but the host is now able to > offload cold pages. > > Similar ability seems absent with V2: With memory.swap.max == 0, the > host can't use SWAP to reclaim container memory at all. But with a > value larger than that, containers are able to overuse memory, causing > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily > out of balance, especially with compress SWAP backends. > > Cgroup accounting of ZSWAP/ZRAM doesn't really help, we want to > account for the total raw usage, not the compressed usage. One example > is that if a container uses tons of duplicated pages, then it can > allocate much more memory than it is limited, that could cause > trouble. So you don't need separate swap knobs, only combined, right? > I saw Chris also mentioned Google has a workaround internally for it > for Cgroup V2. This will be a blocker for us and a similar workaround > might be needed. It will be great so see an upstream support for this. I think that _at least_ we should refactor the code so that it would be a minimal patch (e.g. one #define) to switch to the old mode. I don't think it's reasonable to add a new interface, but having a patch/config option or even a mount option which changes the semantics of memory.swap.max to the v1-like behavior should be ok. I'll try to do the first part (refactoring this code), and we can have a discussion from there. Thanks!
On Thu, May 23, 2024 at 12:56 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote: > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > > for the cgroup v1 support anyway: > > > > 1) the kernel binary contains useless cgroup v1 code, > > > > 2) some common structures like task_struct and mem_cgroup have never used > > > > cgroup v1-specific members, > > > > 3) some code paths have additional checks which are not needed. > > > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > > by cgroup v2 and their implementation is pretty much self contained. > > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > > complicated event notification system, charge migration. > > > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > > developers and maintainers. > > > > > > > > This patchset aims to solve these problems by: > > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > > mm/internal.h header > > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > > 5) putting unused struct memory_cgroup and task_struct members under > > > > CONFIG_MEMCG_V1 as well. > > > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > > to discuss and agree on the overall approach. > > > > > > > > Some open questions, opinions are appreciated: > > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > > 2) Do we want to extend it beyond the memory controller? Should > > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > > > > Hi Roman, > > > > > > A very timely and important topic and we should definitely talk about it > > > during LSFMM as well. I have been thinking about this problem for quite > > > sometime and I am getting more and more convinced that we should aim to > > > completely deprecate memcg-v1. > > > > > > More specifically: > > > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > > structual restrictions) > > > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > > > IMO we should systematically start deprecating memcg-v1 features and > > > start unblocking the users stuck on memcg-v1. > > > > > > Now regarding the proposal in this series, I think it can be a first > > > step but should not give an impression that we are done. The only > > > concern I have is the potential of "out of sight, out of mind" situation > > > with this change but if we keep the momentum of deprecation of memcg-v1 > > > it should be fine. 
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > have concern in deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> >
> > Hi,
> >
> > Sorry for joining the discussion late, but I'd like to add some info
> > here: We are using the "memsw" feature a lot. It's a very useful knob
> > for container memory overcommitting: It's a great abstraction of the
> > "expected total memory usage" of a container, so containers can't
> > allocate too much memory using SWAP, but still be able to SWAP out.

+Michal,

Just FYI, we do have companies like Tencent using the V1 combined memsw
limit as well. Google is not the only company using this API.

> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> >
> > [...]
>
> So you don't need separate swap knobs, only combined, right?
>
> > I saw Chris also mentioned Google has a workaround internally for it
> > for Cgroup V2. This will be a blocker for us and a similar workaround
> > might be needed. It will be great so see an upstream support for this.
>
> I think that _at least_ we should refactor the code so that it would
> be a minimal patch (e.g. one #define) to switch to the old mode.

It would be great to have a path forward that allows cgroup V2 to provide
the combined memsw limit.

> I don't think it's reasonable to add a new interface, but having a
> patch/config option or even a mount option which changes the semantics
> of memory.swap.max to the v1-like behavior should be ok.

Using a sysctl or a slightly different cgroup API is fine. The feature
just needs to be there.

> I'll try to do the first part (refactoring this code), and we can have
> a discussion from there.

Looking forward to it.

Thanks

Chris
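To put illustrative numbers on the difference (not figures from the thread): with v1, memory.limit_in_bytes = 1G and memory.memsw.limit_in_bytes = 1G cap the workload's RAM plus swap at 1G total, so swapping a page out never buys the container extra headroom beyond its original budget. With v2, memory.max = 1G and memory.swap.max = 1G are two independent caps, so the same container can reach roughly 2G of combined footprint (1G resident plus 1G swapped) before anything stops it, which is the delayed-OOM and thrashing behavior described above, while memory.swap.max = 0 removes swap as a reclaim option for the host altogether.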
On Fri, May 24, 2024 at 3:55 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote:
> > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > [...]
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > have concern in deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> >
> > Hi,
> >
> > Sorry for joining the discussion late, but I'd like to add some info
> > here: We are using the "memsw" feature a lot. It's a very useful knob
> > for container memory overcommitting: It's a great abstraction of the
> > "expected total memory usage" of a container, so containers can't
> > allocate too much memory using SWAP, but still be able to SWAP out.
> >
> > [...]
>
> So you don't need separate swap knobs, only combined, right?

Yes, currently we use either combined or separate knobs.

> > I saw Chris also mentioned Google has a workaround internally for it
> > for Cgroup V2. This will be a blocker for us and a similar workaround
> > might be needed. It will be great so see an upstream support for this.
>
> I think that _at least_ we should refactor the code so that it would
> be a minimal patch (e.g. one #define) to switch to the old mode.
>
> I don't think it's reasonable to add a new interface, but having a
> patch/config option or even a mount option which changes the semantics
> of memory.swap.max to the v1-like behavior should be ok.
>
> I'll try to do the first part (refactoring this code), and we can have
> a discussion from there.

Thanks, that sounds like a good start.
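A rough sketch of one possible shape of the "one #define to switch modes" idea discussed above. The option and helper names are hypothetical and not part of the series; it only assumes the existing v2 'swap' and v1 'memsw' page counters in struct mem_cgroup:

/*
 * Hypothetical illustration only. If the refactored charging paths always
 * pick the swap counter through one macro, flipping between v2-style
 * separate swap accounting and v1-style combined memory+swap accounting
 * becomes a single-line (or single-Kconfig) change.
 */
#ifdef CONFIG_MEMCG_COMBINED_SWAP_LIMIT                 /* hypothetical option */
#define memcg_swap_counter(memcg)       (&(memcg)->memsw)   /* count swap against memory+swap */
#else
#define memcg_swap_counter(memcg)       (&(memcg)->swap)    /* default v2: a separate swap cap */
#endif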
Cgroups v2 have been around for a while and many users have fully adopted them,
so they never use cgroup v1 features and functionality. Yet they have to "pay"
for the cgroup v1 support anyway:
1) the kernel binary contains useless cgroup v1 code,
2) some common structures like task_struct and mem_cgroup contain cgroup
   v1-specific members which are never used,
3) some code paths have additional checks which are not needed.

Cgroup v1's memory controller has a number of features that are not supported
by cgroup v2, and their implementation is pretty much self-contained.
Most notably, these features are: soft limit reclaim, oom handling in userspace,
a complicated event notification system, and charge migration.

Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
intertwined with generic and cgroup v2-specific code. It's a burden on
developers and maintainers.

This patchset aims to solve these problems by:
1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
   mm/internal.h header,
3) introducing the CONFIG_MEMCG_V1 config option, turned on by default,
4) making memcontrol-v1.c compile only if CONFIG_MEMCG_V1 is set,
5) putting unused struct mem_cgroup and task_struct members under
   CONFIG_MEMCG_V1 as well.

This is an RFC version, which is not 100% polished yet, but it would be great
to discuss and agree on the overall approach.

Some open questions, opinions are appreciated:
1) I'm considering renaming non-static functions in memcontrol-v1.c to have
   a mem_cgroup_v1_ prefix. Is this a good idea?
2) Do we want to extend it beyond the memory controller? Should
3) Is it better to use a new include/linux/memcontrol-v1.h instead of
   mm/internal.h? Or mm/memcontrol-v1.h?

diffstat:

 include/linux/memcontrol.h |  165 ++++---
 include/linux/sched.h      |    5 +-
 init/Kconfig               |    7 +
 mm/Makefile                |    2 +
 mm/internal.h              |  124 +++++
 mm/memcontrol-v1.c         | 2941 +++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c            | 4121 ++++++++----------------------------------
 7 files changed, 3765 insertions(+), 3600 deletions(-)

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Roman Gushchin (9):
  mm: memcg: introduce memcontrol-v1.c
  mm: memcg: move soft limit reclaim code to memcontrol-v1.c
  mm: memcg: move charge migration code to memcontrol-v1.c
  mm: memcg: move legacy memcg event code into memcontrol-v1.c
  mm: memcg: move cgroup v1 interface files to memcontrol-v1.c
  mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c
  mm: memcg: put cgroup v1-specific code under a config option
  mm: memcg: put corresponding struct mem_cgroup members under CONFIG_MEMCG_V1
  mm: memcg: put cgroup v1-related members of task_struct under config option

 include/linux/memcontrol.h |  165 +-
 include/linux/sched.h      |    5 +-
 init/Kconfig               |    7 +
 mm/Makefile                |    2 +
 mm/internal.h              |  124 ++
 mm/memcontrol-v1.c         | 2941 +++++++++++++++++++++++++
 mm/memcontrol.c            | 4121 ++++++------------------------------
 7 files changed, 3765 insertions(+), 3600 deletions(-)
 create mode 100644 mm/memcontrol-v1.c
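As a standalone illustration of point 5) above, compiling v1-only members out under the new option is what shrinks the shared structures. The member names below are hypothetical stand-ins, not the actual fields touched by the patches:

/*
 * Sketch of the pattern only: when CONFIG_MEMCG_V1 is off, the legacy-only
 * members simply disappear from the struct and the binary.
 */
#include <stdio.h>

#define CONFIG_MEMCG_V1 1       /* pretend the Kconfig option is enabled */

struct mem_cgroup_example {
        unsigned long memory_usage;     /* needed by both v1 and v2 */
#ifdef CONFIG_MEMCG_V1
        unsigned long memsw_usage;      /* v1-only: combined memory+swap usage */
        unsigned long soft_limit;       /* v1-only: soft limit reclaim target */
#endif
};

int main(void)
{
        /* toggling the option above changes the struct size */
        printf("sizeof(struct mem_cgroup_example) = %zu\n",
               sizeof(struct mem_cgroup_example));
        return 0;
}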