Message ID: 20240509034138.2207186-1-roman.gushchin@linux.dev (mailing list archive)
Series: mm: memcg: separate legacy cgroup v1 code and put under config option
On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > Cgroups v2 have been around for a while and many users have fully adopted them, > so they never use cgroups v1 features and functionality. Yet they have to "pay" > for the cgroup v1 support anyway: > 1) the kernel binary contains useless cgroup v1 code, > 2) some common structures like task_struct and mem_cgroup have never used > cgroup v1-specific members, > 3) some code paths have additional checks which are not needed. > > Cgroup v1's memory controller has a number of features that are not supported > by cgroup v2 and their implementation is pretty much self contained. > Most notably, these features are: soft limit reclaim, oom handling in userspace, > complicated event notification system, charge migration. > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > intervened with generic and cgroup v2-specific code. It's a burden on > developers and maintainers. > > This patchset aims to solve these problems by: > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > mm/internal.h header > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > 5) putting unused struct memory_cgroup and task_struct members under > CONFIG_MEMCG_V1 as well. > > This is an RFC version, which is not 100% polished yet, so but it would be great > to discuss and agree on the overall approach. > > Some open questions, opinions are appreciated: > 1) I consider renaming non-static functions in memcontrol-v1.c to have > mem_cgroup_v1_ prefix. Is this a good idea? > 2) Do we want to extend it beyond the memory controller? Should > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > mm/internal.h? Or mm/memcontrol-v1.h. > Hi Roman, A very timely and important topic and we should definitely talk about it during LSFMM as well. I have been thinking about this problem for quite some time and I am getting more and more convinced that we should aim to completely deprecate memcg-v1. More specifically: 1. What are the memcg-v1 features which have no alternative in memcg-v2 and are blockers for memcg-v1 users? (setting aside the cgroup v2 structural restrictions) 2. What are unused memcg-v1 features which we should start deprecating? IMO we should systematically start deprecating memcg-v1 features and start unblocking the users stuck on memcg-v1. Now regarding the proposal in this series, I think it can be a first step but should not give an impression that we are done. The only concern I have is the potential of an "out of sight, out of mind" situation with this change but if we keep the momentum of deprecation of memcg-v1 it should be fine. I have CCed Greg and David from Google to get their opinion on what memcg-v1 features are blockers for their memcg-v2 migration and if they have concerns about the deprecation of memcg-v1 features. Anyone else still on memcg-v1, please do provide your input. thanks, Shakeel
On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > Cgroups v2 have been around for a while and many users have fully adopted them, > so they never use cgroups v1 features and functionality. Yet they have to "pay" > for the cgroup v1 support anyway: > 1) the kernel binary contains useless cgroup v1 code, > 2) some common structures like task_struct and mem_cgroup have never used > cgroup v1-specific members, > 3) some code paths have additional checks which are not needed. > > Cgroup v1's memory controller has a number of features that are not supported > by cgroup v2 and their implementation is pretty much self contained. > Most notably, these features are: soft limit reclaim, oom handling in userspace, > complicated event notification system, charge migration. > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > intervened with generic and cgroup v2-specific code. It's a burden on > developers and maintainers. Great patchset. The moves look clean and straight-forward to me on first glance. > This patchset aims to solve these problems by: > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, +1 > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > mm/internal.h header You proposed mm/memcontrol-v1.h below, IMO that's the best option. > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default +1 CONFIG_MEMCG1 should also work. > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set +1 > 5) putting unused struct memory_cgroup and task_struct members under > CONFIG_MEMCG_V1 as well. +1 > > This is an RFC version, which is not 100% polished yet, so but it would be great > to discuss and agree on the overall approach. > > Some open questions, opinions are appreciated: > 1) I consider renaming non-static functions in memcontrol-v1.c to have > mem_cgroup_v1_ prefix. Is this a good idea? I think this would be great, to make it more obvious in memcontrol.c. For core cgroup code, we used cgroup1_foo(). Maybe name them all things like memcg1_update_tree() etc.? That's short and sweet while sticking out visually pretty well. > 2) Do we want to extend it beyond the memory controller? Should Could you please elaborate? ^_^ > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > mm/internal.h? Or mm/memcontrol-v1.h. mm/memcontrol-v1.h sounds good to me. > mm/memcontrol.c | 4121 ++++++++++++++++++++++--------------------------------------------------------------------------------------------------------------------------------- Lol, awesome.
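As a rough illustration of points 3)-5) of the cover letter and the memcg1_ naming idea discussed above (the member and function names below are placeholders, not taken from the actual patches), v1-only state and entry points would simply compile out when CONFIG_MEMCG_V1 is disabled:

```c
/*
 * Sketch only -- names are illustrative, not from the patchset.
 * v1-only members and helpers disappear when CONFIG_MEMCG_V1 is not
 * set; callers in memcontrol.c keep compiling thanks to empty stubs.
 */
struct mem_cgroup {
	/* ... members shared by cgroup v1 and v2 ... */
#ifdef CONFIG_MEMCG_V1
	unsigned long soft_limit;	/* v1 soft limit reclaim state */
#endif
};

#ifdef CONFIG_MEMCG_V1
void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
#else
static inline void memcg1_update_tree(struct mem_cgroup *memcg, int nid) { }
#endif
```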
On Thu, May 09, 2024 at 10:22:10AM -0400, Johannes Weiner wrote: > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > mm/memcontrol-v1.h sounds good to me. Argh, there is a folio_memcg_lock() callsite in fs/buffer.c. I suppose include/linux/memcontrol-v1.h makes the most sense then.
On Thu, May 09, 2024 at 10:36:35AM -0400, Johannes Weiner wrote: > On Thu, May 09, 2024 at 10:22:10AM -0400, Johannes Weiner wrote: > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > mm/memcontrol-v1.h sounds good to me. > > Argh, there is a folio_memcg_lock() callsite in fs/buffer.c. I suppose > include/linux/memcontrol-v1.h makes the most sense then. You mean put everything into include/linux/memcontrol-v1.h? And functions from memcontrol.c used by memcontrol-v1.c into include/linux/memcontrol.h? It's an option I considered, but the downside is that we're "leaking" a lot of internal definitions into the outside world, because memcontrol.h is included everywhere. So maybe mm/memcontrol-v1.h for definitions shared between v1 and v2 and keep exported functions in include/linux/memcontrol.h? There are only a few of them. Thanks!
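A possible split along those lines might look like the sketch below (declarations are illustrative, not the actual patch): definitions shared only between memcontrol.c and memcontrol-v1.c stay in an mm/-local header, while the handful of symbols needed outside mm/, such as folio_memcg_lock() used from fs/buffer.c, remain in include/linux/memcontrol.h:

```c
/* mm/memcontrol-v1.h -- sketch: visible only to memcontrol.c and
 * memcontrol-v1.c, so v1 internals don't leak into the rest of the
 * kernel. The helper name is a placeholder. */
#ifndef __MM_MEMCONTROL_V1_H
#define __MM_MEMCONTROL_V1_H

struct mem_cgroup;

/* called from the generic charge path when v1 features are in use */
void memcg1_check_events(struct mem_cgroup *memcg, int nid);

#endif /* __MM_MEMCONTROL_V1_H */

/* include/linux/memcontrol.h keeps only what external callers need, e.g.: */
struct folio;
void folio_memcg_lock(struct folio *folio);
void folio_memcg_unlock(struct folio *folio);
```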
On Wed, May 08, 2024 at 11:33:07PM -0700, Shakeel Butt wrote: > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > Cgroups v2 have been around for a while and many users have fully adopted them, > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > for the cgroup v1 support anyway: > > 1) the kernel binary contains useless cgroup v1 code, > > 2) some common structures like task_struct and mem_cgroup have never used > > cgroup v1-specific members, > > 3) some code paths have additional checks which are not needed. > > > > Cgroup v1's memory controller has a number of features that are not supported > > by cgroup v2 and their implementation is pretty much self contained. > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > complicated event notification system, charge migration. > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > intervened with generic and cgroup v2-specific code. It's a burden on > > developers and maintainers. > > > > This patchset aims to solve these problems by: > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > mm/internal.h header > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > 5) putting unused struct memory_cgroup and task_struct members under > > CONFIG_MEMCG_V1 as well. > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > to discuss and agree on the overall approach. > > > > Some open questions, opinions are appreciated: > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > mem_cgroup_v1_ prefix. Is this a good idea? > > 2) Do we want to extend it beyond the memory controller? Should > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) I don't think there are any, except there might be a certain cost to migrate, so some companies might be resistant to put in resources, because they don't see any immediate benefits as well. > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. I'm not sure we want to deprecate them one-by-one - it's a lot of work and maybe we can deprecate it all together instead. I think the only feature which we might want to deprecate separately - it's the charge migration. It's the most annoying feature as it requires a lot more synchronization, which can be dropped otherwise, so it's complicating a lot of things. Other features are more or less self-contained. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. Yeah, it's really only a first step. 
> The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. My rough plan here: 1) move it out to a separate file and put under a config option, default on 2) clean up all remaining small bits here and there ... < wait a year > 3) flip the config option to be off by default ... < wait another year or two > 4) drop the code entirely > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. Thank you!
On Wed, 8 May 2024, Shakeel Butt wrote: > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > I think this would be a very worthwhile discussion at LSF/MM; I'm not sure if it would be too late for someone to make a formal proposal for it to be included in the schedule. Michal would know if there is an opportunity. I say that in light of https://lore.kernel.org/bpf/ZjL5b-zipMrV2JSg@archie.me/T/#mb6c21b09543c434dd85e718a8ecf5ca6485e6d07 as well for the whole cgroup v1 -> v2 transition. Chris, now cc'd, would know best about all of the dependencies that Google has for memcg specifically. > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. > > Anyone else still on memcg-v1, please do provide your input. > > thanks, > Shakeel >
On Thu, May 9, 2024 at 7:59 PM David Rientjes <rientjes@google.com> wrote: > > On Wed, 8 May 2024, Shakeel Butt wrote: > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > I think this would be a very worthwhile discussion at LSF/MM, I'm not sure > if it would be too late for someone to make a formal proposal for it to be > included in the schedule. Michal would know if there is a opportunity. > > I say that in light of > https://lore.kernel.org/bpf/ZjL5b-zipMrV2JSg@archie.me/T/#mb6c21b09543c434dd85e718a8ecf5ca6485e6d07 > as well for the whole cgroup v1 -> v2 transition. > > Chris, now cc'd, would know best about all of the dependencies that Google > has for memcg specifically. Thanks David, Yes, I am very interested in that cgroup v1 -> v2 transition discussion. > > > More specifically: > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > structual restrictions) In the list mentioned by Roman: "soft limit reclaim, oom handling in userspace, complicated event notification system, charge migration." The "oom.control" and the lack of user space oom control is a big one for Google. Some test frameworks also use "memory.force_empty". Soft limit reclaim and charge migration are also used. There is also the combined "memsw" limit enforcement. Google has some internal workaround for V2 but it would be good if upstream could support it directly. BTW, I know you are not looking for the "cgroup v2 structural restrictions". Two cgroup controllers that can't have different sets of processes is a bit too restrictive. That is what I recall right now, I might be missing some small odd items. Anyway, glad to join the discussion if there is a session. Chris > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > IMO we should systematically start deprecating memcg-v1 features and > > start unblocking the users stuck on memcg-v1. > > > > Now regarding the proposal in this series, I think it can be a first > > step but should not give an impression that we are done. The only > > concern I have is the potential of "out of sight, out of mind" situation > > with this change but if we keep the momentum of deprecation of memcg-v1 > > it should be fine. > > > > I have CCed Greg and David from Google to get their opinion on what > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > have concern in deprecation of memcg-v1 features. > > > > Anyone else still on memcg-v1, please do provide your input. > > > > thanks, > > Shakeel > >
On Thu 09-05-24 19:59:19, David Rientjes wrote: > On Wed, 8 May 2024, Shakeel Butt wrote: > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > I think this would be a very worthwhile discussion at LSF/MM, I'm not sure > if it would be too late for someone to make a formal proposal for it to be > included in the schedule. Michal would know if there is a opportunity. yes, I think we can and should have this discussion. I will put that on the schedule. I will reference this email thread as a topic proposal with Shakeel and Roman to lead the session.
On Wed 08-05-24 20:41:29, Roman Gushchin wrote: > Cgroups v2 have been around for a while and many users have fully adopted them, > so they never use cgroups v1 features and functionality. Yet they have to "pay" > for the cgroup v1 support anyway: > 1) the kernel binary contains useless cgroup v1 code, > 2) some common structures like task_struct and mem_cgroup have never used > cgroup v1-specific members, > 3) some code paths have additional checks which are not needed. > > Cgroup v1's memory controller has a number of features that are not supported > by cgroup v2 and their implementation is pretty much self contained. > Most notably, these features are: soft limit reclaim, oom handling in userspace, > complicated event notification system, charge migration. > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > intervened with generic and cgroup v2-specific code. It's a burden on > developers and maintainers. > > This patchset aims to solve these problems by: > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > mm/internal.h header > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > 5) putting unused struct memory_cgroup and task_struct members under > CONFIG_MEMCG_V1 as well. This makes sense and I have to admit I didn't realize there was so much code to move. It will make the code base much easier to follow. I do not think we can drop that code anytime soon as there is still quite a lot of use of v1 out there. From my experience there is no good reason for many of them other than inertia and those are just waiting for somebody to move them to v2. There are some workloads which depend on v1-only features and we should discuss what to do about those.
On Thu, May 09, 2024 at 07:57:30AM -0700, Roman Gushchin wrote: > So maybe mm/memcontrol-v1.h for definitions shared between v1 and v2 > and keep exported functions in include/linux/memcontrol.h? There are > only few of them. That sounds best to me.
On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > Cgroups v2 have been around for a while and many users have fully adopted them, > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > for the cgroup v1 support anyway: > > 1) the kernel binary contains useless cgroup v1 code, > > 2) some common structures like task_struct and mem_cgroup have never used > > cgroup v1-specific members, > > 3) some code paths have additional checks which are not needed. > > > > Cgroup v1's memory controller has a number of features that are not supported > > by cgroup v2 and their implementation is pretty much self contained. > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > complicated event notification system, charge migration. > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > intervened with generic and cgroup v2-specific code. It's a burden on > > developers and maintainers. > > > > This patchset aims to solve these problems by: > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > mm/internal.h header > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > 5) putting unused struct memory_cgroup and task_struct members under > > CONFIG_MEMCG_V1 as well. > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > to discuss and agree on the overall approach. > > > > Some open questions, opinions are appreciated: > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > mem_cgroup_v1_ prefix. Is this a good idea? > > 2) Do we want to extend it beyond the memory controller? Should > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. > > Anyone else still on memcg-v1, please do provide your input. Hi Shakeel, Hopefully I'm not too late. We are currently using memcg v1. One specific feature we rely on in v1 is skmem accounting. 
In v1, we account for TCP memory usage without charging it to memcg v1, which is useful for monitoring the TCP memory usage generated by tasks running in a container. However, in memcg v2, monitoring TCP memory requires charging it to the container, which can easily cause OOM issues. It would be better if we could monitor skmem usage without charging it in memcg v2, allowing us to account for it without the risk of triggering OOM conditions.
On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > for the cgroup v1 support anyway: > > > 1) the kernel binary contains useless cgroup v1 code, > > > 2) some common structures like task_struct and mem_cgroup have never used > > > cgroup v1-specific members, > > > 3) some code paths have additional checks which are not needed. > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > by cgroup v2 and their implementation is pretty much self contained. > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > complicated event notification system, charge migration. > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > developers and maintainers. > > > > > > This patchset aims to solve these problems by: > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > mm/internal.h header > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > 5) putting unused struct memory_cgroup and task_struct members under > > > CONFIG_MEMCG_V1 as well. > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > to discuss and agree on the overall approach. > > > > > > Some open questions, opinions are appreciated: > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > 2) Do we want to extend it beyond the memory controller? Should > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > More specifically: > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > structual restrictions) > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > IMO we should systematically start deprecating memcg-v1 features and > > start unblocking the users stuck on memcg-v1. > > > > Now regarding the proposal in this series, I think it can be a first > > step but should not give an impression that we are done. The only > > concern I have is the potential of "out of sight, out of mind" situation > > with this change but if we keep the momentum of deprecation of memcg-v1 > > it should be fine. > > > > I have CCed Greg and David from Google to get their opinion on what > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > have concern in deprecation of memcg-v1 features. > > > > Anyone else still on memcg-v1, please do provide your input. 
> > Hi Shakeel, > > Hopefully I'm not too late. We are currently using memcg v1. > > One specific feature we rely on in v1 is skmem accounting. In v1, we > account for TCP memory usage without charging it to memcg v1, which is > useful for monitoring the TCP memory usage generated by tasks running > in a container. However, in memcg v2, monitoring TCP memory requires > charging it to the container, which can easily cause OOM issues. It > would be better if we could monitor skmem usage without charging it in > the memcg v2, allowing us to account for it without the risk of > triggering OOM conditions. Hi Yafang, the data itself is available on cgroup v2 in memory.stat:sock, however you're right, it's charged on par with other types of memory. It was one of the main principles of cgroup v2's memory controller, so I don't think it can be changed. So the feature you need is not skmem accounting, but something quite the opposite :) The question I have: what makes socket memory different here? Is it something specific to your setup (e.g. you mostly use memory.max to protect against memory leaks in the userspace code, but socket memory spikes are always caused by external traffic and are legit) or do we have more fundamental problems with the socket memory handling, e.g. we can't effectively reclaim it under memory pressure? In the first case you can maintain a ~2-line non-upstream patch which will disable the charging while maintaining statistics - it's not perfect, but likely the best option here. In the second case we need to collectively fix it for cgroup v2. Thanks!
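For reference, the kind of minimal out-of-tree tweak being described might look roughly like the sketch below, assuming the v2 socket charge path still goes through mem_cgroup_charge_skmem() and the MEMCG_SOCK counter behind memory.stat:sock; the uncharge side would need the matching change, and this is illustrative, not upstream code:

```c
/* Sketch of a "keep the stat, drop the charge" variant; not upstream code. */
bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages,
			     gfp_t gfp_mask)
{
	/* keep memory.stat:sock accurate for monitoring */
	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
	/* never count socket memory against memory.max, so no OOM/throttling */
	return true;
}
```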
On Fri, May 17, 2024 at 1:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > > for the cgroup v1 support anyway: > > > > 1) the kernel binary contains useless cgroup v1 code, > > > > 2) some common structures like task_struct and mem_cgroup have never used > > > > cgroup v1-specific members, > > > > 3) some code paths have additional checks which are not needed. > > > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > > by cgroup v2 and their implementation is pretty much self contained. > > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > > complicated event notification system, charge migration. > > > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > > developers and maintainers. > > > > > > > > This patchset aims to solve these problems by: > > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > > mm/internal.h header > > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > > 5) putting unused struct memory_cgroup and task_struct members under > > > > CONFIG_MEMCG_V1 as well. > > > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > > to discuss and agree on the overall approach. > > > > > > > > Some open questions, opinions are appreciated: > > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > > 2) Do we want to extend it beyond the memory controller? Should > > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > > > > Hi Roman, > > > > > > A very timely and important topic and we should definitely talk about it > > > during LSFMM as well. I have been thinking about this problem for quite > > > sometime and I am getting more and more convinced that we should aim to > > > completely deprecate memcg-v1. > > > > > > More specifically: > > > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > > structual restrictions) > > > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > > > IMO we should systematically start deprecating memcg-v1 features and > > > start unblocking the users stuck on memcg-v1. > > > > > > Now regarding the proposal in this series, I think it can be a first > > > step but should not give an impression that we are done. The only > > > concern I have is the potential of "out of sight, out of mind" situation > > > with this change but if we keep the momentum of deprecation of memcg-v1 > > > it should be fine. 
> > > > > > I have CCed Greg and David from Google to get their opinion on what > > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > > have concern in deprecation of memcg-v1 features. > > > > > > Anyone else still on memcg-v1, please do provide your input. > > > > Hi Shakeel, > > > > Hopefully I'm not too late. We are currently using memcg v1. > > > > One specific feature we rely on in v1 is skmem accounting. In v1, we > > account for TCP memory usage without charging it to memcg v1, which is > > useful for monitoring the TCP memory usage generated by tasks running > > in a container. However, in memcg v2, monitoring TCP memory requires > > charging it to the container, which can easily cause OOM issues. It > > would be better if we could monitor skmem usage without charging it in > > the memcg v2, allowing us to account for it without the risk of > > triggering OOM conditions. > > Hi Yafang, > > the data itself is available on cgroup v2 in memory.stat:sock, however > you're right, it's charged on pair with other types of memory. It was > one of the main principles of cgroup v2's memory controller, so I don't > think it can be changed. > > So the feature you need is not skmem accounting, but something quite > opposite :) > > The question I have here: what makes socket memory different here? > > Is it something specific to your setup (e.g. you mostly use memory.max > to protect against memory leaks in the userspace code, but socket memory > spikes are always caused by external traffic and are legit) or we have > more fundamental problems with the socket memory handling, e.g. we can't > effectively reclaim it under the memory pressure? It is the first case. > > In the first case you can maintain a ~2-lines non-upstream patch which will > disable the charging while maintaining statistics - it's not a perfect, but > likely the best option here. In the second case we need collectively fix it > for cgroup v2. > Thank you for your advice. Currently, we do not have any immediate plans to migrate to cgroup v2. If we are required to use cgroup v2 in the future, we will need to maintain non-upstream patches. By the way, is there any reason we cannot keep this behavior consistent with memcg v1 in the upstream kernel? That would save us from having to maintain it locally.
On Fri, May 17, 2024 at 10:21:01AM +0800, Yafang Shao wrote: > On Fri, May 17, 2024 at 1:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > > > for the cgroup v1 support anyway: > > > > > 1) the kernel binary contains useless cgroup v1 code, > > > > > 2) some common structures like task_struct and mem_cgroup have never used > > > > > cgroup v1-specific members, > > > > > 3) some code paths have additional checks which are not needed. > > > > > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > > > by cgroup v2 and their implementation is pretty much self contained. > > > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > > > complicated event notification system, charge migration. > > > > > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > > > developers and maintainers. > > > > > > > > > > This patchset aims to solve these problems by: > > > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > > > mm/internal.h header > > > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > > > 5) putting unused struct memory_cgroup and task_struct members under > > > > > CONFIG_MEMCG_V1 as well. > > > > > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > > > to discuss and agree on the overall approach. > > > > > > > > > > Some open questions, opinions are appreciated: > > > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > > > 2) Do we want to extend it beyond the memory controller? Should > > > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > > > > > > > Hi Roman, > > > > > > > > A very timely and important topic and we should definitely talk about it > > > > during LSFMM as well. I have been thinking about this problem for quite > > > > sometime and I am getting more and more convinced that we should aim to > > > > completely deprecate memcg-v1. > > > > > > > > More specifically: > > > > > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > > > structual restrictions) > > > > > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > > > > > IMO we should systematically start deprecating memcg-v1 features and > > > > start unblocking the users stuck on memcg-v1. > > > > > > > > Now regarding the proposal in this series, I think it can be a first > > > > step but should not give an impression that we are done. 
The only > > > > concern I have is the potential of "out of sight, out of mind" situation > > > > with this change but if we keep the momentum of deprecation of memcg-v1 > > > > it should be fine. > > > > > > > > I have CCed Greg and David from Google to get their opinion on what > > > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > > > have concern in deprecation of memcg-v1 features. > > > > > > > > Anyone else still on memcg-v1, please do provide your input. > > > > > > Hi Shakeel, > > > > > > Hopefully I'm not too late. We are currently using memcg v1. > > > > > > One specific feature we rely on in v1 is skmem accounting. In v1, we > > > account for TCP memory usage without charging it to memcg v1, which is > > > useful for monitoring the TCP memory usage generated by tasks running > > > in a container. However, in memcg v2, monitoring TCP memory requires > > > charging it to the container, which can easily cause OOM issues. It > > > would be better if we could monitor skmem usage without charging it in > > > the memcg v2, allowing us to account for it without the risk of > > > triggering OOM conditions. > > > > Hi Yafang, > > > > the data itself is available on cgroup v2 in memory.stat:sock, however > > you're right, it's charged on pair with other types of memory. It was > > one of the main principles of cgroup v2's memory controller, so I don't > > think it can be changed. > > > > So the feature you need is not skmem accounting, but something quite > > opposite :) > > > > The question I have here: what makes socket memory different here? > > > > Is it something specific to your setup (e.g. you mostly use memory.max > > to protect against memory leaks in the userspace code, but socket memory > > spikes are always caused by external traffic and are legit) or we have > > more fundamental problems with the socket memory handling, e.g. we can't > > effectively reclaim it under the memory pressure? > > It is the first case. > > > > > In the first case you can maintain a ~2-lines non-upstream patch which will > > disable the charging while maintaining statistics - it's not a perfect, but > > likely the best option here. In the second case we need collectively fix it > > for cgroup v2. > > > > Thank you for your advice. Currently, we do not have any immediate > plans to migrate to cgroup v2. If we are required to use cgroup v2 in > the future, we will need to maintain non-upstream patches. > > By the way, is there any reason we cannot keep this behavior > consistent with memcg v1 in the upstream kernel? That would save us > from having to maintain it locally. The idea to handle various types of memory independently isn't working well for most users: it makes the configuration trickier and more fragile. It's also more expensive in terms of the accounting overhead. The tcpmem accounting is btw quite expensive by itself. So by switching to cgroup v2 you might see (depending on your traffic and cpu load) some nice performance benefits. Thanks!
On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > [...] > Hi Shakeel, > > Hopefully I'm not too late. We are currently using memcg v1. > > One specific feature we rely on in v1 is skmem accounting. In v1, we > account for TCP memory usage without charging it to memcg v1, which is > useful for monitoring the TCP memory usage generated by tasks running > in a container. However, in memcg v2, monitoring TCP memory requires > charging it to the container, which can easily cause OOM issues. It > would be better if we could monitor skmem usage without charging it in > the memcg v2, allowing us to account for it without the risk of > triggering OOM conditions. > Hi Yafang, No worries. From what I understand, you are not really using skmem charging of v1 but just the network memory usage stats and you are worried that charging network memory to cgroup memory may cause OOMs. Is that correct? Have you tried charging network memory to cgroup memory before and seen OOMs? If yes, then I would really like to see the OOM reports. I have two examples where v2's skmem charging is working fine in production, namely Google and Meta. Google is still on v1 but for skmem charging, they have moved to v2 semantics. Actually I have another report from Cloudflare [0] where the tcp throttling mechanism for v2's tcp memory accounting is too conservative for their production traffic. Anyways this just means that we need a more flexible way to provide and enforce semantics for tcp memory pressure with a decent default behavior. I will follow up on this separately. [0] https://lore.kernel.org/lkml/CABWYdi0G7cyNFbndM-ELTDAR3x4Ngm0AehEp5aP0tfNkXUE+Uw@mail.gmail.com/ thanks, Shakeel
On Sat, May 18, 2024 at 3:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, May 16, 2024 at 11:35:57AM +0800, Yafang Shao wrote: > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > [...] > > Hi Shakeel, > > > > Hopefully I'm not too late. We are currently using memcg v1. > > > > One specific feature we rely on in v1 is skmem accounting. In v1, we > > account for TCP memory usage without charging it to memcg v1, which is > > useful for monitoring the TCP memory usage generated by tasks running > > in a container. However, in memcg v2, monitoring TCP memory requires > > charging it to the container, which can easily cause OOM issues. It > > would be better if we could monitor skmem usage without charging it in > > the memcg v2, allowing us to account for it without the risk of > > triggering OOM conditions. > > > > Hi Yafang, > > No worries. From what I understand, you are not really using skmem > charging of v1 but just the network memory usage stats and you are > worried that charging network memory to cgroup memory may cause OOMs. Is > that correct? Correct. > Have you tried charging network memory to cgroup memory > before and saw OOMs? If yes then I would really like to see OOM reports. No, we don't enable the charging for TCP memory in memcg v1 and we don't have a plan to add support for it currently. > > I have two examples where the v2's skmem charging is working fine in > production namely Google and Meta. Google is still on v1 but for skmem > charging, they have moved to v2 semantics. Actually I have another > report from Cloudflare [0] where the tcp throttling mechanism for v2's > tcp memory accounting is too much conservative for their production > traffic. > > Anyways this just means that we need a more flexible way to provide > and enforce semantics for tcp memory pressure with a decent default > behavior. I will followup on this separately. > > [0] https://lore.kernel.org/lkml/CABWYdi0G7cyNFbndM-ELTDAR3x4Ngm0AehEp5aP0tfNkXUE+Uw@mail.gmail.com/ Thanks for your explanation.
On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > Cgroups v2 have been around for a while and many users have fully adopted them, > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > for the cgroup v1 support anyway: > > 1) the kernel binary contains useless cgroup v1 code, > > 2) some common structures like task_struct and mem_cgroup have never used > > cgroup v1-specific members, > > 3) some code paths have additional checks which are not needed. > > > > Cgroup v1's memory controller has a number of features that are not supported > > by cgroup v2 and their implementation is pretty much self contained. > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > complicated event notification system, charge migration. > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > intervened with generic and cgroup v2-specific code. It's a burden on > > developers and maintainers. > > > > This patchset aims to solve these problems by: > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > mm/internal.h header > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > 5) putting unused struct memory_cgroup and task_struct members under > > CONFIG_MEMCG_V1 as well. > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > to discuss and agree on the overall approach. > > > > Some open questions, opinions are appreciated: > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > mem_cgroup_v1_ prefix. Is this a good idea? > > 2) Do we want to extend it beyond the memory controller? Should > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > Hi Roman, > > A very timely and important topic and we should definitely talk about it > during LSFMM as well. I have been thinking about this problem for quite > sometime and I am getting more and more convinced that we should aim to > completely deprecate memcg-v1. > > More specifically: > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > structual restrictions) > > 2. What are unused memcg-v1 features which we should start deprecating? > > IMO we should systematically start deprecating memcg-v1 features and > start unblocking the users stuck on memcg-v1. > > Now regarding the proposal in this series, I think it can be a first > step but should not give an impression that we are done. The only > concern I have is the potential of "out of sight, out of mind" situation > with this change but if we keep the momentum of deprecation of memcg-v1 > it should be fine. > > I have CCed Greg and David from Google to get their opinion on what > memcg-v1 features are blocker for their memcg-v2 migration and if they > have concern in deprecation of memcg-v1 features. > > Anyone else still on memcg-v1, please do provide your input. > Hi, Sorry for joining the discussion late, but I'd like to add some info here: We are using the "memsw" feature a lot. 
It's a very useful knob for container memory overcommitting: It's a great abstraction of the "expected total memory usage" of a container, so containers can't allocate too much memory using SWAP, but can still be swapped out. For a simple example, with memsw.limit == memory.limit, containers can't exceed their original memory limit, even with SWAP enabled, they get OOM killed as they used to, but the host is now able to offload cold pages. A similar ability seems absent with V2: with memory.swap.max == 0, the host can't use SWAP to reclaim container memory at all. But with a value larger than that, containers are able to overuse memory, causing delayed OOM kills, thrashing, and a CPU/memory usage ratio that can be heavily out of balance, especially with compressed SWAP backends. Cgroup accounting of ZSWAP/ZRAM doesn't really help; we want to account for the total raw usage, not the compressed usage. One example is that if a container uses tons of duplicated pages, then it can allocate much more memory than its limit, which could cause trouble. I saw Chris also mentioned Google has a workaround internally for it for Cgroup V2. This will be a blocker for us and a similar workaround might be needed. It would be great to see upstream support for this.
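To make the difference concrete, here is a conceptual (non-kernel) sketch of the two limit models being contrasted; with memsw.limit == memory.limit, the v1 check keeps the total RAM+swap footprint within the original memory budget, while v2 caps RAM and swap independently:

```c
#include <stdbool.h>

/* cgroup v1 memsw: RAM and swap draw from one budget, so swapping a page
 * out does not create headroom for new allocations. */
bool v1_memsw_can_charge(long mem_usage, long swap_usage, long memsw_limit)
{
	return mem_usage + swap_usage <= memsw_limit;
}

/* cgroup v2: memory.max and memory.swap.max are enforced separately, so a
 * workload's total footprint can grow up to memory.max + memory.swap.max. */
bool v2_can_charge(long usage, long limit)
{
	return usage <= limit;	/* applied to RAM and swap independently */
}
```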
On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote: > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > for the cgroup v1 support anyway: > > > 1) the kernel binary contains useless cgroup v1 code, > > > 2) some common structures like task_struct and mem_cgroup have never used > > > cgroup v1-specific members, > > > 3) some code paths have additional checks which are not needed. > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > by cgroup v2 and their implementation is pretty much self contained. > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > complicated event notification system, charge migration. > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > developers and maintainers. > > > > > > This patchset aims to solve these problems by: > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > mm/internal.h header > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > 5) putting unused struct memory_cgroup and task_struct members under > > > CONFIG_MEMCG_V1 as well. > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > to discuss and agree on the overall approach. > > > > > > Some open questions, opinions are appreciated: > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > 2) Do we want to extend it beyond the memory controller? Should > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > Hi Roman, > > > > A very timely and important topic and we should definitely talk about it > > during LSFMM as well. I have been thinking about this problem for quite > > sometime and I am getting more and more convinced that we should aim to > > completely deprecate memcg-v1. > > > > More specifically: > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > structual restrictions) > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > IMO we should systematically start deprecating memcg-v1 features and > > start unblocking the users stuck on memcg-v1. > > > > Now regarding the proposal in this series, I think it can be a first > > step but should not give an impression that we are done. The only > > concern I have is the potential of "out of sight, out of mind" situation > > with this change but if we keep the momentum of deprecation of memcg-v1 > > it should be fine. > > > > I have CCed Greg and David from Google to get their opinion on what > > memcg-v1 features are blocker for their memcg-v2 migration and if they > > have concern in deprecation of memcg-v1 features. > > > > Anyone else still on memcg-v1, please do provide your input. 
> > > > Hi, > > Sorry for joining the discussion late, but I'd like to add some info > here: We are using the "memsw" feature a lot. It's a very useful knob > for container memory overcommitting: It's a great abstraction of the > "expected total memory usage" of a container, so containers can't > allocate too much memory using SWAP, but still be able to SWAP out. > > For a simple example, with memsw.limit == memory.limit, containers > can't exceed their original memory limit, even with SWAP enabled, they > get OOM killed as how they used to, but the host is now able to > offload cold pages. > > Similar ability seems absent with V2: With memory.swap.max == 0, the > host can't use SWAP to reclaim container memory at all. But with a > value larger than that, containers are able to overuse memory, causing > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily > out of balance, especially with compress SWAP backends. > > Cgroup accounting of ZSWAP/ZRAM doesn't really help, we want to > account for the total raw usage, not the compressed usage. One example > is that if a container uses tons of duplicated pages, then it can > allocate much more memory than it is limited, that could cause > trouble. So you don't need separate swap knobs, only combined, right? > I saw Chris also mentioned Google has a workaround internally for it > for Cgroup V2. This will be a blocker for us and a similar workaround > might be needed. It will be great so see an upstream support for this. I think that _at least_ we should refactor the code so that it would be a minimal patch (e.g. one #define) to switch to the old mode. I don't think it's reasonable to add a new interface, but having a patch/config option or even a mount option which changes the semantics of memory.swap.max to the v1-like behavior should be ok. I'll try to do the first part (refactoring this code), and we can have a discussion from there. Thanks!
On Thu, May 23, 2024 at 12:56 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote: > > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote: > > > > Cgroups v2 have been around for a while and many users have fully adopted them, > > > > so they never use cgroups v1 features and functionality. Yet they have to "pay" > > > > for the cgroup v1 support anyway: > > > > 1) the kernel binary contains useless cgroup v1 code, > > > > 2) some common structures like task_struct and mem_cgroup have never used > > > > cgroup v1-specific members, > > > > 3) some code paths have additional checks which are not needed. > > > > > > > > Cgroup v1's memory controller has a number of features that are not supported > > > > by cgroup v2 and their implementation is pretty much self contained. > > > > Most notably, these features are: soft limit reclaim, oom handling in userspace, > > > > complicated event notification system, charge migration. > > > > > > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's > > > > intervened with generic and cgroup v2-specific code. It's a burden on > > > > developers and maintainers. > > > > > > > > This patchset aims to solve these problems by: > > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file, > > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the > > > > mm/internal.h header > > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default > > > > 4) making memcontrol-v1.c to compile only if CONFIG_MEMCG_V1 is set > > > > 5) putting unused struct memory_cgroup and task_struct members under > > > > CONFIG_MEMCG_V1 as well. > > > > > > > > This is an RFC version, which is not 100% polished yet, so but it would be great > > > > to discuss and agree on the overall approach. > > > > > > > > Some open questions, opinions are appreciated: > > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have > > > > mem_cgroup_v1_ prefix. Is this a good idea? > > > > 2) Do we want to extend it beyond the memory controller? Should > > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of > > > > mm/internal.h? Or mm/memcontrol-v1.h. > > > > > > > > > > Hi Roman, > > > > > > A very timely and important topic and we should definitely talk about it > > > during LSFMM as well. I have been thinking about this problem for quite > > > sometime and I am getting more and more convinced that we should aim to > > > completely deprecate memcg-v1. > > > > > > More specifically: > > > > > > 1. What are the memcg-v1 features which have no alternative in memcg-v2 > > > and are blocker for memcg-v1 users? (setting aside the cgroup v2 > > > structual restrictions) > > > > > > 2. What are unused memcg-v1 features which we should start deprecating? > > > > > > IMO we should systematically start deprecating memcg-v1 features and > > > start unblocking the users stuck on memcg-v1. > > > > > > Now regarding the proposal in this series, I think it can be a first > > > step but should not give an impression that we are done. The only > > > concern I have is the potential of "out of sight, out of mind" situation > > > with this change but if we keep the momentum of deprecation of memcg-v1 > > > it should be fine. 
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > have concern in deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> >
> > Hi,
> >
> > Sorry for joining the discussion late, but I'd like to add some info
> > here: We are using the "memsw" feature a lot. It's a very useful knob
> > for container memory overcommitting: It's a great abstraction of the
> > "expected total memory usage" of a container, so containers can't
> > allocate too much memory using SWAP, but still be able to SWAP out.

+Michal,

Just FYI, we do have companies like Tencent using the V1 combined memsw
limit as well. Google is not the only company using this API.

> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> >
> > [...]
>
> So you don't need separate swap knobs, only combined, right?
>
> > I saw Chris also mentioned Google has a workaround internally for it
> > for Cgroup V2. This will be a blocker for us and a similar workaround
> > might be needed. It will be great so see an upstream support for this.
>
> I think that _at least_ we should refactor the code so that it would
> be a minimal patch (e.g. one #define) to switch to the old mode.

It would be great to have a path forward that allows cgroup V2 to provide
the combined memsw limit.

> I don't think it's reasonable to add a new interface, but having a
> patch/config option or even a mount option which changes the semantics
> of memory.swap.max to the v1-like behavior should be ok.

Using a sysctl or a slightly different cgroup API is fine. The feature
just needs to be there.

> I'll try to do the first part (refactoring this code), and we can have
> a discussion from there.

Looking forward to it.

Thanks

Chris
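To put illustrative numbers on the difference (not figures from the thread): with v1, memory.limit_in_bytes = 1G and memory.memsw.limit_in_bytes = 1G cap the workload's RAM plus swap at 1G total, so swapping a page out never buys the container extra headroom beyond its original budget. With v2, memory.max = 1G and memory.swap.max = 1G are two independent caps, so the same container can reach roughly 2G of combined footprint (1G resident plus 1G swapped) before anything stops it, which is the delayed-OOM and thrashing behavior described above, while memory.swap.max = 0 removes swap as a reclaim option for the host altogether.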
On Fri, May 24, 2024 at 3:55 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote:
> > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > [...]
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blocker for their memcg-v2 migration and if they
> > > have concern in deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> >
> > Hi,
> >
> > Sorry for joining the discussion late, but I'd like to add some info
> > here: We are using the "memsw" feature a lot. It's a very useful knob
> > for container memory overcommitting: It's a great abstraction of the
> > "expected total memory usage" of a container, so containers can't
> > allocate too much memory using SWAP, but still be able to SWAP out.
> >
> > [...]
>
> So you don't need separate swap knobs, only combined, right?

Yes, currently we use either combined or separate knobs.

> > I saw Chris also mentioned Google has a workaround internally for it
> > for Cgroup V2. This will be a blocker for us and a similar workaround
> > might be needed. It will be great so see an upstream support for this.
>
> I think that _at least_ we should refactor the code so that it would
> be a minimal patch (e.g. one #define) to switch to the old mode.
>
> I don't think it's reasonable to add a new interface, but having a
> patch/config option or even a mount option which changes the semantics
> of memory.swap.max to the v1-like behavior should be ok.
>
> I'll try to do the first part (refactoring this code), and we can have
> a discussion from there.

Thanks, that sounds like a good start.
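A rough sketch of one possible shape of the "one #define to switch modes" idea discussed above. The option and helper names are hypothetical and not part of the series; it only assumes the existing v2 'swap' and v1 'memsw' page counters in struct mem_cgroup:

/*
 * Hypothetical illustration only. If the refactored charging paths always
 * pick the swap counter through one macro, flipping between v2-style
 * separate swap accounting and v1-style combined memory+swap accounting
 * becomes a single-line (or single-Kconfig) change.
 */
#ifdef CONFIG_MEMCG_COMBINED_SWAP_LIMIT                 /* hypothetical option */
#define memcg_swap_counter(memcg)       (&(memcg)->memsw)   /* count swap against memory+swap */
#else
#define memcg_swap_counter(memcg)       (&(memcg)->swap)    /* default v2: a separate swap cap */
#endif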
Cgroups v2 have been around for a while and many users have fully adopted them,
so they never use cgroup v1 features and functionality. Yet they have to "pay"
for the cgroup v1 support anyway:
1) the kernel binary contains useless cgroup v1 code,
2) some common structures like task_struct and mem_cgroup contain cgroup
   v1-specific members which are never used,
3) some code paths have additional checks which are not needed.

Cgroup v1's memory controller has a number of features that are not supported
by cgroup v2, and their implementation is pretty much self-contained.
Most notably, these features are: soft limit reclaim, oom handling in userspace,
a complicated event notification system, and charge migration.

Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
intertwined with generic and cgroup v2-specific code. It's a burden on
developers and maintainers.

This patchset aims to solve these problems by:
1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
   mm/internal.h header,
3) introducing the CONFIG_MEMCG_V1 config option, turned on by default,
4) making memcontrol-v1.c compile only if CONFIG_MEMCG_V1 is set,
5) putting unused struct mem_cgroup and task_struct members under
   CONFIG_MEMCG_V1 as well.

This is an RFC version, which is not 100% polished yet, but it would be great
to discuss and agree on the overall approach.

Some open questions, opinions are appreciated:
1) I'm considering renaming non-static functions in memcontrol-v1.c to have
   a mem_cgroup_v1_ prefix. Is this a good idea?
2) Do we want to extend it beyond the memory controller? Should
3) Is it better to use a new include/linux/memcontrol-v1.h instead of
   mm/internal.h? Or mm/memcontrol-v1.h?

diffstat:

 include/linux/memcontrol.h |  165 ++++---
 include/linux/sched.h      |    5 +-
 init/Kconfig               |    7 +
 mm/Makefile                |    2 +
 mm/internal.h              |  124 +++++
 mm/memcontrol-v1.c         | 2941 +++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c            | 4121 ++++++++----------------------------------
 7 files changed, 3765 insertions(+), 3600 deletions(-)

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Roman Gushchin (9):
  mm: memcg: introduce memcontrol-v1.c
  mm: memcg: move soft limit reclaim code to memcontrol-v1.c
  mm: memcg: move charge migration code to memcontrol-v1.c
  mm: memcg: move legacy memcg event code into memcontrol-v1.c
  mm: memcg: move cgroup v1 interface files to memcontrol-v1.c
  mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c
  mm: memcg: put cgroup v1-specific code under a config option
  mm: memcg: put corresponding struct mem_cgroup members under CONFIG_MEMCG_V1
  mm: memcg: put cgroup v1-related members of task_struct under config option

 include/linux/memcontrol.h |  165 +-
 include/linux/sched.h      |    5 +-
 init/Kconfig               |    7 +
 mm/Makefile                |    2 +
 mm/internal.h              |  124 ++
 mm/memcontrol-v1.c         | 2941 +++++++++++++++++++++++++
 mm/memcontrol.c            | 4121 ++++++------------------------------
 7 files changed, 3765 insertions(+), 3600 deletions(-)
 create mode 100644 mm/memcontrol-v1.c
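As a standalone illustration of point 5) above, compiling v1-only members out under the new option is what shrinks the shared structures. The member names below are hypothetical stand-ins, not the actual fields touched by the patches:

/*
 * Sketch of the pattern only: when CONFIG_MEMCG_V1 is off, the legacy-only
 * members simply disappear from the struct and the binary.
 */
#include <stdio.h>

#define CONFIG_MEMCG_V1 1       /* pretend the Kconfig option is enabled */

struct mem_cgroup_example {
        unsigned long memory_usage;     /* needed by both v1 and v2 */
#ifdef CONFIG_MEMCG_V1
        unsigned long memsw_usage;      /* v1-only: combined memory+swap usage */
        unsigned long soft_limit;       /* v1-only: soft limit reclaim target */
#endif
};

int main(void)
{
        /* toggling the option above changes the struct size */
        printf("sizeof(struct mem_cgroup_example) = %zu\n",
               sizeof(struct mem_cgroup_example));
        return 0;
}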