| Message ID | 1562310330-16074-1-git-send-email-laoar.shao@gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | mm, memcg: support memory.{min, low} protection in cgroup v1 |
On Fri 05-07-19 15:05:30, Yafang Shao wrote:
> We always deploy many containers on one host. Some of these containers
> have high priority, while others have low priority. memory.{min, low}
> is useful to help us protect the page cache of a specified container to
> gain better performance. But currently it is only supported in cgroup v2.
> To support it in cgroup v1, we only need to make small changes, as the
> facility already exists. This patch exposes two files to the user in
> cgroup v1: memory.min and memory.low. These two files are used the same
> way as in cgroup v2. Both hierarchical and non-hierarchical modes are
> supported.

Cgroup v1 API is considered frozen, with new features added only to v2.
Why can't you move over to v2 rather than stick with v1?
On Fri, Jul 5, 2019 at 5:09 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 05-07-19 15:05:30, Yafang Shao wrote:
> [...]
> Cgroup v1 API is considered frozen, with new features added only to v2.

The facilities support both cgroup v1 and cgroup v2; all we need to do is
expose the interface. If the cgroup v1 API is frozen, that would be a pity.

> Why can't you move over to v2 rather than stick with v1?

Because the interfaces between cgroup v1 and cgroup v2 have changed too
much, which is unacceptable to our customers. It may take a long time
before cgroup v2 can be used in production environments, in my
understanding.

BTW, the filesystem on our servers is XFS, but cgroup v2 writeback
throttling is still not supported on XFS, which is beyond my
comprehension.

Thanks
Yafang
On Fri 05-07-19 17:41:44, Yafang Shao wrote:
> On Fri, Jul 5, 2019 at 5:09 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > Why can't you move over to v2 rather than stick with v1?
> Because the interfaces between cgroup v1 and cgroup v2 have changed too
> much, which is unacceptable to our customers.

Could you be more specific about the obstacles with respect to the
interfaces, please?

> It may take a long time before cgroup v2 can be used in production
> environments, in my understanding.
> BTW, the filesystem on our servers is XFS, but cgroup v2 writeback
> throttling is still not supported on XFS, which is beyond my
> comprehension.

Are you sure? I would be surprised if v1 throttling worked while v2
didn't. As far as I remember, it is the v2 writeback throttling that
actually works. The only throttling we have for v1 is the reclaim-based
one, which is a huge hammer.
On Fri, Jul 5, 2019 at 7:10 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 05-07-19 17:41:44, Yafang Shao wrote:
> [...]
> Could you be more specific about the obstacles with respect to the
> interfaces, please?

Lots of applications would have to be changed. Kubernetes, Docker and
some other applications are using cgroup v1, and that will be trouble,
because they are not maintained by us.

> > It may take a long time before cgroup v2 can be used in production
> > environments, in my understanding.
> > BTW, the filesystem on our servers is XFS, but cgroup v2 writeback
> > throttling is still not supported on XFS, which is beyond my
> > comprehension.
>
> Are you sure? I would be surprised if v1 throttling worked while v2
> didn't. As far as I remember, it is the v2 writeback throttling that
> actually works. The only throttling we have for v1 is the reclaim-based
> one, which is a huge hammer.

We did it in cgroup v1 in our kernel, but upstream still doesn't support
it in cgroup v2. So my real question is: why can't upstream support such
an important filesystem? Do you know which companies besides Facebook are
using cgroup v2 in their production environment?

Thanks
Yafang
cc linux-xfs

On Fri, Jul 05, 2019 at 10:33:04PM +0800, Yafang Shao wrote:
> On Fri, Jul 5, 2019 at 7:10 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > Are you sure? I would be surprised if v1 throttling worked while v2
> > didn't. As far as I remember, it is the v2 writeback throttling that
> > actually works. The only throttling we have for v1 is the
> > reclaim-based one, which is a huge hammer.
>
> We did it in cgroup v1 in our kernel, but upstream still doesn't
> support it in cgroup v2. So my real question is: why can't upstream
> support such an important filesystem? Do you know which companies
> besides Facebook are using cgroup v2 in their production environment?

I think the original issue with regard to XFS cgroupv2 writeback
throttling support was that at the time the XFS patch was proposed, there
wasn't any test coverage to prove that the code worked (and the original
author never followed up). That has since been resolved and Christoph has
recently posted a new patch [1], which appears to have been accepted by
the maintainer.

Brian

[1] https://marc.info/?l=linux-xfs&m=156138379906141&w=2
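For background on what "supporting" cgroup v2 writeback throttling means for a filesystem: the block layer does the actual throttling, and the filesystem mostly has to opt in and tag its writeback bios with the owning cgroup. Below is a rough sketch of the 5.2-era hooks; the `example_*` function names are illustrative, and this is not the actual XFS patch:

```c
#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/bio.h>

/*
 * Rough sketch (not the XFS patch) of the hooks a filesystem needs
 * for cgroup-aware writeback with the 5.2-era API.
 */
static int example_fill_super(struct super_block *sb)
{
	/*
	 * Opt in: without SB_I_CGROUPWB, all writeback is attributed
	 * to the root cgroup and v2 throttling has nothing to act on.
	 */
	sb->s_iflags |= SB_I_CGROUPWB;
	return 0;
}

static void example_submit_writeback_bio(struct writeback_control *wbc,
					 struct bio *bio)
{
	/*
	 * Associate the bio with the blkcg of the cgroup that dirtied
	 * the pages, so the block-layer io controller can throttle it.
	 */
	wbc_init_bio(wbc, bio);
	submit_bio(bio);
}
```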
Yafang Shao writes:
>> Cgroup v1 API is considered frozen, with new features added only to v2.
>
> The facilities support both cgroup v1 and cgroup v2; all we need to do
> is expose the interface. If the cgroup v1 API is frozen, that would be
> a pity.

This might be true in the absolute purest technical sense, but not in a
practical one. Just exposing the memory protection interface without
making it comprehend v1's API semantics seems a bad move to me -- for
example, how it (and things like effective protections) interacts with a
hierarchy that lacks the no-internal-processes constraint, and certainly
many, many more things that nobody has realised are not going to work
yet. And to that extent, you're really implicitly asking for a lot of
work and evaluation to be done on memory protections for an interface
which is already frozen. I'm quite strongly against that.

> Because the interfaces between cgroup v1 and cgroup v2 have changed too
> much, which is unacceptable to our customers.

The problem is that you're explicitly requesting to use functionality
which under the hood relies on that new interface, while also requesting
not to use that new interface at the same time :-)

While it may superficially work without it, I'm sceptical that simply
adding memory.low and memory.min to the v1 hierarchy is going to end up
with reasonable results under scrutiny, or a coherent system or API.
On Fri 05-07-19 22:33:04, Yafang Shao wrote:
> On Fri, Jul 5, 2019 at 7:10 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > Could you be more specific about the obstacles with respect to the
> > interfaces, please?
>
> Lots of applications would have to be changed. Kubernetes, Docker and
> some other applications are using cgroup v1, and that will be trouble,
> because they are not maintained by us.

Do they actually have to change, or can they simply use v2? I mean, how
many of them really rely on having tasks in intermediate nodes or on
per-thread cgroups? Those should be the most visible changes in the
interface except for the control-file naming. If it is purely about the
naming, then it should be quite trivial to update, no?

Brian has already answered the XFS part, I believe. I am not really
familiar with that topic so I cannot comment anyway.

> Do you know which companies besides Facebook are using cgroup v2 in
> their production environment?

I do not really know who those users are, but a wider decision has been
made that v2 is going to be a rework of the interface and that v1 will be
preserved and maintained forever for backward compatibility. If there are
use cases which cannot use v2 for some fundamental reason, then we really
want to hear about them. And if v2 really is not usable, we can think of
adding features to v1, of course.
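For readers unfamiliar with the difference Michal refers to: under v2, a non-root cgroup cannot both host processes and delegate domain controllers to its children. A minimal sketch of tripping over that rule (assuming a conventional v2 mount at /sys/fs/cgroup and an existing "parent" cgroup; the exact errno is EBUSY on kernels of this era, and error handling is trimmed):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Tiny helper: write a string into a cgroup control file. */
static int write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, s, strlen(s));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	/* Put this process into a non-root v2 cgroup... */
	write_str("/sys/fs/cgroup/parent/cgroup.procs", "0");

	/*
	 * ...then try to enable the memory controller for its
	 * children. On v2 this fails while the cgroup has member
	 * processes; v1 has no such constraint, which is part of what
	 * makes memory.{min,low} semantics harder to map onto v1.
	 */
	if (write_str("/sys/fs/cgroup/parent/cgroup.subtree_control",
		      "+memory"))
		perror("subtree_control");
	return 0;
}
```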
On Fri, Jul 5, 2019 at 11:11 PM Brian Foster <bfoster@redhat.com> wrote:
> [...]
> I think the original issue with regard to XFS cgroupv2 writeback
> throttling support was that at the time the XFS patch was proposed,
> there wasn't any test coverage to prove that the code worked (and the
> original author never followed up). That has since been resolved and
> Christoph has recently posted a new patch [1], which appears to have
> been accepted by the maintainer.
>
> [1] https://marc.info/?l=linux-xfs&m=156138379906141&w=2

Thanks for the reference. I will pay attention to that thread.
On Fri, Jul 5, 2019 at 11:52 PM Chris Down <chris@chrisdown.name> wrote:
> This might be true in the absolute purest technical sense, but not in a
> practical one. Just exposing the memory protection interface without
> making it comprehend v1's API semantics seems a bad move to me -- for
> example, how it (and things like effective protections) interacts with
> a hierarchy that lacks the no-internal-processes constraint, and
> certainly many, many more things that nobody has realised are not going
> to work yet.

Hmm? Would you be more specific about the issues without the
no-internal-processes constraint? The memcg LRU scan/reclaim works fine
on both cgroup v1 and cgroup v2, so the memcg LRU protection should work
fine on both cgroup v1 and cgroup v2 as well. IOW, if there are issues
without the internal-process constraint, then there must be issues in the
cgroup v1 LRU.

> And to that extent, you're really implicitly asking for a lot of work
> and evaluation to be done on memory protections for an interface which
> is already frozen. I'm quite strongly against that.
>
> The problem is that you're explicitly requesting to use functionality
> which under the hood relies on that new interface, while also
> requesting not to use that new interface at the same time :-)

I'm sorry, but I really can't get your point. As I said, the facility is
already there. The memcg LRU is not designed for cgroup v2 only.

> While it may superficially work without it, I'm sceptical that simply
> adding memory.low and memory.min to the v1 hierarchy is going to end up
> with reasonable results under scrutiny, or a coherent system or API.

I have finished some simple stress tests, and the results are fine.
Would you please give me some suggestions on how to test it if you think
there may be some issues?

Thanks
Yafang
On Fri, Jul 05, 2019 at 11:10:45AM -0400, Brian Foster wrote:
> [...]
> I think the original issue with regard to XFS cgroupv2 writeback
> throttling support was that at the time the XFS patch was proposed,
> there wasn't any test coverage to prove that the code worked (and the
> original author never followed up). That has since been resolved and
> Christoph has recently posted a new patch [1], which appears to have
> been accepted by the maintainer.

I don't think the validation issue has been resolved.

i.e. we still don't have regression tests that ensure it keeps working in
the future, or that it works correctly in any specific distro
setting/configuration. The lack of repeatable QoS validation
infrastructure was the reason I never merged support for this in the
first place.

So while the (simple) patch to support it has been merged now, there's no
guarantee that it will work as expected or continue to do so over the
long run, as nobody upstream or in distro land has a way of validating
that it is working correctly.

From that perspective, it is still my opinion that one-off "works for me"
testing isn't sufficient validation for a QoS feature that people will
use to implement SLAs with $$$ penalties attached to QoS failures....

Cheers,

Dave.
On Sat, Jul 6, 2019 at 3:54 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 05-07-19 22:33:04, Yafang Shao wrote:
> [...]
> Do they actually have to change, or can they simply use v2? I mean, how
> many of them really rely on having tasks in intermediate nodes or on
> per-thread cgroups? Those should be the most visible changes in the
> interface except for the control-file naming. If it is purely about the
> naming, then it should be quite trivial to update, no?

This is not a technical issue. There are many other factors we have to
consider, e.g. the cost. One simple example is in the public cloud: if we
upgrade our system and force the customers to modify their user code to
run on our new system, we may lose these customers. As this is not really
a technical issue, I don't think we can make a clear conclusion here.

> Brian has already answered the XFS part, I believe. I am not really
> familiar with that topic so I cannot comment anyway.
> [...]
Yafang Shao writes:
>> While it may superficially work without it, I'm sceptical that simply
>> adding memory.low and memory.min to the v1 hierarchy is going to end
>> up with reasonable results under scrutiny, or a coherent system or
>> API.
>
> I have finished some simple stress tests, and the results are fine.
> Would you please give me some suggestions on how to test it if you
> think there may be some issues?

My concerns are about introducing an API that requires propagation of
control into an API that doesn't delineate between using values for
*propagation* and using values for *consumption*, and this becomes
significantly more important when we're talking about something that
necessitates active propagation of values, like memory.{low,min}. That's
one reason why I bring up concerns related to the fact that v1 allows
processes in intermediate cgroups.

So again, I'm not expecting that anything necessarily technically goes
wrong (hence my comment at the beginning), just that it doesn't
necessarily compose to a reasonable thing to have in v1. My concerns are
almost strictly about exposing this as an interface in an API that
doesn't have the qualities that make the end result reasonable.
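To make the propagation/consumption distinction concrete: on v2, the value the reclaimer actually honors is an *effective* protection derived from every ancestor on the path to the root, not the raw file value. Here is a simplified userspace model of that derivation, a sketch of the 5.2-era proportional distribution rather than kernel code; the function name and the flat parameter list are my own simplification:

```c
#include <stdio.h>

/*
 * Simplified model (sketch only) of cgroup v2 effective protection:
 * a child's protection is capped by its parent's effective value and
 * scaled by the child's share of the siblings' protected usage.
 */
static unsigned long effective_low(unsigned long low,
				   unsigned long usage,
				   unsigned long parent_elow,
				   unsigned long siblings_low_usage)
{
	unsigned long elow = low < parent_elow ? low : parent_elow;
	unsigned long low_usage = usage < low ? usage : low;
	unsigned long share;

	if (!low || !parent_elow || !siblings_low_usage)
		return elow;

	/* Proportional distribution of the parent's protection. */
	share = parent_elow * low_usage / siblings_low_usage;
	return elow < share ? elow : share;
}

int main(void)
{
	/*
	 * Parent's effective low is 100 (MB, say). Two children each
	 * set low=100 and use 100, so each one's claim is halved.
	 */
	printf("child elow = %lu\n", effective_low(100, 100, 100, 200));
	return 0;
}
```

Note what this model has no slot for: processes living directly in the parent, as v1 allows, consume memory that belongs to no child's protected share, which is exactly the interaction the v1 interface never had to define.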
On Sat, Jul 06, 2019 at 09:52:22AM +1000, Dave Chinner wrote:
> [...]
> I don't think the validation issue has been resolved.
>
> i.e. we still don't have regression tests that ensure it keeps working
> in the future, or that it works correctly in any specific distro
> setting/configuration. The lack of repeatable QoS validation
> infrastructure was the reason I never merged support for this in the
> first place.
>
> [...]
>
> From that perspective, it is still my opinion that one-off "works for
> me" testing isn't sufficient validation for a QoS feature that people
> will use to implement SLAs with $$$ penalties attached to QoS
> failures....

We do have an fstest to cover the accounting bits (which is what the fs
is responsible for). Christoph also sent a patch [1] to enable that on
XFS. I'm sure there's plenty of room for additional/broader test
coverage, of course...

Brian

[1] https://marc.info/?l=fstests&m=156138385006173&w=2
```diff
diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index a33cedf..7178247 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -63,6 +63,10 @@ Brief summary of control files.
 				 (See 5.5 for details)
  memory.limit_in_bytes		 # set/show limit of memory usage
  memory.memsw.limit_in_bytes	 # set/show limit of memory+Swap usage
+ memory.min			 # set/show hard memory protection
+				 (See ../admin-guide/cgroup-v2.rst for details)
+ memory.low			 # set/show best-effort memory protection
+				 (See ../admin-guide/cgroup-v2.rst for details)
  memory.failcnt			 # show the number of memory usage hits limits
  memory.memsw.failcnt		 # show the number of memory+Swap hits limits
  memory.max_usage_in_bytes	 # show max memory usage recorded
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3ee806b..58dce75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -169,6 +169,12 @@ struct mem_cgroup_event {
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
+static int memory_min_show(struct seq_file *m, void *v);
+static ssize_t memory_min_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off);
+static int memory_low_show(struct seq_file *m, void *v);
+static ssize_t memory_low_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off);
 
 /* Stuffs for move charges at task migration. */
 /*
@@ -4288,6 +4294,18 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 		.read_u64 = mem_cgroup_read_u64,
 	},
 	{
+		.name = "min",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_min_show,
+		.write = memory_min_write,
+	},
+	{
+		.name = "low",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_low_show,
+		.write = memory_low_write,
+	},
+	{
 		.name = "failcnt",
 		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
 		.write = mem_cgroup_reset,
@@ -5925,7 +5943,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 	parent = parent_mem_cgroup(memcg);
 	/* No parent means a non-hierarchical mode on v1 memcg */
 	if (!parent)
-		return MEMCG_PROT_NONE;
+		goto exit;
 
 	if (parent == root)
 		goto exit;
```
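The only functional change beyond interface plumbing is the last hunk: before the patch, mem_cgroup_protected() bailed out with MEMCG_PROT_NONE as soon as a memcg had no parent (the non-hierarchical v1 case), so min/low set on such a group were ignored; with the goto, the group's own protection is still applied. A condensed model of the resulting decision follows; this is a sketch based on the 5.2-era function, with the hierarchical emin/elow derivation elided:

```c
enum mem_cgroup_protection { MEMCG_PROT_NONE, MEMCG_PROT_LOW, MEMCG_PROT_MIN };

/*
 * Condensed model of the tail of mem_cgroup_protected() after the
 * patch (sketch only; the real function first derives effective
 * protection from the ancestors before getting here).
 */
static enum mem_cgroup_protection protection(unsigned long usage,
					     unsigned long emin,
					     unsigned long elow)
{
	/*
	 * With "goto exit", a parentless (non-hierarchical v1) memcg
	 * now reaches these checks with its own memory.min/memory.low
	 * instead of returning MEMCG_PROT_NONE early.
	 */
	if (usage <= emin)
		return MEMCG_PROT_MIN;	/* do not reclaim */
	if (usage <= elow)
		return MEMCG_PROT_LOW;	/* reclaim only under heavy pressure */
	return MEMCG_PROT_NONE;
}
```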
We always deploy many containers on one host. Some of these containers
have high priority, while others have low priority. memory.{min, low} is
useful to help us protect the page cache of a specified container to gain
better performance. But currently it is only supported in cgroup v2.

To support it in cgroup v1, we only need to make small changes, as the
facility already exists. This patch exposes two files to the user in
cgroup v1: memory.min and memory.low. These two files are used the same
way as in cgroup v2. Both hierarchical and non-hierarchical modes are
supported.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
---
 Documentation/cgroup-v1/memory.txt |  4 ++++
 mm/memcontrol.c                    | 20 +++++++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)
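As a usage illustration of the files the patch exposes: setting them from userspace looks the same as on v2, just under the v1 mount point. This is a sketch only; the /sys/fs/cgroup/memory mount point and the "high_prio" group name are assumptions about a conventional v1 setup, not something the patch mandates:

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a value into a memcg v1 control file.
 * Path layout is the conventional v1 mount, an assumption here.
 */
static int memcg_write(const char *grp, const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/memory/%s/%s", grp, file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* Guarantee 1G and best-effort protect 2G for "high_prio". */
	if (memcg_write("high_prio", "memory.min", "1073741824") ||
	    memcg_write("high_prio", "memory.low", "2147483648"))
		perror("memcg_write");
	return 0;
}
```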