| Message ID | 20190123223144.GA10798@chrisdown.name (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [v2,1/2] mm: Rename ambiguously named memory.stat counters and functions |
On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote:
> memory.stat and other files already consider subtrees in their output,
> and we should too in order to not present an inconsistent interface.
>
> The current situation is fairly confusing, because people interacting
> with cgroups expect hierarchical behaviour in the vein of memory.stat,
> cgroup.events, and other files. For example, this causes confusion when
> debugging reclaim events under low, as currently these always read "0"
> at non-leaf memcg nodes, which frequently causes people to misdiagnose
> breach behaviour. The same confusion applies to other counters in this
> file when debugging issues.
>
> Aggregation is done at write time instead of at read-time since these
> counters aren't hot (unlike memory.stat which is per-page, so it does it
> at read time), and it makes sense to bundle this with the file
> notifications.

I agree with the consistency argument (matching cgroup.events, ...),
and it definitely looks better for oom* events, but at the same time it
feels like an API break.

Just for example, let's say you have a delegated sub-tree with memory.max
set. Earlier, getting a memory.high/max event meant that the whole
sub-tree is tight on memory and, for example, led to shutdown of some
parts of the tree. After your change, it might mean that some sub-cgroup
has reached its limit, which probably doesn't matter on the top level.

Maybe it's still ok, but we definitely need to document it better. It
feels bad that different versions of the kernel will handle it
differently, so userspace has to work around it to actually use these
events.

Also, please, make sure that it doesn't break memcg kselftests.

>
> After this patch, events are propagated up the hierarchy:
>
> [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> low 0
> high 0
> max 0
> oom 0
> oom_kill 0
> [root@ktst ~]# systemd-run -p MemoryMax=1 true
> Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
> [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
> low 0
> high 0
> max 7
> oom 1
> oom_kill 1
>
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> To: Andrew Morton <akpm@linux-foundation.org>

s/To/CC

> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Dennis Zhou <dennis@kernel.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: kernel-team@fb.com
> ---
>  include/linux/memcontrol.h | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 380a212a8c52..5428b372def4 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -769,8 +769,10 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
>  static inline void memcg_memory_event(struct mem_cgroup *memcg,
>  				      enum memcg_memory_event event)
>  {
> -	atomic_long_inc(&memcg->memory_events[event]);
> -	cgroup_file_notify(&memcg->events_file);
> +	do {
> +		atomic_long_inc(&memcg->memory_events[event]);
> +		cgroup_file_notify(&memcg->events_file);
> +	} while ((memcg = parent_mem_cgroup(memcg)));

We don't have a memory.events file for the root cgroup, so we can stop
earlier.

Thanks!
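[For reference, a minimal sketch of how the loop could terminate before the root memcg, along the lines Roman suggests. This is illustrative only, not the code that was eventually merged, and it assumes parent_mem_cgroup() and mem_cgroup_is_root() are usable at this point in memcontrol.h.]

/*
 * Sketch only: propagate the event up the tree, but stop before the
 * root cgroup, which exposes no memory.events file.
 */
static inline void memcg_memory_event(struct mem_cgroup *memcg,
				      enum memcg_memory_event event)
{
	do {
		atomic_long_inc(&memcg->memory_events[event]);
		cgroup_file_notify(&memcg->events_file);
	} while ((memcg = parent_mem_cgroup(memcg)) &&
		 !mem_cgroup_is_root(memcg));
}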
Roman Gushchin writes:
>On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote:
>> memory.stat and other files already consider subtrees in their output,
>> and we should too in order to not present an inconsistent interface.
>>
>> The current situation is fairly confusing, because people interacting
>> with cgroups expect hierarchical behaviour in the vein of memory.stat,
>> cgroup.events, and other files. For example, this causes confusion when
>> debugging reclaim events under low, as currently these always read "0"
>> at non-leaf memcg nodes, which frequently causes people to misdiagnose
>> breach behaviour. The same confusion applies to other counters in this
>> file when debugging issues.
>>
>> Aggregation is done at write time instead of at read-time since these
>> counters aren't hot (unlike memory.stat which is per-page, so it does it
>> at read time), and it makes sense to bundle this with the file
>> notifications.
>
>I agree with the consistency argument (matching cgroup.events, ...),
>and it definitely looks better for oom* events, but at the same time it
>feels like an API break.
>
>Just for example, let's say you have a delegated sub-tree with memory.max
>set. Earlier, getting a memory.high/max event meant that the whole
>sub-tree is tight on memory and, for example, led to shutdown of some
>parts of the tree. After your change, it might mean that some sub-cgroup
>has reached its limit, which probably doesn't matter on the top level.

Yeah, this is something I was thinking about while writing it. I think
there's an argument to be made either way, since functionally they can
both represent the same feature set, just in different ways.

In the subtree-propagated version you can find the level of the
hierarchy that the event fired at by checking parent events vs. their
subtrees' events, and this also allows trivially setting up event
watches per-subtree. In the previous, non-propagated version, it's
easier to work out the level, as the event only appears in that
memory.events file, but it's harder to actually find out about the
existence of such an event because you need to keep a watch on each
individual cgroup in the subtree at all times.

So I think there's a reasonable argument to be made in favour of
considering subtrees:

1. I'm not aware of anyone major currently relying on the individual
   subtree level to indicate only subtree-level events.

2. Being able to detect the level at which an event happened can be
   achieved in both versions by comparing event counters.

3. Having memory.events work like cgroup.events and others seems to fit
   with the principle of least astonishment.

That said, I agree that there's a tradeoff here, but in my experience
this behaviour more closely resembles user intuition and better matches
the overall semantics around hierarchical behaviour we've generally
established for cgroup v2.

>Maybe it's still ok, but we definitely need to document it better. It
>feels bad that different versions of the kernel will handle it
>differently, so userspace has to work around it to actually use these
>events.

That's perfectly reasonable. I'll update the documentation to match.

>Also, please, make sure that it doesn't break memcg kselftests.

For sure.

>We don't have a memory.events file for the root cgroup, so we can stop
>earlier.

Oh yeah, I missed that when changing from a for loop to do/while. I'll
fix that up, thanks.

Thanks for your feedback!
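[To illustrate point 2 of Chris's reply above: with propagated counters, the number of events that originated at a particular node can still be derived by subtracting the children's counts from the node's count. The helper below is purely illustrative; the function name and types are made up.]

/*
 * Illustrative only: in the propagated model, a node's hierarchical
 * count minus the sum of its immediate children's counts is the number
 * of events that originated at that node itself.
 */
long events_originating_here(long node_count, const long *child_counts,
			     int nr_children)
{
	long sum = 0;
	int i;

	for (i = 0; i < nr_children; i++)
		sum += child_counts[i];

	return node_count - sum;
}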
On Wed 23-01-19 17:31:44, Chris Down wrote:
> memory.stat and other files already consider subtrees in their output,
> and we should too in order to not present an inconsistent interface.
>
> The current situation is fairly confusing, because people interacting
> with cgroups expect hierarchical behaviour in the vein of memory.stat,
> cgroup.events, and other files. For example, this causes confusion when
> debugging reclaim events under low, as currently these always read "0"
> at non-leaf memcg nodes, which frequently causes people to misdiagnose
> breach behaviour. The same confusion applies to other counters in this
> file when debugging issues.
>
> Aggregation is done at write time instead of at read-time since these
> counters aren't hot (unlike memory.stat which is per-page, so it does it
> at read time), and it makes sense to bundle this with the file
> notifications.

I do not think we can do that, for two reasons. It breaks the existing
semantic userspace might depend on, and more importantly this is not a
correct behavior IMO.

You have to realize that stats are hierarchical because that is how we
account. Events, on the other hand, are a way to inform that something
has happened at a specific level of the tree. If you do not set up a
low/high/max limit then you simply cannot expect to be informed when
those get hit, because they cannot be hit by definition. Or to put it
another way, if you are waiting for those events you really want to know
the (sub)tree they happened in, and if you propagate the event up the
hierarchy you have a hard time telling that (you would basically have to
exclude all but the lowest one, and that is an awkward semantic at
best).

Maybe we want to document this better, but I do not see us changing the
behavior.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

btw. I do not see this patch posted anywhere, yet it already comes with
an ack. Have I just missed a previous version?
Hello, Michal. On Thu, Jan 24, 2019 at 09:22:52AM +0100, Michal Hocko wrote: > I do not think we can do that for two reasons. It breaks the existing > semantic userspace might depend on and more importantly this is not a > correct behavior IMO. This is a valid concern but I'll come back to this later. > You have to realize that stats are hierarchical because that is how we > account. Events represent a way to inform that something has happened at > the specific level of the tree though. If you do not setup low/high/max This isn't true. e.g. cgroup.events's populated event is hierarchical. Everything in cgroup should be hierarchical by default. > limit then you simply cannot expect to be informed those get hit because > they cannot by definition. Or put it other way, if you are waiting for > those events you really want to know the (sub)tree they happened and if > you propagate the event up the hierarchy you have hard time to tell that > (you would basically have to exclude all but the lowest one and that is > an awkward semantic at best. I don't think it's a good idea to argue this for each piece of information. Again, everything should be hierarchical unless there are clear and strong reasons against; otherwise, we end up with random mix of hierarchical and flat behaviors, something that we want to avoid the most - remember .use_hierarchy?. > Maybe we want to document this better but I do not see we are going to > change the behavior. I beg you to reconsider. This was a clear oversight and the cgroup2 usage is still relatively limited. We sure can add local-specific counters if needed but must not mix local and hierarchical counters without a clear way to tell what's what. Thanks.
On Thu 24-01-19 07:21:22, Tejun Heo wrote: > Hello, Michal. > > On Thu, Jan 24, 2019 at 09:22:52AM +0100, Michal Hocko wrote: > > I do not think we can do that for two reasons. It breaks the existing > > semantic userspace might depend on and more importantly this is not a > > correct behavior IMO. > > This is a valid concern but I'll come back to this later. > > > You have to realize that stats are hierarchical because that is how we > > account. Events represent a way to inform that something has happened at > > the specific level of the tree though. If you do not setup low/high/max > > This isn't true. e.g. cgroup.events's populated event is > hierarchical. Everything in cgroup should be hierarchical by default. > > > limit then you simply cannot expect to be informed those get hit because > > they cannot by definition. Or put it other way, if you are waiting for > > those events you really want to know the (sub)tree they happened and if > > you propagate the event up the hierarchy you have hard time to tell that > > (you would basically have to exclude all but the lowest one and that is > > an awkward semantic at best. > > I don't think it's a good idea to argue this for each piece of > information. Again, everything should be hierarchical unless there > are clear and strong reasons against; And I would argue that cgroups.events with its populated event is a different thing because that is inherently a hierarchical property. If you compare that to low, min, max and oom those are events very specific to the particular level of the hierarchy because it is an action at that level that the event informs about. E.g. max event down the hierarchy is a completely different thing from the upper level. That sounds like a pretty solid reason to differentiate here. > otherwise, we end up with random > mix of hierarchical and flat behaviors, something that we want to > avoid the most - remember .use_hierarchy?. > > > Maybe we want to document this better but I do not see we are going to > > change the behavior. > > I beg you to reconsider. This was a clear oversight and the cgroup2 > usage is still relatively limited. We sure can add local-specific > counters if needed but must not mix local and hierarchical counters > without a clear way to tell what's what. If you really have a usecase for hierarchical events then I think the only way without breaking existing userspace is to add a new set of events. But I still believe that the current behavior makes a lot of sense.
On Thu, Jan 24, 2019 at 09:22:52AM +0100, Michal Hocko wrote: > On Wed 23-01-19 17:31:44, Chris Down wrote: > > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > > btw. I do not see this patch posted anywhere yet it already comes with > an ack. Have I just missed a previous version? I reviewed it offline before Chris sent it out. I agree with the sentiment that the non-hierarchical behavior was an oversight, not a design decision. The arguments against the change don't convince me: the added difficulty of finding out local values is true for all other cgroup files as well. This is traded off with being able to detect any subtree state from the first level cgroups and drill down on-demand, without having to scan the entire tree on each monitoring interval. That's a trade-off we've made everywhere else, so this is simply an inconsistency, not a legitimate exception to the rule. We cannot fully eliminate a risk for regression, but it strikes me as highly unlikely, given the extremely young age of cgroup2-based system management and surrounding tooling. Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On Thu 24-01-19 11:00:10, Johannes Weiner wrote:
[...]
> We cannot fully eliminate a risk for regression, but it strikes me as
> highly unlikely, given the extremely young age of cgroup2-based system
> management and surrounding tooling.

I am not really sure what you consider young, but this interface is 4.0+
IIRC and cgroup v2 has been considered stable since 4.5 unless I
misremember, and that is not a short time period in my book. Changing
interfaces now represents a non-trivial risk, and so far I haven't heard
any actual usecase where the current semantic is actually wrong.
Inconsistency on its own is not a sufficient justification IMO.
On Thu, Jan 24, 2019 at 06:01:17PM +0100, Michal Hocko wrote: > On Thu 24-01-19 11:00:10, Johannes Weiner wrote: > [...] > > We cannot fully eliminate a risk for regression, but it strikes me as > > highly unlikely, given the extremely young age of cgroup2-based system > > management and surrounding tooling. > > I am not really sure what you consider young but this interface is 4.0+ > IIRC and the cgroup v2 is considered stable since 4.5 unless I > missrememeber and that is not a short time period in my book. If you read my sentence again, I'm not talking about the kernel but the surrounding infrastructure that consumes this data. The risk is not dependent on the age of the interface age, but on its adoption. > Changing interfaces now represents a non-trivial risk and so far I > haven't heard any actual usecase where the current semantic is > actually wrong. Inconsistency on its own is not a sufficient > justification IMO. It can be seen either way, and in isolation it wouldn't be wrong to count events on the local level. But we made that decision for the entire interface, and this file is the odd one out now. From that comprehensive perspective, yes, the behavior is wrong. It really confuses people who are trying to use it, because they *do* expect it to behave recursively. I'm really having a hard time believing there are existing cgroup2 users with specific expectations for the non-recursive behavior...
On Thu 24-01-19 13:23:28, Johannes Weiner wrote: > On Thu, Jan 24, 2019 at 06:01:17PM +0100, Michal Hocko wrote: > > On Thu 24-01-19 11:00:10, Johannes Weiner wrote: > > [...] > > > We cannot fully eliminate a risk for regression, but it strikes me as > > > highly unlikely, given the extremely young age of cgroup2-based system > > > management and surrounding tooling. > > > > I am not really sure what you consider young but this interface is 4.0+ > > IIRC and the cgroup v2 is considered stable since 4.5 unless I > > missrememeber and that is not a short time period in my book. > > If you read my sentence again, I'm not talking about the kernel but > the surrounding infrastructure that consumes this data. The risk is > not dependent on the age of the interface age, but on its adoption. You really have to assume the user visible interface is consumed shortly after it is exposed/considered stable in this case as cgroups v2 was explicitly called unstable for a considerable period of time. This is a general policy regarding user APIs in the kernel. I can see arguments a next release after introduction or in similar cases but this is 3 years ago. We already have distribution kernels based on 4.12 kernel and it is old comparing to 5.0. > > Changing interfaces now represents a non-trivial risk and so far I > > haven't heard any actual usecase where the current semantic is > > actually wrong. Inconsistency on its own is not a sufficient > > justification IMO. > > It can be seen either way, and in isolation it wouldn't be wrong to > count events on the local level. But we made that decision for the > entire interface, and this file is the odd one out now. From that > comprehensive perspective, yes, the behavior is wrong. I do see your point about consistency. But it is also important to consider the usability of this interface. As already mentioned, catching an oom event at a level where the oom doesn't happen and having hard time to identify that place without races is a not a straightforward API to use. So it might be really the case that the api is actually usable for its purpose. > It really > confuses people who are trying to use it, because they *do* expect it > to behave recursively. Then we should improve the documentation. But seriously these are no strong reasons to change a long term semantic people might rely on. > I'm really having a hard time believing there are existing cgroup2 > users with specific expectations for the non-recursive behavior... I can certainly imagine monitoring tools to hook at levels where limits are set and report events as they happen. It would be more than confusing to receive events for reclaim/ooms that hasn't happened at that level just because a delegated memcg down the hierarchy has decided to set a more restrictive limits. Really this is a very unexpected behavior change for anybody using that interface right now on anything but leaf memcgs.
Hello, Michal. On Fri, Jan 25, 2019 at 09:42:13AM +0100, Michal Hocko wrote: > > If you read my sentence again, I'm not talking about the kernel but > > the surrounding infrastructure that consumes this data. The risk is > > not dependent on the age of the interface age, but on its adoption. > > You really have to assume the user visible interface is consumed shortly > after it is exposed/considered stable in this case as cgroups v2 was > explicitly called unstable for a considerable period of time. This is a > general policy regarding user APIs in the kernel. I can see arguments a > next release after introduction or in similar cases but this is 3 years > ago. We already have distribution kernels based on 4.12 kernel and it is > old comparing to 5.0. We do change userland-visible behaviors if the existing behavior is buggy / misleading / confusing. For example, we recently changed how discard bytes are accounted (no longer included in write bytes or ios) and even how mincore(2) behaves, both of which are far older than cgroup2. The main considerations are the blast radius and existing use cases in these decisions. Age does contribute to it but mostly because they affect how widely the behavior may be depended upon. > > > Changing interfaces now represents a non-trivial risk and so far I > > > haven't heard any actual usecase where the current semantic is > > > actually wrong. Inconsistency on its own is not a sufficient > > > justification IMO. > > > > It can be seen either way, and in isolation it wouldn't be wrong to > > count events on the local level. But we made that decision for the > > entire interface, and this file is the odd one out now. From that > > comprehensive perspective, yes, the behavior is wrong. > > I do see your point about consistency. But it is also important to > consider the usability of this interface. As already mentioned, catching > an oom event at a level where the oom doesn't happen and having hard > time to identify that place without races is a not a straightforward API > to use. So it might be really the case that the api is actually usable > for its purpose. What if a user wants to monitor any ooms in the subtree tho, which is a valid use case? If local event monitoring is useful and it can be, let's add separate events which are clearly identifiable to be local. Right now, it's confusing like hell. > > It really > > confuses people who are trying to use it, because they *do* expect it > > to behave recursively. > > Then we should improve the documentation. But seriously these are no > strong reasons to change a long term semantic people might rely on. This is broken interface. We're mixing local and hierarchical numbers willy nilly without obvious way of telling them apart. > > I'm really having a hard time believing there are existing cgroup2 > > users with specific expectations for the non-recursive behavior... > > I can certainly imagine monitoring tools to hook at levels where limits > are set and report events as they happen. It would be more than > confusing to receive events for reclaim/ooms that hasn't happened at > that level just because a delegated memcg down the hierarchy has decided > to set a more restrictive limits. Really this is a very unexpected > behavior change for anybody using that interface right now on anything > but leaf memcgs. Sure, there's some probability this change may cause some disruptions although I'm pretty skeptical given that inner node event monitoring is mostly useless right now. 
However, there's also a lot of on-going and future costs everyone is paying because the interface is so confusing. Thanks.
On Fri 25-01-19 08:51:52, Tejun Heo wrote:
[...]
> > I do see your point about consistency. But it is also important to
> > consider the usability of this interface. As already mentioned, catching
> > an oom event at a level where the oom doesn't happen and having hard
> > time to identify that place without races is a not a straightforward API
> > to use. So it might be really the case that the api is actually usable
> > for its purpose.
>
> What if a user wants to monitor any ooms in the subtree tho, which is
> a valid use case?

How is that information useful without knowing which memcg the oom
applies to?

> If local event monitoring is useful and it can be,
> let's add separate events which are clearly identifiable to be local.
> Right now, it's confusing like hell.

From a backward-compatible POV it should be added as a new interface.
Please note that I understand that this might be confusing with the rest
of the cgroup APIs but considering that this is the first time somebody
is actually complaining and the interface is "production ready" for more
than three years I am not really sure the situation is all that bad.
Hello, Michal. On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote: > > What if a user wants to monitor any ooms in the subtree tho, which is > > a valid use case? > > How is that information useful without know which memcg the oom applies > to? For example, a workload manager watching over a subtree for a job with nested memory limits set by the job itself. It wants to take action (reporting and possibly other remediative actions) when something goes wrong in the delegated subtree but isn't involved in how the subtree is configured inside. > > If local event monitoring is useful and it can be, > > let's add separate events which are clearly identifiable to be local. > > Right now, it's confusing like hell. > > From a backward compatible POV it should be a new interface added. That sure is an option for use cases like above but it has the downside of carrying over the confusing interface into the indefinite future. Again, I'd like to point back at how we changed the accounting write and trim accounting because the benefits outweighted the risks. > Please note that I understand that this might be confusing with the rest > of the cgroup APIs but considering that this is the first time somebody > is actually complaining and the interface is "production ready" for more > than three years I am not really sure the situation is all that bad. cgroup2 uptake hasn't progressed that fast. None of the major distros or container frameworks are currently shipping with it although many are evaluating switching. I don't think I'm too mistaken in that we (FB) are at the bleeding edge in terms of adopting cgroup2 and its various new features and are hitting these corner cases and oversights in the process. If there are noticeable breakages arising from this change, we sure can backpaddle but I think the better course of action is fixing them up while we can. Thanks.
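[A rough illustration of the monitoring pattern discussed here: with hierarchical events, a manager watching a delegated subtree needs only one watch on the subtree's top-level memory.events, and walks down when it changes. cgroup v2 event files wake pollers when their contents change. The sketch below is illustrative; error handling is minimal and the cgroup path is made up.]

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	/* "/sys/fs/cgroup/job.slice" is a hypothetical delegated subtree */
	int fd = open("/sys/fs/cgroup/job.slice/memory.events", O_RDONLY);
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	char buf[512];
	ssize_t n;

	if (fd < 0)
		return 1;

	while (poll(&pfd, 1, -1) >= 0) {
		n = pread(fd, buf, sizeof(buf) - 1, 0);
		if (n < 0)
			break;
		buf[n] = '\0';
		/*
		 * Compare against the previously seen counters here and,
		 * if something changed, scan the subtree for the origin.
		 */
		fputs(buf, stdout);
	}
	return 0;
}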
On Fri 25-01-19 10:28:08, Tejun Heo wrote: > Hello, Michal. > > On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote: > > > What if a user wants to monitor any ooms in the subtree tho, which is > > > a valid use case? > > > > How is that information useful without know which memcg the oom applies > > to? > > For example, a workload manager watching over a subtree for a job with > nested memory limits set by the job itself. It wants to take action > (reporting and possibly other remediative actions) when something goes > wrong in the delegated subtree but isn't involved in how the subtree > is configured inside. Yes, I understand this part, but it is not clear to me, _how_ to report anything sensible without knowing _what_ has caused the event. You can walk the cgroup hierarchy and compare cached results with new ones but this is a) racy and b) clumsy. > > > If local event monitoring is useful and it can be, > > > let's add separate events which are clearly identifiable to be local. > > > Right now, it's confusing like hell. > > > > From a backward compatible POV it should be a new interface added. > > That sure is an option for use cases like above but it has the > downside of carrying over the confusing interface into the indefinite > future. I actually believe that this is not such a big deal. For one thing the current events are actually helpful to watch the reclaim/setup behavior. > Again, I'd like to point back at how we changed the > accounting write and trim accounting because the benefits outweighted > the risks. > > > Please note that I understand that this might be confusing with the rest > > of the cgroup APIs but considering that this is the first time somebody > > is actually complaining and the interface is "production ready" for more > > than three years I am not really sure the situation is all that bad. > > cgroup2 uptake hasn't progressed that fast. None of the major distros > or container frameworks are currently shipping with it although many > are evaluating switching. I don't think I'm too mistaken in that we > (FB) are at the bleeding edge in terms of adopting cgroup2 and its > various new features and are hitting these corner cases and oversights > in the process. If there are noticeable breakages arising from this > change, we sure can backpaddle but I think the better course of action > is fixing them up while we can. I do not really think you can go back. You cannot simply change semantic back and forth because you just break new users. Really, I do not see the semantic changing after more than 3 years of production ready interface. If you really believe we need a hierarchical notification mechanism for the reclaim activity then add a new one.
Hello, Michal. On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote: > > For example, a workload manager watching over a subtree for a job with > > nested memory limits set by the job itself. It wants to take action > > (reporting and possibly other remediative actions) when something goes > > wrong in the delegated subtree but isn't involved in how the subtree > > is configured inside. > > Yes, I understand this part, but it is not clear to me, _how_ to report > anything sensible without knowing _what_ has caused the event. You can > walk the cgroup hierarchy and compare cached results with new ones but > this is a) racy and b) clumsy. All .events files generate aggregated stateful notifications. For anyone to do anything, they'd have to remember the previous state to identify what actually happened. Being hierarchical, it'd of course need to walk down when an event triggers. > > That sure is an option for use cases like above but it has the > > downside of carrying over the confusing interface into the indefinite > > future. > > I actually believe that this is not such a big deal. For one thing the > current events are actually helpful to watch the reclaim/setup behavior. Sure, it isn't something critical. It's just confusing and I think it'd be better to improve. > I do not really think you can go back. You cannot simply change semantic > back and forth because you just break new users. > > Really, I do not see the semantic changing after more than 3 years of > production ready interface. If you really believe we need a hierarchical > notification mechanism for the reclaim activity then add a new one. I don't see it as black and white as you do. Let's agree to disagree. I'll ack the patch and note the disagreement. Thanks.
On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote: > memory.stat and other files already consider subtrees in their output, > and we should too in order to not present an inconsistent interface. > > The current situation is fairly confusing, because people interacting > with cgroups expect hierarchical behaviour in the vein of memory.stat, > cgroup.events, and other files. For example, this causes confusion when > debugging reclaim events under low, as currently these always read "0" > at non-leaf memcg nodes, which frequently causes people to misdiagnose > breach behaviour. The same confusion applies to other counters in this > file when debugging issues. > > Aggregation is done at write time instead of at read-time since these > counters aren't hot (unlike memory.stat which is per-page, so it does it > at read time), and it makes sense to bundle this with the file > notifications. ... > Signed-off-by: Chris Down <chris@chrisdown.name> > Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Tejun Heo <tj@kernel.org> Michal has a valid counterpoint that this is a change in userland visible behavior but to me this patch seems to be more of a bug fix than anything else in that it's addressing an obvious inconsistency in the interface. Thanks.
On Mon 28-01-19 06:28:16, Tejun Heo wrote: > Hello, Michal. > > On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote: > > > For example, a workload manager watching over a subtree for a job with > > > nested memory limits set by the job itself. It wants to take action > > > (reporting and possibly other remediative actions) when something goes > > > wrong in the delegated subtree but isn't involved in how the subtree > > > is configured inside. > > > > Yes, I understand this part, but it is not clear to me, _how_ to report > > anything sensible without knowing _what_ has caused the event. You can > > walk the cgroup hierarchy and compare cached results with new ones but > > this is a) racy and b) clumsy. > > All .events files generate aggregated stateful notifications. For > anyone to do anything, they'd have to remember the previous state to > identify what actually happened. Being hierarchical, it'd of course > need to walk down when an event triggers. And how do you do that in a raceless fashion? > > > That sure is an option for use cases like above but it has the > > > downside of carrying over the confusing interface into the indefinite > > > future. > > > > I actually believe that this is not such a big deal. For one thing the > > current events are actually helpful to watch the reclaim/setup behavior. > > Sure, it isn't something critical. It's just confusing and I think > it'd be better to improve. > > > I do not really think you can go back. You cannot simply change semantic > > back and forth because you just break new users. > > > > Really, I do not see the semantic changing after more than 3 years of > > production ready interface. If you really believe we need a hierarchical > > notification mechanism for the reclaim activity then add a new one. > > I don't see it as black and white as you do. Let's agree to disagree. > I'll ack the patch and note the disagreement. Considering the justification behhind this change I really do not see other option than nack this change. There is simply no _strong_ reason to change the behavior. Even if the current behavior is confusing, the documentation can be improved to be more specific. If there is a strong demand for hierarchical reporting then add a new interface. But I have to say that I would consider such a reporting clumsy at best.
Hello, On Mon, Jan 28, 2019 at 03:52:10PM +0100, Michal Hocko wrote: > > All .events files generate aggregated stateful notifications. For > > anyone to do anything, they'd have to remember the previous state to > > identify what actually happened. Being hierarchical, it'd of course > > need to walk down when an event triggers. > > And how do you do that in a raceless fashion? Hmm... I'm having trouble imagining why this would be a problem. How would it race? Thanks.
On Mon 28-01-19 06:54:07, Tejun Heo wrote: > Hello, > > On Mon, Jan 28, 2019 at 03:52:10PM +0100, Michal Hocko wrote: > > > All .events files generate aggregated stateful notifications. For > > > anyone to do anything, they'd have to remember the previous state to > > > identify what actually happened. Being hierarchical, it'd of course > > > need to walk down when an event triggers. > > > > And how do you do that in a raceless fashion? > > Hmm... I'm having trouble imagining why this would be a problem. How > would it race? How do you make an atomic snapshot of the hierarchy state? Or you do not need it because event counters are monotonic and you are willing to sacrifice some lost or misinterpreted events? For example, you receive an oom event while the two children increase the oom event counter. How do you tell which one was the source of the event and which one is still pending? Or is the ordering unimportant in general? I can imagine you can live with this model, but having a hierarchical reporting without a source of the event just sounds too clumsy from my POV. But I guess this is getting tangent to the original patch.
Hello, Michal.

On Mon, Jan 28, 2019 at 04:18:59PM +0100, Michal Hocko wrote:
> How do you make an atomic snapshot of the hierarchy state? Or you do
> not need it because event counters are monotonic and you are willing to
> sacrifice some lost or misinterpreted events? For example, you receive
> an oom event while the two children increase the oom event counter. How
> do you tell which one was the source of the event and which one is still
> pending? Or is the ordering unimportant in general?

Hmm... This is straightforward stateful notification. Imagine the
following hierarchy. The numbers are the notification counters.

      A:0
     /   \
   B:0   C:0

Let's say B generates an event, soon followed by C. If A's counter is
read after both B and C's events, nothing is missed.

Let's say it ends up generating two notifications and we end up walking
down in between B and C's events. It would look like the following.

      A:1
     /   \
   B:1   C:0

We first see A's 0 -> 1 and then start scanning the subtrees to find out
the origin. We will notice B, but let's say we visit C before C's event
gets registered (otherwise, nothing is missed). But, no matter where you
put C's event and notification, the following hold.

1. A's count will be different from what was seen before.

2. There will be another notification queued on A.

IOW, it's guaranteed that we'll notice and re-scan if we don't see C's
event this time. The worst that can happen is scanning spuriously, but
that's true even for local events.

This isn't a novel thing. It's how aggregated stateful notifications
usually work (e.g. a lot of hardware interrupts behave this way). The
notification is just saying "something might have changed here, please
take a look" and the interlocking is achieved by following specific
orders when propagating and reading the events.

> I can imagine you can live with this model, but having a hierarchical
> reporting without a source of the event just sounds too clumsy from my
> POV. But I guess this is getting tangent to the original patch.

It seems like your opinion is mostly based on misunderstanding. Let's
keep the discussion focused on API stability.

Thanks.
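[An illustrative sketch of the stateful re-scan described above, using Tejun's A/B/C example. The in-memory tree is hypothetical; a real monitor would read the counts from each cgroup's memory.events file and handle concurrent updates, which this sketch glosses over.]

#include <stdio.h>

struct node {
	const char *name;
	long count;		/* current hierarchical event count */
	long seen;		/* count recorded at the previous scan */
	struct node **children;
	int nr_children;
};

static void handle_event(struct node *n)
{
	printf("event originated at %s\n", n->name);	/* report/remediate */
}

static void scan(struct node *n)
{
	long delta = n->count - n->seen;
	int i;

	if (delta <= 0)
		return;			/* nothing new in this subtree */
	n->seen = n->count;

	for (i = 0; i < n->nr_children; i++) {
		delta -= n->children[i]->count - n->children[i]->seen;
		scan(n->children[i]);
	}

	if (delta > 0)			/* remainder originated at this node */
		handle_event(n);
}

int main(void)
{
	/* B's event has propagated to A; C's has not been registered yet */
	struct node b = { .name = "B", .count = 1 };
	struct node c = { .name = "C", .count = 0 };
	struct node *kids[] = { &b, &c };
	struct node a = { .name = "A", .count = 1,
			  .children = kids, .nr_children = 2 };

	scan(&a);	/* finds B; C's later event re-notifies A and triggers a re-scan */
	return 0;
}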
Hi Tejun,

On Fri, Jan 25, 2019 at 10:28 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Michal.
>
> On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote:
> > > What if a user wants to monitor any ooms in the subtree tho, which is
> > > a valid use case?
> >
> > How is that information useful without knowing which memcg the oom
> > applies to?
>
> For example, a workload manager watching over a subtree for a job with
> nested memory limits set by the job itself. It wants to take action
> (reporting and possibly other remediative actions) when something goes
> wrong in the delegated subtree but isn't involved in how the subtree
> is configured inside.
>

Why not make this configurable at the delegation boundary? As you
mentioned, there are jobs that want a centralized workload manager to
watch over their subtrees, while there can be jobs which want to monitor
their subtrees themselves. For example, I can have a job which knows how
to act when one of its children cgroups goes OOM. However, if the root
of that job goes OOM then the centralized workload manager should do
something about it. With this change, how do we implement this scenario?
How will the central manager differentiate between a subtree of the job
going OOM and the root of that job going OOM? I guess from the
discussion it seems like the centralized manager has to traverse that
job's subtree to find the source of the OOM.

Why can't we make the implementation of the centralized manager easier
by allowing the propagation of these notifications across the delegation
boundary to be configured?

thanks,
Shakeel
Hello, Shakeel. On Mon, Jan 28, 2019 at 07:59:33AM -0800, Shakeel Butt wrote: > Why not make this configurable at the delegation boundary? As you > mentioned, there are jobs who want centralized workload manager to > watch over their subtrees while there can be jobs which want to > monitor their subtree themselves. For example I can have a job which > know how to act when one of the children cgroup goes OOM. However if > the root of that job goes OOM then the centralized workload manager > should do something about it. With this change, how to implement this > scenario? How will the central manager differentiates between that a > subtree of a job goes OOM or the root of that job? I guess from the > discussion it seems like the centralized manager has to traverse that > job's subtree to find the source of OOM. > > Why can't we let the implementation of centralized manager easier by > allowing to configure the propagation of these notifications across > delegation boundary. I think the right way to achieve the above would be having separate recursive and local counters. Thanks.
Hi Tejun, On Mon, Jan 28, 2019 at 8:05 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, Shakeel. > > On Mon, Jan 28, 2019 at 07:59:33AM -0800, Shakeel Butt wrote: > > Why not make this configurable at the delegation boundary? As you > > mentioned, there are jobs who want centralized workload manager to > > watch over their subtrees while there can be jobs which want to > > monitor their subtree themselves. For example I can have a job which > > know how to act when one of the children cgroup goes OOM. However if > > the root of that job goes OOM then the centralized workload manager > > should do something about it. With this change, how to implement this > > scenario? How will the central manager differentiates between that a > > subtree of a job goes OOM or the root of that job? I guess from the > > discussion it seems like the centralized manager has to traverse that > > job's subtree to find the source of OOM. > > > > Why can't we let the implementation of centralized manager easier by > > allowing to configure the propagation of these notifications across > > delegation boundary. > > I think the right way to achieve the above would be having separate > recursive and local counters. > Do you envision a separate interface/file for recursive and local counters? That would make notifications simpler but that is an additional interface. Shakeel
Hello, On Mon, Jan 28, 2019 at 08:08:26AM -0800, Shakeel Butt wrote: > Do you envision a separate interface/file for recursive and local > counters? That would make notifications simpler but that is an > additional interface. I need to think more about it but my first throught is that a separate file would make more sense given that separating notifications could be useful. Thanks.
On Mon 28-01-19 07:41:50, Tejun Heo wrote: > Hello, Michal. > > On Mon, Jan 28, 2019 at 04:18:59PM +0100, Michal Hocko wrote: > > How do you make an atomic snapshot of the hierarchy state? Or you do > > not need it because event counters are monotonic and you are willing to > > sacrifice some lost or misinterpreted events? For example, you receive > > an oom event while the two children increase the oom event counter. How > > do you tell which one was the source of the event and which one is still > > pending? Or is the ordering unimportant in general? > > Hmm... This is straightforward stateful notification. Imagine the > following hierarchy. The numbers are the notification counters. > > A:0 > / \ > B:0 C:0 > > Let's say B generates an event, soon followed by C. If A's counter is > read after both B and C's events, nothing is missed. > > Let's say it ends up generating two notifications and we end up > walking down inbetween B and C's events. It would look like the > following. > > A:1 > / \ > B:1 C:0 > > We first see A's 0 -> 1 and then start scanning the subtrees to find > out the origin. We will notice B but let's say we visit C before C's > event gets registered (otherwise, nothing is missed). Yeah, that is quite clear. But it also assumes that the hierarchy is pretty stable but cgroups might go away at any time. I am not saying that the aggregated events are not useful I am just saying that it is quite non-trivial to use and catch all potential corner cases. Maybe I am overcomplicating it but one thing is quite clear to me. The existing semantic is really useful to watch for the reclaim behavior at the current level of the tree. You really do not have to care what is happening in the subtree when it is clear that the workload itself is underprovisioned etc. Considering that such a semantic already existis, somebody might depend on it and we likely want also aggregated semantic then I really do not see why to risk regressions rather than add a new memory.hierarchy_events and have both.
Hello, Michal. On Mon, Jan 28, 2019 at 06:05:26PM +0100, Michal Hocko wrote: > Yeah, that is quite clear. But it also assumes that the hierarchy is > pretty stable but cgroups might go away at any time. I am not saying > that the aggregated events are not useful I am just saying that it is > quite non-trivial to use and catch all potential corner cases. Maybe I It really isn't complicated and doesn't require stable subtree. > am overcomplicating it but one thing is quite clear to me. The existing > semantic is really useful to watch for the reclaim behavior at the > current level of the tree. You really do not have to care what is > happening in the subtree when it is clear that the workload itself > is underprovisioned etc. Considering that such a semantic already > existis, somebody might depend on it and we likely want also aggregated > semantic then I really do not see why to risk regressions rather than > add a new memory.hierarchy_events and have both. The problem then is that most other things are hierarchical including some fields in .events files, so if we try to add local stats and events, there's no good way to add them. Thanks.
On Mon 28-01-19 09:49:05, Tejun Heo wrote: > Hello, Michal. > > On Mon, Jan 28, 2019 at 06:05:26PM +0100, Michal Hocko wrote: > > Yeah, that is quite clear. But it also assumes that the hierarchy is > > pretty stable but cgroups might go away at any time. I am not saying > > that the aggregated events are not useful I am just saying that it is > > quite non-trivial to use and catch all potential corner cases. Maybe I > > It really isn't complicated and doesn't require stable subtree. > > > am overcomplicating it but one thing is quite clear to me. The existing > > semantic is really useful to watch for the reclaim behavior at the > > current level of the tree. You really do not have to care what is > > happening in the subtree when it is clear that the workload itself > > is underprovisioned etc. Considering that such a semantic already > > existis, somebody might depend on it and we likely want also aggregated > > semantic then I really do not see why to risk regressions rather than > > add a new memory.hierarchy_events and have both. > > The problem then is that most other things are hierarchical including > some fields in .events files, so if we try to add local stats and > events, there's no good way to add them. All memcg events are represented non-hierarchical AFAICS memcg_memory_event() simply accounts at the level when it happens. Or do I miss something? Or are you talking about .events files for other controllers?
Hello, On Tue, Jan 29, 2019 at 03:43:06PM +0100, Michal Hocko wrote: > All memcg events are represented non-hierarchical AFAICS > memcg_memory_event() simply accounts at the level when it happens. Or do > I miss something? Or are you talking about .events files for other > controllers? Yeah, cgroup.events and .stat files as some of the local stats would be useful too, so if we don't flip memory.events we'll end up with sth like cgroup.events.local, memory.events.tree and memory.stats.local, which is gonna be hilarious. If you aren't willing to change your mind, the only option seems to be introducing a mount option to gate the flip and additions of local files. Most likely, userspace will enable the option by default everywhere, so the end result will be exactly the same but I guess it'll better address your concern. Thanks.
On Tue 29-01-19 06:52:40, Tejun Heo wrote: > Hello, > > On Tue, Jan 29, 2019 at 03:43:06PM +0100, Michal Hocko wrote: > > All memcg events are represented non-hierarchical AFAICS > > memcg_memory_event() simply accounts at the level when it happens. Or do > > I miss something? Or are you talking about .events files for other > > controllers? > > Yeah, cgroup.events and .stat files as some of the local stats would > be useful too, so if we don't flip memory.events we'll end up with sth > like cgroup.events.local, memory.events.tree and memory.stats.local, > which is gonna be hilarious. Why cannot we simply have memory.events_tree and be done with it? Sure the file names are not goin to be consistent which is a minus but that ship has already sailed some time ago. > If you aren't willing to change your mind, the only option seems to be > introducing a mount option to gate the flip and additions of local > files. Most likely, userspace will enable the option by default > everywhere, so the end result will be exactly the same but I guess > it'll better address your concern. How does the consumer of the API learns about the mount type?
Hello, Michal. On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote: > > Yeah, cgroup.events and .stat files as some of the local stats would > > be useful too, so if we don't flip memory.events we'll end up with sth > > like cgroup.events.local, memory.events.tree and memory.stats.local, > > which is gonna be hilarious. > > Why cannot we simply have memory.events_tree and be done with it? Sure > the file names are not goin to be consistent which is a minus but that > ship has already sailed some time ago. Because the overall cost of shitty interface will be way higher in the longer term. cgroup2 interface is far from perfect but is way better than cgroup1 especially for the memory controller. Why do you think that is? > > If you aren't willing to change your mind, the only option seems to be > > introducing a mount option to gate the flip and additions of local > > files. Most likely, userspace will enable the option by default > > everywhere, so the end result will be exactly the same but I guess > > it'll better address your concern. > > How does the consumer of the API learns about the mount type? It's gonna be a mount flag exposed in /sys/kernel/cgroup/features. Thanks.
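[For illustration only: a consumer could test /sys/kernel/cgroup/features for such a flag roughly as below. The token name used in main() is made up; whichever name the mount option ends up with would be the thing to look for.]

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Check whether the kernel advertises a given cgroup feature token. */
static bool cgroup_has_feature(const char *token)
{
	char word[64];
	bool found = false;
	FILE *f = fopen("/sys/kernel/cgroup/features", "r");

	if (!f)
		return false;	/* file absent on older kernels */

	while (fscanf(f, "%63s", word) == 1) {
		if (!strcmp(word, token)) {
			found = true;
			break;
		}
	}
	fclose(f);
	return found;
}

int main(void)
{
	/* "memory_events_tree" is a hypothetical token name */
	printf("%d\n", cgroup_has_feature("memory_events_tree"));
	return 0;
}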
On Wed 30-01-19 09:06:58, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote:
> > > Yeah, cgroup.events and .stat files as some of the local stats would
> > > be useful too, so if we don't flip memory.events we'll end up with sth
> > > like cgroup.events.local, memory.events.tree and memory.stats.local,
> > > which is gonna be hilarious.
> >
> > Why cannot we simply have memory.events_tree and be done with it? Sure
> > the file names are not goin to be consistent which is a minus but that
> > ship has already sailed some time ago.
>
> Because the overall cost of shitty interface will be way higher in the
> longer term.

But we are effectively discussing the file name. I do not see a long
term maintenance burden. Confusing? Probably yes, but that is where the
documentation would be helpful.
On Wed, Jan 30, 2019 at 06:41:17PM +0100, Michal Hocko wrote:
> But we are effectively discussing the file name. I do not see a long
> term maintenance burden. Confusing? Probably yes, but that is where the

Cost on user side.

> documentation would be helpful.

which is a far worse option with way higher total cost.

If you aren't against making it a mount option (and you shouldn't), I
think that'd be the best use of time for both of us.

Thanks.
On Wed 30-01-19 09:52:22, Tejun Heo wrote: > On Wed, Jan 30, 2019 at 06:41:17PM +0100, Michal Hocko wrote: > > But we are discussing the file name effectively. I do not see a long > > term maintenance burden. Confusing? Probably yes but that is were the > > Cost on user side. > > > documentation would be helpful. > > which is an a lot worse option with way higher total cost. And how exactly does the cost get any smaller with a mount option. The consumer of the API will likely not know there are two modes and check for it to figure out how to interpret those values. Or maybe I still do not follow how exactly is the mount option supposed to work.
Hi Tejun, On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, Michal. > > On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote: > > > Yeah, cgroup.events and .stat files as some of the local stats would > > > be useful too, so if we don't flip memory.events we'll end up with sth > > > like cgroup.events.local, memory.events.tree and memory.stats.local, > > > which is gonna be hilarious. > > > > Why cannot we simply have memory.events_tree and be done with it? Sure > > the file names are not goin to be consistent which is a minus but that > > ship has already sailed some time ago. > > Because the overall cost of shitty interface will be way higher in the > longer term. cgroup2 interface is far from perfect but is way better > than cgroup1 especially for the memory controller. Why do you think > that is? > I thought you are fine with the separate interface for the hierarchical events. https://lkml.kernel.org/r/20190128161201.GS50184@devbig004.ftw2.facebook.com Is that not the case? Shakeel
On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote: > On Fri 25-01-19 10:28:08, Tejun Heo wrote: > > On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote: > > > Please note that I understand that this might be confusing with the rest > > > of the cgroup APIs but considering that this is the first time somebody > > > is actually complaining and the interface is "production ready" for more > > > than three years I am not really sure the situation is all that bad. > > > > cgroup2 uptake hasn't progressed that fast. None of the major distros > > or container frameworks are currently shipping with it although many > > are evaluating switching. I don't think I'm too mistaken in that we > > (FB) are at the bleeding edge in terms of adopting cgroup2 and its > > various new features and are hitting these corner cases and oversights > > in the process. If there are noticeable breakages arising from this > > change, we sure can backpaddle but I think the better course of action > > is fixing them up while we can. > > I do not really think you can go back. You cannot simply change semantic > back and forth because you just break new users. > > Really, I do not see the semantic changing after more than 3 years of > production ready interface. If you really believe we need a hierarchical > notification mechanism for the reclaim activity then add a new one. This discussion needs to be more nuanced. We change interfaces and user-visible behavior all the time when we think nobody is likely to rely on it. Sometimes we change them after decades of established behavior - for example the recent OOM killer change to not kill children over parents. The argument was made that it's very unlikely that we break any existing user setups relying specifically on this behavior we are trying to fix. I don't see a real dispute to this, other than a repetition of "we can't change it after three years". I also don't see a concrete description of a plausible scenario that this change might break. I would like to see a solid case for why this change is a notable risk to actual users (interface age is not a criterium for other changes) before discussing errata solutions.
On Wed, Jan 30, 2019 at 11:11:44AM -0800, Shakeel Butt wrote: > Hi Tejun, > > On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote: > > > > Hello, Michal. > > > > On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote: > > > > Yeah, cgroup.events and .stat files as some of the local stats would > > > > be useful too, so if we don't flip memory.events we'll end up with sth > > > > like cgroup.events.local, memory.events.tree and memory.stats.local, > > > > which is gonna be hilarious. > > > > > > Why cannot we simply have memory.events_tree and be done with it? Sure > > > the file names are not goin to be consistent which is a minus but that > > > ship has already sailed some time ago. > > > > Because the overall cost of shitty interface will be way higher in the > > longer term. cgroup2 interface is far from perfect but is way better > > than cgroup1 especially for the memory controller. Why do you think > > that is? > > > > I thought you are fine with the separate interface for the hierarchical events. Every other file in cgroup2 is hierarchical, but for recursive memory.events you'd need to read memory.events_tree? Do we hate our users that much? :(
On Wed, Jan 30, 2019 at 02:27:12PM -0500, Johannes Weiner wrote: > On Wed, Jan 30, 2019 at 11:11:44AM -0800, Shakeel Butt wrote: > > Hi Tejun, > > > > On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > Hello, Michal. > > > > > > On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote: > > > > > Yeah, cgroup.events and .stat files as some of the local stats would > > > > > be useful too, so if we don't flip memory.events we'll end up with sth > > > > > like cgroup.events.local, memory.events.tree and memory.stats.local, > > > > > which is gonna be hilarious. > > > > > > > > Why cannot we simply have memory.events_tree and be done with it? Sure > > > > the file names are not goin to be consistent which is a minus but that > > > > ship has already sailed some time ago. > > > > > > Because the overall cost of shitty interface will be way higher in the > > > longer term. cgroup2 interface is far from perfect but is way better > > > than cgroup1 especially for the memory controller. Why do you think > > > that is? > > > > > > > I thought you are fine with the separate interface for the hierarchical events. > > Every other file in cgroup2 is hierarchical, but for recursive > memory.events you'd need to read memory.events_tree? > > Do we hate our users that much? :( FTR, I would be okay with adding .local versions to existing files where such a behavior could be useful. But that seems to be a separate discussion from fixing memory.events here.
On Wed, Jan 30, 2019 at 11:30 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Jan 30, 2019 at 02:27:12PM -0500, Johannes Weiner wrote: > > On Wed, Jan 30, 2019 at 11:11:44AM -0800, Shakeel Butt wrote: > > > Hi Tejun, > > > > > > On Wed, Jan 30, 2019 at 9:07 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > > > Hello, Michal. > > > > > > > > On Wed, Jan 30, 2019 at 05:50:58PM +0100, Michal Hocko wrote: > > > > > > Yeah, cgroup.events and .stat files as some of the local stats would > > > > > > be useful too, so if we don't flip memory.events we'll end up with sth > > > > > > like cgroup.events.local, memory.events.tree and memory.stats.local, > > > > > > which is gonna be hilarious. > > > > > > > > > > Why cannot we simply have memory.events_tree and be done with it? Sure > > > > > the file names are not goin to be consistent which is a minus but that > > > > > ship has already sailed some time ago. > > > > > > > > Because the overall cost of shitty interface will be way higher in the > > > > longer term. cgroup2 interface is far from perfect but is way better > > > > than cgroup1 especially for the memory controller. Why do you think > > > > that is? > > > > > > > > > > I thought you are fine with the separate interface for the hierarchical events. > > > > Every other file in cgroup2 is hierarchical, but for recursive > > memory.events you'd need to read memory.events_tree? > > > > Do we hate our users that much? :( > > FTR, I would be okay with adding .local versions to existing files > where such a behavior could be useful. But that seems to be a separate > discussion from fixing memory.events here. Oh ok, the dispute is on the name of the interface. I am fine with whatever the decision is made as we (Google) are still not using these interfaces. However what's the way forward here? thanks, Shakeel
On Wed 30-01-19 14:23:45, Johannes Weiner wrote: > On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote: > > On Fri 25-01-19 10:28:08, Tejun Heo wrote: > > > On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote: > > > > Please note that I understand that this might be confusing with the rest > > > > of the cgroup APIs but considering that this is the first time somebody > > > > is actually complaining and the interface is "production ready" for more > > > > than three years I am not really sure the situation is all that bad. > > > > > > cgroup2 uptake hasn't progressed that fast. None of the major distros > > > or container frameworks are currently shipping with it although many > > > are evaluating switching. I don't think I'm too mistaken in that we > > > (FB) are at the bleeding edge in terms of adopting cgroup2 and its > > > various new features and are hitting these corner cases and oversights > > > in the process. If there are noticeable breakages arising from this > > > change, we sure can backpaddle but I think the better course of action > > > is fixing them up while we can. > > > > I do not really think you can go back. You cannot simply change semantic > > back and forth because you just break new users. > > > > Really, I do not see the semantic changing after more than 3 years of > > production ready interface. If you really believe we need a hierarchical > > notification mechanism for the reclaim activity then add a new one. > > This discussion needs to be more nuanced. > > We change interfaces and user-visible behavior all the time when we > think nobody is likely to rely on it. Sometimes we change them after > decades of established behavior - for example the recent OOM killer > change to not kill children over parents. That is an implementation detail of kernel-internal functionality. Most changes in the kernel tend to have user-visible effects. This is not what we are discussing here. We are talking about a user-visible change in API semantics. And that is a completely different story. > The argument was made that it's very unlikely that we break any > existing user setups relying specifically on this behavior we are > trying to fix. I don't see a real dispute to this, other than a > repetition of "we can't change it after three years". > > I also don't see a concrete description of a plausible scenario that > this change might break. > > I would like to see a solid case for why this change is a notable risk > to actual users (interface age is not a criterium for other changes) > before discussing errata solutions. I thought I had already mentioned an example. Say you have an observer on top of a delegated cgroup hierarchy and you set up limits (e.g. a hard limit) on its root. If you get an OOM event, then you know that the whole hierarchy might be underprovisioned, and you perform some rebalancing. Now you really do not care that somewhere down the delegated tree there was an OOM. Such a spurious event would just confuse the monitoring and lead to wrong decisions.
On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote: > On Wed 30-01-19 14:23:45, Johannes Weiner wrote: > > On Mon, Jan 28, 2019 at 01:51:51PM +0100, Michal Hocko wrote: > > > On Fri 25-01-19 10:28:08, Tejun Heo wrote: > > > > On Fri, Jan 25, 2019 at 06:37:13PM +0100, Michal Hocko wrote: > > > > > Please note that I understand that this might be confusing with the rest > > > > > of the cgroup APIs but considering that this is the first time somebody > > > > > is actually complaining and the interface is "production ready" for more > > > > > than three years I am not really sure the situation is all that bad. > > > > > > > > cgroup2 uptake hasn't progressed that fast. None of the major distros > > > > or container frameworks are currently shipping with it although many > > > > are evaluating switching. I don't think I'm too mistaken in that we > > > > (FB) are at the bleeding edge in terms of adopting cgroup2 and its > > > > various new features and are hitting these corner cases and oversights > > > > in the process. If there are noticeable breakages arising from this > > > > change, we sure can backpaddle but I think the better course of action > > > > is fixing them up while we can. > > > > > > I do not really think you can go back. You cannot simply change semantic > > > back and forth because you just break new users. > > > > > > Really, I do not see the semantic changing after more than 3 years of > > > production ready interface. If you really believe we need a hierarchical > > > notification mechanism for the reclaim activity then add a new one. > > > > This discussion needs to be more nuanced. > > > > We change interfaces and user-visible behavior all the time when we > > think nobody is likely to rely on it. Sometimes we change them after > > decades of established behavior - for example the recent OOM killer > > change to not kill children over parents. > > That is an implementation detail of a kernel internal functionality. > Most of changes in the kernel tend to have user visible effects. This is > not what we are discussing here. We are talking about a change of user > visibile API semantic change. And that is a completely different story. I think drawing such a strong line between these two is a mistake. The critical thing is whether we change something real people rely on. It's possible somebody relies on the child killing behavior. But it's fairly unlikely, which is why it's okay to risk the change. > > The argument was made that it's very unlikely that we break any > > existing user setups relying specifically on this behavior we are > > trying to fix. I don't see a real dispute to this, other than a > > repetition of "we can't change it after three years". > > > > I also don't see a concrete description of a plausible scenario that > > this change might break. > > > > I would like to see a solid case for why this change is a notable risk > > to actual users (interface age is not a criterium for other changes) > > before discussing errata solutions. > > I thought I have already mentioned an example. Say you have an observer > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard > limit) on the root of it. If you get an OOM event then you know that the > whole hierarchy might be underprovisioned and perform some rebalancing. > Now you really do not care that somewhere down the delegated tree there > was an oom. Such a spurious event would just confuse the monitoring and > lead to wrong decisions. 
You can construct a use case like this, as per above with OOM, but it's incredibly unlikely for something like this to exist. There is plenty of evidence on adoption rate that supports this: we know where the big names in containerization are; we see the things we run into that have not been reported yet, etc. Compare this to real problems this has already caused for us. Multi-level control and monitoring is a fundamental concept of the cgroup design, so naturally our infrastructure doesn't monitor and log at the individual job level (too much data, and also kind of pointless when the jobs are identical) but at aggregate parental levels. Because of this wart, we have missed problematic configurations when the low, high, max events were not propagated as expected (we log oom separately, so we still noticed those). Even once we knew about it, we had trouble tracking these configurations down for the same reason - the data isn't logged, and won't be logged, at this level. Adding a separate, hierarchical file would solve this one particular problem for us, but it wouldn't fix this pitfall for all future users of cgroup2 (which by all available evidence is still most of them) and would be a wart on the interface that we'd carry forever. Adding a note in cgroup-v2.txt doesn't make up for the fact that this behavior flies in the face of basic UX concepts that underlie the hierarchical monitoring and control idea of the cgroup2fs. The fact that the current behavior MIGHT HAVE a valid application does not mean that THIS FILE should be providing it. It IS NOT an argument against this patch here, just an argument for a separate patch that adds this functionality in a way that is consistent with the rest of the interface (e.g. systematically adding .local files). The current semantics have real costs to real users. You cannot dismiss them or handwave them away with a hypothetical regression. I would really ask you to consider the real-world usage and adoption data we have on cgroup2, rather than insist on a black-and-white answer to this situation.
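As a minimal sketch of the parent-level monitoring described above (the path and the choice of the "max" counter are only examples), a monitor only has to parse the delegation point's memory.events; the value covers the whole subtree only once the counters are propagated upwards as this patch proposes:

#include <stdio.h>
#include <string.h>

/* Return the named counter from <cgroup>/memory.events, or -1 on error. */
static long read_memcg_event(const char *cgroup, const char *name)
{
	char path[512], key[64];
	long val;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.events", cgroup);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &val) == 2) {
		if (!strcmp(key, name)) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	/* "system.slice" stands in for the delegation point being watched. */
	long max = read_memcg_event("/sys/fs/cgroup/system.slice", "max");

	printf("max events in subtree: %ld\n", max);
	return 0;
}

Without propagation, the same check would have to open and sum memory.events for every descendant cgroup, and keep a watch on each of them.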
On Wed 30-01-19 16:31:31, Johannes Weiner wrote: > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote: [...] > > I thought I have already mentioned an example. Say you have an observer > > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard > > limit) on the root of it. If you get an OOM event then you know that the > > whole hierarchy might be underprovisioned and perform some rebalancing. > > Now you really do not care that somewhere down the delegated tree there > > was an oom. Such a spurious event would just confuse the monitoring and > > lead to wrong decisions. > > You can construct a usecase like this, as per above with OOM, but it's > incredibly unlikely for something like this to exist. There is plenty > of evidence on adoption rate that supports this: we know where the big > names in containerization are; we see the things we run into that have > not been reported yet etc. > > Compare this to real problems this has already caused for > us. Multi-level control and monitoring is a fundamental concept of the > cgroup design, so naturally our infrastructure doesn't monitor and log > at the individual job level (too much data, and also kind of pointless > when the jobs are identical) but at aggregate parental levels. > > Because of this wart, we have missed problematic configurations when > the low, high, max events were not propagated as expected (we log oom > separately, so we still noticed those). Even once we knew about it, we > had trouble tracking these configurations down for the same reason - > the data isn't logged, and won't be logged, at this level. Yes, I do understand that you might be interested in the hierarchical accounting. > Adding a separate, hierarchical file would solve this one particular > problem for us, but it wouldn't fix this pitfall for all future users > of cgroup2 (which by all available evidence is still most of them) and > would be a wart on the interface that we'd carry forever. I understand even this reasoning, but if I have to choose between a risk of user breakage that would require reimplementing the monitoring and an API inconsistency, I vote for the first option. It is unfortunate, but this is the way we deal with APIs and compatibility. > Adding a note in cgroup-v2.txt doesn't make up for the fact that this > behavior flies in the face of basic UX concepts that underly the > hierarchical monitoring and control idea of the cgroup2fs. > > The fact that the current behavior MIGHT HAVE a valid application does > not mean that THIS FILE should be providing it. It IS NOT an argument > against this patch here, just an argument for a separate patch that > adds this functionality in a way that is consistent with the rest of > the interface (e.g. systematically adding .local files). > > The current semantics have real costs to real users. You cannot > dismiss them or handwave them away with a hypothetical regression. > > I would really ask you to consider the real world usage and adoption > data we have on cgroup2, rather than insist on a black and white > answer to this situation. Those users requiring the hierarchical behavior can use the new file without any risk of breakage, so I really do not see why we should undertake the risk and do it the other way around.
On Thu, Jan 31, 2019 at 09:58:08AM +0100, Michal Hocko wrote: > On Wed 30-01-19 16:31:31, Johannes Weiner wrote: > > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote: > [...] > > > I thought I have already mentioned an example. Say you have an observer > > > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard > > > limit) on the root of it. If you get an OOM event then you know that the > > > whole hierarchy might be underprovisioned and perform some rebalancing. > > > Now you really do not care that somewhere down the delegated tree there > > > was an oom. Such a spurious event would just confuse the monitoring and > > > lead to wrong decisions. > > > > You can construct a usecase like this, as per above with OOM, but it's > > incredibly unlikely for something like this to exist. There is plenty > > of evidence on adoption rate that supports this: we know where the big > > names in containerization are; we see the things we run into that have > > not been reported yet etc. > > > > Compare this to real problems this has already caused for > > us. Multi-level control and monitoring is a fundamental concept of the > > cgroup design, so naturally our infrastructure doesn't monitor and log > > at the individual job level (too much data, and also kind of pointless > > when the jobs are identical) but at aggregate parental levels. > > > > Because of this wart, we have missed problematic configurations when > > the low, high, max events were not propagated as expected (we log oom > > separately, so we still noticed those). Even once we knew about it, we > > had trouble tracking these configurations down for the same reason - > > the data isn't logged, and won't be logged, at this level. > > Yes, I do understand that you might be interested in the hierarchical > accounting. > > > Adding a separate, hierarchical file would solve this one particular > > problem for us, but it wouldn't fix this pitfall for all future users > > of cgroup2 (which by all available evidence is still most of them) and > > would be a wart on the interface that we'd carry forever. > > I understand even this reasoning but if I have to chose between a risk > of user breakage that would require to reimplement the monitoring or an > API incosistency I vote for the first option. It is unfortunate but this > is the way we deal with APIs and compatibility. I don't know why you keep repeating this, it's simply not how Linux API is maintained in practice. In cgroup2, we fixed io.stat to not conflate discard IO and write IO: 636620b66d5d4012c4a9c86206013964d3986c4f Linus changed the Vmalloc field semantics in /proc/meminfo after over a decade, without a knob to restore it in production: If this breaks anything, we'll obviously have to re-introduce the code to compute this all and add the caching patches on top. But if given the option, I'd really prefer to just remove this bad idea entirely rather than add even more code to work around our historical mistake that likely nobody really cares about. a5ad88ce8c7fae7ddc72ee49a11a75aa837788e0 Mel changed the zone_reclaim_mode default behavior after over a decade: Those that require zone_reclaim_mode are likely to be able to detect when it needs to be enabled and tune appropriately so lets have a sensible default for the bulk of users. 
4f9b16a64753d0bb607454347036dc997fd03b82 Acked-by: Michal Hocko <mhocko@suse.cz> And then Mel changed the default zonelist ordering to pick saner behavior for most users, followed by a complete removal of the zone list ordering, after again, decades of existence of these things: commit c9bff3eebc09be23fbc868f5e6731666d23cbea3 Author: Michal Hocko <mhocko@suse.com> Date: Wed Sep 6 16:20:13 2017 -0700 mm, page_alloc: rip out ZONELIST_ORDER_ZONE And why did we do any of those things and risk user disruption every single time? Because the existing behavior was not a good default, a burden on people, and the risk of breakage was sufficiently low. I don't see how this case is different, and you haven't provided any arguments that would explain that. > > Adding a note in cgroup-v2.txt doesn't make up for the fact that this > > behavior flies in the face of basic UX concepts that underly the > > hierarchical monitoring and control idea of the cgroup2fs. > > > > The fact that the current behavior MIGHT HAVE a valid application does > > not mean that THIS FILE should be providing it. It IS NOT an argument > > against this patch here, just an argument for a separate patch that > > adds this functionality in a way that is consistent with the rest of > > the interface (e.g. systematically adding .local files). > > > > The current semantics have real costs to real users. You cannot > > dismiss them or handwave them away with a hypothetical regression. > > > > I would really ask you to consider the real world usage and adoption > > data we have on cgroup2, rather than insist on a black and white > > answer to this situation. > > Those users requiring the hierarchical beahvior can use the new file > without any risk of breakages so I really do not see why we should > undertake the risk and do it the other way around. Okay, so let's find a way forward here. 1. A new memory.events_tree file or similar. This would give us a way to get the desired hierarchical behavior. The downside is that it's suggesting that ${x} and ${x}_tree are the local and hierarchical versions of a cgroup file, and that's false everywhere else. Saying we would document it is a cop-out and doesn't actually make the interface less confusing (most people don't look at errata documentation until they've been burned by unexpected behavior). 2. A runtime switch (cgroup mount option, sysctl, what have you) that lets you switch between the local and the tree behavior. This would be able to provide the desired semantics in a clean interface, while still having the ability to support legacy users. 2a. A runtime switch that defaults to the local behavior. 2b. A runtime switch that defaults to the tree behavior. The choice between 2a and 2b comes down to how big we evaluate the risk that somebody has an existing dependency on the local behavior. Given what we know about cgroup2 usage, and considering our previous behavior in such matters, I'd say 2b is reasonable and in line with how we tend to handle these things. On the tiny chance that somebody is using the current behavior, they can flick the switch (until we add the .local files, or simply use the switch forever).
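For option 2, the compat switch could be a cgroup2 mount option. A minimal sketch, assuming a hypothetical option name - nothing here has settled on a name or an implementation - of how a legacy user would opt back into the local behavior:

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * "memory_localevents" is a hypothetical name for the compat
	 * switch (option 2) discussed above; no such option exists in
	 * this series.  Mounting cgroup2 with it would keep the old,
	 * non-hierarchical memory.events semantics.
	 */
	if (mount("none", "/sys/fs/cgroup", "cgroup2", 0,
		  "memory_localevents") < 0) {
		perror("mount cgroup2");
		return 1;
	}
	return 0;
}

The shell equivalent would be something like "mount -t cgroup2 -o memory_localevents none /sys/fs/cgroup", again with the option name purely illustrative.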
On Thu 31-01-19 11:22:48, Johannes Weiner wrote: > On Thu, Jan 31, 2019 at 09:58:08AM +0100, Michal Hocko wrote: > > On Wed 30-01-19 16:31:31, Johannes Weiner wrote: > > > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote: > > [...] > > > > I thought I have already mentioned an example. Say you have an observer > > > > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard > > > > limit) on the root of it. If you get an OOM event then you know that the > > > > whole hierarchy might be underprovisioned and perform some rebalancing. > > > > Now you really do not care that somewhere down the delegated tree there > > > > was an oom. Such a spurious event would just confuse the monitoring and > > > > lead to wrong decisions. > > > > > > You can construct a usecase like this, as per above with OOM, but it's > > > incredibly unlikely for something like this to exist. There is plenty > > > of evidence on adoption rate that supports this: we know where the big > > > names in containerization are; we see the things we run into that have > > > not been reported yet etc. > > > > > > Compare this to real problems this has already caused for > > > us. Multi-level control and monitoring is a fundamental concept of the > > > cgroup design, so naturally our infrastructure doesn't monitor and log > > > at the individual job level (too much data, and also kind of pointless > > > when the jobs are identical) but at aggregate parental levels. > > > > > > Because of this wart, we have missed problematic configurations when > > > the low, high, max events were not propagated as expected (we log oom > > > separately, so we still noticed those). Even once we knew about it, we > > > had trouble tracking these configurations down for the same reason - > > > the data isn't logged, and won't be logged, at this level. > > > > Yes, I do understand that you might be interested in the hierarchical > > accounting. > > > > > Adding a separate, hierarchical file would solve this one particular > > > problem for us, but it wouldn't fix this pitfall for all future users > > > of cgroup2 (which by all available evidence is still most of them) and > > > would be a wart on the interface that we'd carry forever. > > > > I understand even this reasoning but if I have to chose between a risk > > of user breakage that would require to reimplement the monitoring or an > > API incosistency I vote for the first option. It is unfortunate but this > > is the way we deal with APIs and compatibility. > > I don't know why you keep repeating this, it's simply not how Linux > API is maintained in practice. > > In cgroup2, we fixed io.stat to not conflate discard IO and write IO: > 636620b66d5d4012c4a9c86206013964d3986c4f > > Linus changed the Vmalloc field semantics in /proc/meminfo after over > a decade, without a knob to restore it in production: > > If this breaks anything, we'll obviously have to re-introduce the code > to compute this all and add the caching patches on top. But if given > the option, I'd really prefer to just remove this bad idea entirely > rather than add even more code to work around our historical mistake > that likely nobody really cares about. > a5ad88ce8c7fae7ddc72ee49a11a75aa837788e0 > > Mel changed the zone_reclaim_mode default behavior after over a > decade: > > Those that require zone_reclaim_mode are likely to be able to > detect when it needs to be enabled and tune appropriately so lets > have a sensible default for the bulk of users. 
> 4f9b16a64753d0bb607454347036dc997fd03b82 > Acked-by: Michal Hocko <mhocko@suse.cz> > > And then Mel changed the default zonelist ordering to pick saner > behavior for most users, followed by a complete removal of the zone > list ordering, after again, decades of existence of these things: > > commit c9bff3eebc09be23fbc868f5e6731666d23cbea3 > Author: Michal Hocko <mhocko@suse.com> > Date: Wed Sep 6 16:20:13 2017 -0700 > > mm, page_alloc: rip out ZONELIST_ORDER_ZONE > > And why did we do any of those things and risk user disruption every > single time? Because the existing behavior was not a good default, a > burden on people, and the risk of breakage was sufficiently low. > > I don't see how this case is different, and you haven't provided any > arguments that would explain that. Because there is no simple way to revert in _this_ particular case. Once you change the semantic of the file you cannot simply make it non-hierarchical after somebody complains. You do not want to break both worlds. See the difference? [...] > > Those users requiring the hierarchical beahvior can use the new file > > without any risk of breakages so I really do not see why we should > > undertake the risk and do it the other way around. > > Okay, so let's find a way forward here. > > 1. A new memory.events_tree file or similar. This would give us a way > to get the desired hierarchical behavior. The downside is that it's > suggesting that ${x} and ${x}_tree are the local and hierarchical > versions of a cgroup file, and that's false everywhere else. Saying we > would document it is a cop-out and doesn't actually make the interface > less confusing (most people don't look at errata documentation until > they've been burned by unexpected behavior). > > 2. A runtime switch (cgroup mount option, sysctl, what have you) that > lets you switch between the local and the tree behavior. This would be > able to provide the desired semantics in a clean interface, while > still having the ability to support legacy users. With an obvious downside that one or the other usecase has to learn that the current semantic is different than expected which is again something that has to be documented so we are in the same "people don't look at errata documentation...". Another obvious problem is that you might have two workloads with different semantic expectations and then this option simply falls flat. > 2a. A runtime switch that defaults to the local behavior. > > 2b. A runtime switch that defaults to the tree behavior. > > The choice between 2a and 2b comes down to how big we evaluate the > risk that somebody has an existing dependency on the local behavior. > > Given what we know about cgroup2 usage, and considering our previous > behavior in such matters, I'd say 2b is reasonable and in line with > how we tend to handle these things. On the tiny chance that somebody > is using the current behavior, they can flick the switch (until we add > the .local files, or simply use the switch forever). My preference is 1 but if there is a _larger_ consensus of different cgroup v2 users that 2 is more preferred then I can live with that.
On Fri, Feb 01, 2019 at 11:27:41AM +0100, Michal Hocko wrote: > On Thu 31-01-19 11:22:48, Johannes Weiner wrote: > > On Thu, Jan 31, 2019 at 09:58:08AM +0100, Michal Hocko wrote: > > > On Wed 30-01-19 16:31:31, Johannes Weiner wrote: > > > > On Wed, Jan 30, 2019 at 09:05:59PM +0100, Michal Hocko wrote: > > > [...] > > > > > I thought I have already mentioned an example. Say you have an observer > > > > > on the top of a delegated cgroup hierarchy and you setup limits (e.g. hard > > > > > limit) on the root of it. If you get an OOM event then you know that the > > > > > whole hierarchy might be underprovisioned and perform some rebalancing. > > > > > Now you really do not care that somewhere down the delegated tree there > > > > > was an oom. Such a spurious event would just confuse the monitoring and > > > > > lead to wrong decisions. > > > > > > > > You can construct a usecase like this, as per above with OOM, but it's > > > > incredibly unlikely for something like this to exist. There is plenty > > > > of evidence on adoption rate that supports this: we know where the big > > > > names in containerization are; we see the things we run into that have > > > > not been reported yet etc. > > > > > > > > Compare this to real problems this has already caused for > > > > us. Multi-level control and monitoring is a fundamental concept of the > > > > cgroup design, so naturally our infrastructure doesn't monitor and log > > > > at the individual job level (too much data, and also kind of pointless > > > > when the jobs are identical) but at aggregate parental levels. > > > > > > > > Because of this wart, we have missed problematic configurations when > > > > the low, high, max events were not propagated as expected (we log oom > > > > separately, so we still noticed those). Even once we knew about it, we > > > > had trouble tracking these configurations down for the same reason - > > > > the data isn't logged, and won't be logged, at this level. > > > > > > Yes, I do understand that you might be interested in the hierarchical > > > accounting. > > > > > > > Adding a separate, hierarchical file would solve this one particular > > > > problem for us, but it wouldn't fix this pitfall for all future users > > > > of cgroup2 (which by all available evidence is still most of them) and > > > > would be a wart on the interface that we'd carry forever. > > > > > > I understand even this reasoning but if I have to chose between a risk > > > of user breakage that would require to reimplement the monitoring or an > > > API incosistency I vote for the first option. It is unfortunate but this > > > is the way we deal with APIs and compatibility. > > > > I don't know why you keep repeating this, it's simply not how Linux > > API is maintained in practice. > > > > In cgroup2, we fixed io.stat to not conflate discard IO and write IO: > > 636620b66d5d4012c4a9c86206013964d3986c4f > > > > Linus changed the Vmalloc field semantics in /proc/meminfo after over > > a decade, without a knob to restore it in production: > > > > If this breaks anything, we'll obviously have to re-introduce the code > > to compute this all and add the caching patches on top. But if given > > the option, I'd really prefer to just remove this bad idea entirely > > rather than add even more code to work around our historical mistake > > that likely nobody really cares about. 
> > a5ad88ce8c7fae7ddc72ee49a11a75aa837788e0 > > > > Mel changed the zone_reclaim_mode default behavior after over a > > decade: > > > > Those that require zone_reclaim_mode are likely to be able to > > detect when it needs to be enabled and tune appropriately so lets > > have a sensible default for the bulk of users. > > 4f9b16a64753d0bb607454347036dc997fd03b82 > > Acked-by: Michal Hocko <mhocko@suse.cz> > > > > And then Mel changed the default zonelist ordering to pick saner > > behavior for most users, followed by a complete removal of the zone > > list ordering, after again, decades of existence of these things: > > > > commit c9bff3eebc09be23fbc868f5e6731666d23cbea3 > > Author: Michal Hocko <mhocko@suse.com> > > Date: Wed Sep 6 16:20:13 2017 -0700 > > > > mm, page_alloc: rip out ZONELIST_ORDER_ZONE > > > > And why did we do any of those things and risk user disruption every > > single time? Because the existing behavior was not a good default, a > > burden on people, and the risk of breakage was sufficiently low. > > > > I don't see how this case is different, and you haven't provided any > > arguments that would explain that. > > Because there is no simple way to revert in _this_ particular case. Once > you change the semantic of the file you cannot simply make it > non-hierarchical after somebody complains. You do not want to break both > worlds. See the difference? Yes and no. We cannot revert if both cases are in use, but we can support both cases; add the .local file, or - for binary compatibility - add the compat mount flag to switch to the old behavior. In the vmalloc and the zonelist_order_zone cases above, any users that would rely on those features would have to blacklist certain kernel versions in which they are not available. In this case, any users that rely on the old behavior would have to mount cgroup2 with the compat knob. Arguably, it's easier to pass a mount option than it is to change the entire kernel. > > > Those users requiring the hierarchical beahvior can use the new file > > > without any risk of breakages so I really do not see why we should > > > undertake the risk and do it the other way around. > > > > Okay, so let's find a way forward here. > > > > 1. A new memory.events_tree file or similar. This would give us a way > > to get the desired hierarchical behavior. The downside is that it's > > suggesting that ${x} and ${x}_tree are the local and hierarchical > > versions of a cgroup file, and that's false everywhere else. Saying we > > would document it is a cop-out and doesn't actually make the interface > > less confusing (most people don't look at errata documentation until > > they've been burned by unexpected behavior). > > > > 2. A runtime switch (cgroup mount option, sysctl, what have you) that > > lets you switch between the local and the tree behavior. This would be > > able to provide the desired semantics in a clean interface, while > > still having the ability to support legacy users. > > With an obvious downside that one or the other usecase has to learn that > the current semantic is different than expected which is again something > that has to be documented so we are in the same "people don't look at > errata documentation...". Of course, but that's where the "which situation is more likely" comes in. That was the basis for all the historic changes I quoted above. > > 2a. A runtime switch that defaults to the local behavior. > > > > 2b. A runtime switch that defaults to the tree behavior. 
> > > > The choice between 2a and 2b comes down to how big we evaluate the > > risk that somebody has an existing dependency on the local behavior. > > > > Given what we know about cgroup2 usage, and considering our previous > > behavior in such matters, I'd say 2b is reasonable and in line with > > how we tend to handle these things. On the tiny chance that somebody > > is using the current behavior, they can flick the switch (until we add > > the .local files, or simply use the switch forever). > > My preference is 1 but if there is a _larger_ consensus of different > cgroup v2 users that 2 is more preferred then I can live with that. Thanks.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 380a212a8c52..5428b372def4 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -769,8 +769,10 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, static inline void memcg_memory_event(struct mem_cgroup *memcg, enum memcg_memory_event event) { - atomic_long_inc(&memcg->memory_events[event]); - cgroup_file_notify(&memcg->events_file); + do { + atomic_long_inc(&memcg->memory_events[event]); + cgroup_file_notify(&memcg->events_file); + } while ((memcg = parent_mem_cgroup(memcg))); } static inline void memcg_memory_event_mm(struct mm_struct *mm,
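As a rough sketch of how the propagated counters pair with the bundled file notification (paths are illustrative; the poll-and-re-read pattern is the usual way to watch cgroup2 event files), a single watcher on the delegation point now wakes up for events anywhere underneath it:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/fs/cgroup/system.slice/memory.events", O_RDONLY);
	struct pollfd pfd;
	char buf[256];
	ssize_t n;

	if (fd < 0) {
		perror("open memory.events");
		return 1;
	}
	pfd.fd = fd;
	pfd.events = POLLPRI;	/* kernfs notifications arrive as POLLPRI|POLLERR */

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		/* Re-read from the start to pick up the updated counters. */
		if (lseek(fd, 0, SEEK_SET) < 0)
			break;
		n = read(fd, buf, sizeof(buf) - 1);
		if (n <= 0)
			break;
		buf[n] = '\0';
		printf("%s", buf);
	}
	close(fd);
	return 0;
}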