Message ID: 20180730180100.25079-1-guro@fb.com (mailing list archive)
Series: introduce memory.oom.group
On Mon, 30 Jul 2018, Roman Gushchin wrote: > This is a tiny implementation of cgroup-aware OOM killer, > which adds an ability to kill a cgroup as a single unit > and so guarantee the integrity of the workload. > > Although it has only a limited functionality in comparison > to what now resides in the mm tree (it doesn't change > the victim task selection algorithm, doesn't look > at memory stas on cgroup level, etc), it's also much > simpler and more straightforward. So, hopefully, we can > avoid having long debates here, as we had with the full > implementation. > > As it doesn't prevent any futher development, > and implements an useful and complete feature, > it looks as a sane way forward. > > This patchset is against Linus's tree to avoid conflicts > with the cgroup-aware OOM killer patchset in the mm tree. > > Two first patches are already in the mm tree. > The first one ("mm: introduce mem_cgroup_put() helper") > is totally fine, and the second's commit message has to be > changed to reflect that it's not a part of old patchset > anymore. > What's the plan with the cgroup aware oom killer? It has been sitting in the -mm tree for ages with no clear path to being merged. Are you suggesting this patchset as a preliminary series so the cgroup aware oom killer should be removed from the -mm tree and this should be merged instead? If so, what is the plan going forward for the cgroup aware oom killer? Are you planning on reviewing the patchset to fix the cgroup aware oom killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has been waiting for feedback since March?
On Mon, Jul 30, 2018 at 06:49:31PM -0700, David Rientjes wrote: > On Mon, 30 Jul 2018, Roman Gushchin wrote: > > > This is a tiny implementation of cgroup-aware OOM killer, > > which adds an ability to kill a cgroup as a single unit > > and so guarantee the integrity of the workload. > > > > Although it has only a limited functionality in comparison > > to what now resides in the mm tree (it doesn't change > > the victim task selection algorithm, doesn't look > > at memory stas on cgroup level, etc), it's also much > > simpler and more straightforward. So, hopefully, we can > > avoid having long debates here, as we had with the full > > implementation. > > > > As it doesn't prevent any futher development, > > and implements an useful and complete feature, > > it looks as a sane way forward. > > > > This patchset is against Linus's tree to avoid conflicts > > with the cgroup-aware OOM killer patchset in the mm tree. > > > > Two first patches are already in the mm tree. > > The first one ("mm: introduce mem_cgroup_put() helper") > > is totally fine, and the second's commit message has to be > > changed to reflect that it's not a part of old patchset > > anymore. > > What's the plan with the cgroup aware oom killer? It has been sitting in > the -mm tree for ages with no clear path to being merged. > > Are you suggesting this patchset as a preliminary series so the cgroup > aware oom killer should be removed from the -mm tree and this should be > merged instead? If so, what is the plan going forward for the cgroup > aware oom killer? > > Are you planning on reviewing the patchset to fix the cgroup aware oom > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > been waiting for feedback since March? I would say it's been waiting for feedback since March because every person that could give meaningful feedback on it, incl. the memcg and cgroup maintainers, is happy with Roman's current patches in -mm, is not convinced by the arguments you have made over months of discussions, and has serious reservations about the configurable OOM policies you propose as follow-up fix to Roman's patches. This pared-down version of the patches proposed here is to accomodate you and table discussions about policy for now. But your patchset is highly unlikely to gain any sort of traction in the future. Also I don't think there is any controversy here over what the way forward should be. Roman's patches should have been merged months ago.
On Mon, Jul 30, 2018 at 06:49:31PM -0700, David Rientjes wrote: > On Mon, 30 Jul 2018, Roman Gushchin wrote: > > > This is a tiny implementation of cgroup-aware OOM killer, > > which adds an ability to kill a cgroup as a single unit > > and so guarantee the integrity of the workload. > > > > Although it has only a limited functionality in comparison > > to what now resides in the mm tree (it doesn't change > > the victim task selection algorithm, doesn't look > > at memory stas on cgroup level, etc), it's also much > > simpler and more straightforward. So, hopefully, we can > > avoid having long debates here, as we had with the full > > implementation. > > > > As it doesn't prevent any futher development, > > and implements an useful and complete feature, > > it looks as a sane way forward. > > > > This patchset is against Linus's tree to avoid conflicts > > with the cgroup-aware OOM killer patchset in the mm tree. > > > > Two first patches are already in the mm tree. > > The first one ("mm: introduce mem_cgroup_put() helper") > > is totally fine, and the second's commit message has to be > > changed to reflect that it's not a part of old patchset > > anymore. > > > > What's the plan with the cgroup aware oom killer? It has been sitting in > the -mm tree for ages with no clear path to being merged. It's because your nack, isn't it? Everybody else seem to be fine with it. > > Are you suggesting this patchset as a preliminary series so the cgroup > aware oom killer should be removed from the -mm tree and this should be > merged instead? If so, what is the plan going forward for the cgroup > aware oom killer? Answered below. > > Are you planning on reviewing the patchset to fix the cgroup aware oom > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > been waiting for feedback since March? > I already did. As I said, I find the proposed oom_policy interface confusing. I'm not sure I understand why some memcg OOMs should be handled by memcg-aware OOMs, while other by the traditional per-process logic; and why this should be set on the OOMing memcg. IMO this adds nothing but confusion. If it's just a way to get rid of mount option, it doesn't look nice to me (neither I'm fan of the mount option). If you need an option to evaluate a cgroup as a whole, but kill only one task inside (the ability we've discussed before), let's make it clear. It's possible with the new memory.oom.group. Despite mentioning the lack of priority tuning in the list of problems, you do not propose anything. I agree it's hard, but why mentioning then? Patches which adjust root memory cgroup accounting and NUMA handling should be handled separately, they are really not about the interface. I've nothing against them. Again, I don't like the proposed interface, it doesn't feel clear. I think, this is the reason, why your patchset didn't collect any acks since March. I'm not blocking any progress here, it's not on me. Anyway, at this point I really think that this patch (memory.oom.group) is a reasonable way forward. It implements a useful and complete feature, doesn't block any further development and has a clean interface. So, you can build memory.oom.policy on top of it. Does this sound good?
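[As a concrete reference for the knob under discussion: a minimal user-space sketch of enabling memory.oom.group. This is not part of the patchset; it assumes cgroup2 is mounted at /sys/fs/cgroup, a kernel carrying this series, and an example cgroup named "workload" (the path is illustrative only).]

/*
 * Minimal sketch, not from the patchset: opt a cgroup into group OOM
 * kills by writing "1" to its memory.oom.group file.  Assumes cgroup2
 * is mounted at /sys/fs/cgroup and the "workload" cgroup exists.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/sys/fs/cgroup/workload/memory.oom.group";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return EXIT_FAILURE;
    }

    /*
     * "1": if the OOM killer selects a task in this cgroup, the whole
     * cgroup is killed as a single unit, preserving workload integrity.
     * "0" (the default) keeps the traditional single-task kill.
     */
    if (fputs("1\n", f) == EOF)
        perror(path);

    fclose(f);
    return 0;
}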
On Tue, 31 Jul 2018, Roman Gushchin wrote: > > What's the plan with the cgroup aware oom killer? It has been sitting in > > the -mm tree for ages with no clear path to being merged. > > It's because your nack, isn't it? > Everybody else seem to be fine with it. > If they are fine with it, I'm not sure they have tested it :) Killing entire cgroups needlessly for mempolicy oom kills that will not free memory on target nodes is the first regression they may notice. It also unnecessarily uses oom_score_adj settings only when attached to the root mem cgroup. That may be fine in very specialized usecases but your bash shell being considered equal to a 96GB cgroup isn't very useful. These are all fixed in my follow-up patch series which you say you have reviewed later in this email. > > Are you planning on reviewing the patchset to fix the cgroup aware oom > > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > > been waiting for feedback since March? > > > > I already did. > As I said, I find the proposed oom_policy interface confusing. > I'm not sure I understand why some memcg OOMs should be handled > by memcg-aware OOMs, while other by the traditional per-process > logic; and why this should be set on the OOMing memcg. > IMO this adds nothing but confusion. > If your entire review was the email to a single patch, I misinterpreted that as the entire review not being done, sorry. I volunteered to separate out the logic to determine if a cgroup should be considered on its own (kill the largest cgroup on the system) or whether to consider subtree usage as well into its own tunable. I haven't received an answer, yet, but it's a trivial patch on top of my series if you prefer. Just let me know so we can make progress. > it doesn't look nice to me (neither I'm fan of the mount option). > If you need an option to evaluate a cgroup as a whole, but kill > only one task inside (the ability we've discussed before), > let's make it clear. It's possible with the new memory.oom.group. > The purpose is for subtrees delegated to users so that they can continue to expect the same process being oom killed, with oom_score_adj respected, even though the ancestor oom policy is for cgroup aware targeting. It is perfectly legitimate, and necessary, for a user who controls their own subtree to prefer killing of the single largest process as it has always been done. Secondary to that is their ability to influence the decision with oom_score_adj, which they lose without my patches. > Patches which adjust root memory cgroup accounting and NUMA > handling should be handled separately, they are really not > about the interface. I've nothing against them. > That's good to know, it would be helpful if you would ack the patches that you are not objecting to. Your feedback about the overloading of "cgroup" and "tree" is well received and I can easily separate that into a tunable as I said. I do not know of any user that would want to specify "tree" without having cgroup aware behavior, however. If you would prefer this, please let me know! > Anyway, at this point I really think that this patch (memory.oom.group) > is a reasonable way forward. It implements a useful and complete feature, > doesn't block any further development and has a clean interface. > So, you can build memory.oom.policy on top of it. > Does this sound good? > I have no objection to this series, of course. The functionality of group oom was unchanged in my series. 
I'd very much appreciate a review of my patchset, though, so the cgroup-aware policy can be merged as well.
On Wed, Aug 01, 2018 at 02:51:25PM -0700, David Rientjes wrote: > On Tue, 31 Jul 2018, Roman Gushchin wrote: > > > > What's the plan with the cgroup aware oom killer? It has been sitting in > > > the -mm tree for ages with no clear path to being merged. > > > > It's because your nack, isn't it? > > Everybody else seem to be fine with it. > > > > If they are fine with it, I'm not sure they have tested it :) Killing > entire cgroups needlessly for mempolicy oom kills that will not free > memory on target nodes is the first regression they may notice. It also > unnecessarily uses oom_score_adj settings only when attached to the root > mem cgroup. That may be fine in very specialized usecases but your bash > shell being considered equal to a 96GB cgroup isn't very useful. These > are all fixed in my follow-up patch series which you say you have reviewed > later in this email. > > > > Are you planning on reviewing the patchset to fix the cgroup aware oom > > > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > > > been waiting for feedback since March? > > > > > > > I already did. > > As I said, I find the proposed oom_policy interface confusing. > > I'm not sure I understand why some memcg OOMs should be handled > > by memcg-aware OOMs, while other by the traditional per-process > > logic; and why this should be set on the OOMing memcg. > > IMO this adds nothing but confusion. > > > > If your entire review was the email to a single patch, I misinterpreted > that as the entire review not being done, sorry. I volunteered to > separate out the logic to determine if a cgroup should be considered on > its own (kill the largest cgroup on the system) or whether to consider > subtree usage as well into its own tunable. I haven't received an > answer, yet, but it's a trivial patch on top of my series if you prefer. > Just let me know so we can make progress. > > > it doesn't look nice to me (neither I'm fan of the mount option). > > If you need an option to evaluate a cgroup as a whole, but kill > > only one task inside (the ability we've discussed before), > > let's make it clear. It's possible with the new memory.oom.group. > > > > The purpose is for subtrees delegated to users so that they can continue > to expect the same process being oom killed, with oom_score_adj > respected, even though the ancestor oom policy is for cgroup aware > targeting. It is perfectly legitimate, and necessary, for a user who > controls their own subtree to prefer killing of the single largest process > as it has always been done. Secondary to that is their ability to > influence the decision with oom_score_adj, which they lose without my > patches. > > > Patches which adjust root memory cgroup accounting and NUMA > > handling should be handled separately, they are really not > > about the interface. I've nothing against them. > > > > That's good to know, it would be helpful if you would ack the patches that > you are not objecting to. Your feedback about the overloading of "cgroup" > and "tree" is well received and I can easily separate that into a tunable > as I said. I do not know of any user that would want to specify "tree" > without having cgroup aware behavior, however. If you would prefer this, > please let me know! > > > Anyway, at this point I really think that this patch (memory.oom.group) > > is a reasonable way forward. It implements a useful and complete feature, > > doesn't block any further development and has a clean interface. > > So, you can build memory.oom.policy on top of it. 
> > Does this sound good? > > > > I have no objection to this series, of course. The functionality of group oom was unchanged in my series. I'd very much appreciate a review of my patchset, though, so the cgroup-aware policy can be merged as well. >

Ok, I think that's what we'll do here:

1) drop the current cgroup-aware OOM killer implementation from the mm tree
2) land memory.oom.group to the mm tree (your ack will be appreciated)
3) discuss and, hopefully, agree on memory.oom.policy interface
4) land memory.oom.policy

Basically, with oom.group separated everything we need is another boolean knob, which means that the memcg should be evaluated together. Am I right? If so, the main problem to solve is how to handle the following case:

    A
   / \        B/memory.oom.evaluate_as_a_group* = 1
  B   C       C/memory.oom.evaluate_as_a_group* = 0
 / \
D   E

* I do not propose to use this name, just for example.

In this case you have to compare tasks in C with cgroup B. And this is what I'd like to avoid. Maybe it should be enforced on A's level? I don't think it should be linked to the OOMing group, as in your patchset.

I would really prefer to discuss the interface first, without going into code and implementation details. It's not because I do not appreciate your work, only because it's hard to think about the interface when there are two big patchsets on top of each other.

Thank you!
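[To make the objection above concrete: a small, self-contained user-space simulation, not kernel code, of what a naive implementation would do in that hierarchy. B, evaluated as a group, contributes one aggregate score, while C's tasks are scored individually, so unlike entities end up in the same comparison. All names and numbers are invented for the example.]

/*
 * Illustrative simulation only: cgroup B is scored as a whole subtree
 * while C's tasks are scored individually, so the victim comparison
 * mixes unlike entities.  Usage values are made-up MiB figures.
 */
#include <stdio.h>

struct candidate {
    const char *name;
    long usage_mib;        /* charged memory, in MiB */
    int is_whole_cgroup;   /* 1 = aggregated subtree, 0 = single task */
};

int main(void)
{
    struct candidate candidates[] = {
        { "cgroup B (B + D + E aggregated)", 3072, 1 },
        { "task 1 in C",                     1024, 0 },
        { "task 2 in C",                     2048, 0 },
    };
    int n = sizeof(candidates) / sizeof(candidates[0]);
    int victim = 0;

    for (int i = 1; i < n; i++)
        if (candidates[i].usage_mib > candidates[victim].usage_mib)
            victim = i;

    printf("selected victim: %s (%ld MiB, %s)\n",
           candidates[victim].name, candidates[victim].usage_mib,
           candidates[victim].is_whole_cgroup ?
           "an entire subtree" : "a single task");
    return 0;
}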
On Wed 01-08-18 14:51:25, David Rientjes wrote: > On Tue, 31 Jul 2018, Roman Gushchin wrote: > > > > What's the plan with the cgroup aware oom killer? It has been sitting in > > > the -mm tree for ages with no clear path to being merged. > > > > It's because your nack, isn't it? > > Everybody else seem to be fine with it. > > > > If they are fine with it, I'm not sure they have tested it :) Killing > entire cgroups needlessly for mempolicy oom kills that will not free > memory on target nodes is the first regression they may notice. I do not remember you would be mentioning this previously. Anyway the older implementation has considered the nodemask in memcg_oom_badness. You are right that a cpuset allocation could needlessly select a memcg with small or no memory from the target nodemask which is something I could have noticed during the review. If only I didn't have to spend all my energy to go through repetitive arguments of yours. Anyway this would be quite trivial to resolve in the same function by checking node_isset(node, current->mems_allowed). Thanks for your productive feedback again. Skipping the rest which is yet again repeating same arguments and it doesn't add anything new to the table.
On Wed, 1 Aug 2018, Roman Gushchin wrote: > Ok, I think that what we'll do here: > 1) drop the current cgroup-aware OOM killer implementation from the mm tree > 2) land memory.oom.group to the mm tree (your ack will be appreciated) > 3) discuss and, hopefully, agree on memory.oom.policy interface > 4) land memory.oom.policy > Yes, I'm fine proceeding this way, there's a clear separation between the policy and mechanism and they can be introduced independent of each other. As I said in my patchset, we can also introduce policies independent of each other and I have no objection to your design that addresses your specific usecase, with your own policy decisions, with the added caveat that we do so in a way that respects other usecases. Specifically, I would ask that the following be respected: - Subtrees delegated to users can still operate as they do today with per-process selection (largest, or influenced by oom_score_adj) so their victim selection is not changed out from under them. This requires the entire hierarchy is not locked into a specific policy, and also that a subtree is not locked in a specific policy. In other words, if an oom condition occurs in a user-controlled subtree they have the ability to get the same selection criteria as they do today. - Policies are implemented in a way that has an extensible API so that we do not unnecessarily limit or prohibit ourselves from making changes in the future or from extending the functionality by introducing other policy choices that are needed in the future. I hope that I'm not being unrealistic in assuming that you're fine with these since it can still preserve your goals. > Basically, with oom.group separated everything we need is another > boolean knob, which means that the memcg should be evaluated together. In a cgroup-aware oom killer world, yes, we need the ability to specify that the usage of the entire subtree should be compared as a single entity with other cgroups. That is necessary for user subtrees but may not be necessary for top-level cgroups depending on how you structure your unified cgroup hierarchy. So it needs to be configurable, as you suggest, and you are correct it can be different than oom.group. That's not the only thing we need though, as I'm sure you were expecting me to say :) We need the ability to preserve existing behavior, i.e. process based and not cgroup aware, for subtrees so that our users who have clear expectations and tune their oom_score_adj accordingly based on how the oom killer has always chosen processes for oom kill do not suddenly regress. So we need to define the policy for a subtree that is oom, and I suggest we do that as a characteristic of the cgroup that is oom ("process" vs "cgroup", and process would be the default to preserve what currently happens in a user subtree). Now, as users who rely on process selection are well aware, we have oom_score_adj to influence the decision of which process to oom kill. If our oom subtree is cgroup aware, we should have the ability to likewise influence that decision. For example, we have high priority applications that run at the top-level that use a lot of memory and strictly oom killing them in all scenarios because they use a lot of memory isn't appropriate. We need to be able to adjust the comparison of a cgroup (or subtree) when compared to other cgroups. I've also suggested, but did not implement in my patchset because I was trying to define the API and find common ground first, that we have a need for priority based selection. 
In other words, define the priority of a subtree regardless of cgroup usage.

So with these four things, we have

 - an "oom.policy" tunable to define "cgroup" or "process" for that subtree (and plans for "priority" in the future),

 - your "oom.evaluate_as_group" tunable to account the usage of the subtree as the cgroup's own usage for comparison with others,

 - an "oom.adj" to adjust the usage of the cgroup (local or subtree) to protect important applications and bias against unimportant applications.

This adds several tunables, which I didn't like, so I tried to overload oom.policy and oom.evaluate_as_group. When I referred to separating out the subtree usage accounting into a separate tunable, that is what I have referenced above.

So when a cgroup is oom, oom.policy defines the selection. The cgroup here could be root for when the system is oom. If "process", nothing else matters: we iterate and find the largest process (modulo oom_score_adj) and kill it. We then look at oom.group and determine if additional processes should be oom killed.

If "cgroup", we determine the local usage of each cgroup in the subtree. If oom.evaluate_as_group is enabled for a cgroup, we add the usage from each cgroup in the subtree to that cgroup. We then add oom.adj, which can be positive or negative, for the cgroup's overall score. Each cgroup then has a score that can be compared fairly to one another and the oom kill can occur. We then look at oom.group and determine if additional processes should be oom killed.

With plans for an oom.policy of "priority", I would define that priority in oom.adj. Here, oom.evaluate_as_group can still be useful, which is great. If smaller priorities mean higher preference for oom kill, we compare the oom.adj of all direct children and iterate the smallest. If oom.evaluate_as_group is set, the smallest oom.adj from the subtree is used.

This is how I envisioned the functionality of the cgroup aware oom killer when I wrote my patchset and would be happy to hear your input or suggestions on it.
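[None of the knobs described above (oom.policy, oom.evaluate_as_group, oom.adj) exist in the kernel; they are proposals under discussion in this thread. Purely as a reading aid, the selection flow sketched in the preceding paragraphs could be simulated in user space roughly as follows, with made-up sibling cgroups and usage numbers.]

/*
 * Reading aid only: a user-space simulation of the proposed selection
 * flow under oom.policy = "cgroup".  The oom.evaluate_as_group and
 * oom.adj knobs are proposals from this discussion, not an existing
 * kernel interface, and every value below is invented for the example.
 */
#include <stdio.h>

struct cgroup_candidate {
    const char *name;
    long local_mib;        /* MiB charged directly to this cgroup */
    long subtree_mib;      /* local usage plus all descendants */
    int evaluate_as_group; /* proposed oom.evaluate_as_group */
    long adj_mib;          /* proposed oom.adj bias, may be negative */
};

static long oom_score(const struct cgroup_candidate *c)
{
    long score = c->evaluate_as_group ? c->subtree_mib : c->local_mib;

    return score + c->adj_mib;   /* bias important/unimportant subtrees */
}

int main(void)
{
    /* Direct children of the OOMing cgroup. */
    struct cgroup_candidate children[] = {
        { "david", 2048,  6144, 0, 0 },
        { "roman", 1024, 24576, 1, -8192 },
    };
    int n = sizeof(children) / sizeof(children[0]);
    int victim = 0;

    for (int i = 1; i < n; i++)
        if (oom_score(&children[i]) > oom_score(&children[victim]))
            victim = i;

    /*
     * The chosen cgroup's own oom.group (the knob this series adds)
     * would then decide whether one task or the whole group is killed.
     */
    printf("victim cgroup: %s (score %ld MiB)\n",
           children[victim].name, oom_score(&children[victim]));
    return 0;
}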
On Mon, Aug 06, 2018 at 02:34:06PM -0700, David Rientjes wrote: > On Wed, 1 Aug 2018, Roman Gushchin wrote: > > > Ok, I think that what we'll do here: > > 1) drop the current cgroup-aware OOM killer implementation from the mm tree > > 2) land memory.oom.group to the mm tree (your ack will be appreciated) > > 3) discuss and, hopefully, agree on memory.oom.policy interface > > 4) land memory.oom.policy > > > > Yes, I'm fine proceeding this way, there's a clear separation between the > policy and mechanism and they can be introduced independent of each other. > As I said in my patchset, we can also introduce policies independent of > each other and I have no objection to your design that addresses your > specific usecase, with your own policy decisions, with the added caveat > that we do so in a way that respects other usecases. > > Specifically, I would ask that the following be respected: > > - Subtrees delegated to users can still operate as they do today with > per-process selection (largest, or influenced by oom_score_adj) so > their victim selection is not changed out from under them. This > requires the entire hierarchy is not locked into a specific policy, > and also that a subtree is not locked in a specific policy. In other > words, if an oom condition occurs in a user-controlled subtree they > have the ability to get the same selection criteria as they do today. > > - Policies are implemented in a way that has an extensible API so that > we do not unnecessarily limit or prohibit ourselves from making changes > in the future or from extending the functionality by introducing other > policy choices that are needed in the future. > > I hope that I'm not being unrealistic in assuming that you're fine with > these since it can still preserve your goals. > > > Basically, with oom.group separated everything we need is another > > boolean knob, which means that the memcg should be evaluated together. > > In a cgroup-aware oom killer world, yes, we need the ability to specify > that the usage of the entire subtree should be compared as a single > entity with other cgroups. That is necessary for user subtrees but may > not be necessary for top-level cgroups depending on how you structure your > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > and you are correct it can be different than oom.group. > > That's not the only thing we need though, as I'm sure you were expecting > me to say :) > > We need the ability to preserve existing behavior, i.e. process based and > not cgroup aware, for subtrees so that our users who have clear > expectations and tune their oom_score_adj accordingly based on how the oom > killer has always chosen processes for oom kill do not suddenly regress. Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing this case? This basically means that if memcg is selected as target, the process inside will be selected using traditional per-process approach. > So we need to define the policy for a subtree that is oom, and I suggest > we do that as a characteristic of the cgroup that is oom ("process" vs > "cgroup", and process would be the default to preserve what currently > happens in a user subtree). I'm not entirely convinced here. I do agree, that some sub-tree may have a well tuned oom_score_adj, and it's preferable to keep the current behavior. At the same time I don't like the idea to look at the policy of the OOMing cgroup. Why exceeding of one limit should be handled different to exceeding of another? 
This seems to be a property of workload, not a limit. > > Now, as users who rely on process selection are well aware, we have > oom_score_adj to influence the decision of which process to oom kill. If > our oom subtree is cgroup aware, we should have the ability to likewise > influence that decision. For example, we have high priority applications > that run at the top-level that use a lot of memory and strictly oom > killing them in all scenarios because they use a lot of memory isn't > appropriate. We need to be able to adjust the comparison of a cgroup (or > subtree) when compared to other cgroups. > > I've also suggested, but did not implement in my patchset because I was > trying to define the API and find common ground first, that we have a need > for priority based selection. In other words, define the priority of a > subtree regardless of cgroup usage. > > So with these four things, we have > > - an "oom.policy" tunable to define "cgroup" or "process" for that > subtree (and plans for "priority" in the future), > > - your "oom.evaluate_as_group" tunable to account the usage of the > subtree as the cgroup's own usage for comparison with others, > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > to protect important applications and bias against unimportant > applications. > > This adds several tunables, which I didn't like, so I tried to overload > oom.policy and oom.evaluate_as_group. When I referred to separating out > the subtree usage accounting into a separate tunable, that is what I have > referenced above. IMO, merging multiple tunables into one doesn't make it saner. The real question is how to make a reasonable interface with fewer tunables. The reason behind introducing all these knobs is to provide a generic solution to define OOM handling rules, but then the question arises whether the kernel is the best place for it. I really doubt that an interface with so many knobs has any chances to be merged. IMO, there should be a compromise between the simplicity (basically, the number of tunables and possible values) and functionality of the interface. You nacked my previous version, and unfortunately I don't have anything better so far. Thanks!
On Mon, 6 Aug 2018, Roman Gushchin wrote: > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > that the usage of the entire subtree should be compared as a single > > entity with other cgroups. That is necessary for user subtrees but may > > not be necessary for top-level cgroups depending on how you structure your > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > and you are correct it can be different than oom.group. > > > > That's not the only thing we need though, as I'm sure you were expecting > > me to say :) > > > > We need the ability to preserve existing behavior, i.e. process based and > > not cgroup aware, for subtrees so that our users who have clear > > expectations and tune their oom_score_adj accordingly based on how the oom > > killer has always chosen processes for oom kill do not suddenly regress. > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > this case? This basically means that if memcg is selected as target, > the process inside will be selected using traditional per-process approach. > No, that would overload the policy and mechanism. We want the ability to consider user-controlled subtrees as a single entity for comparison with other user subtrees to select which subtree to target. This does not imply that users want their entire subtree oom killed. > > So we need to define the policy for a subtree that is oom, and I suggest > > we do that as a characteristic of the cgroup that is oom ("process" vs > > "cgroup", and process would be the default to preserve what currently > > happens in a user subtree). > > I'm not entirely convinced here. > I do agree, that some sub-tree may have a well tuned oom_score_adj, > and it's preferable to keep the current behavior. > > At the same time I don't like the idea to look at the policy of the OOMing > cgroup. Why exceeding of one limit should be handled different to exceeding > of another? This seems to be a property of workload, not a limit. > The limit is the property of the mem cgroup, so it's logical that the policy when reaching that limit is a property of the same mem cgroup. Using the user-controlled subtree example, if we have /david and /roman, we can define our own policies on oom, we are not restricted to cgroup aware selection on the entire hierarchy. /david/oom.policy can be "process" so that I haven't regressed with earlier kernels, and /roman/oom.policy can be "cgroup" to target the largest cgroup in your subtree. Something needs to be oom killed when a mem cgroup at any level in the hierarchy is reached and reclaim has failed. What to do when that limit is reached is a property of that cgroup. > > Now, as users who rely on process selection are well aware, we have > > oom_score_adj to influence the decision of which process to oom kill. If > > our oom subtree is cgroup aware, we should have the ability to likewise > > influence that decision. For example, we have high priority applications > > that run at the top-level that use a lot of memory and strictly oom > > killing them in all scenarios because they use a lot of memory isn't > > appropriate. We need to be able to adjust the comparison of a cgroup (or > > subtree) when compared to other cgroups. > > > > I've also suggested, but did not implement in my patchset because I was > > trying to define the API and find common ground first, that we have a need > > for priority based selection. In other words, define the priority of a > > subtree regardless of cgroup usage. 
> > > > So with these four things, we have > > > > - an "oom.policy" tunable to define "cgroup" or "process" for that > > subtree (and plans for "priority" in the future), > > > > - your "oom.evaluate_as_group" tunable to account the usage of the > > subtree as the cgroup's own usage for comparison with others, > > > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > > to protect important applications and bias against unimportant > > applications. > > > > This adds several tunables, which I didn't like, so I tried to overload > > oom.policy and oom.evaluate_as_group. When I referred to separating out > > the subtree usage accounting into a separate tunable, that is what I have > > referenced above. > > IMO, merging multiple tunables into one doesn't make it saner. > The real question how to make a reasonable interface with fever tunables. > > The reason behind introducing all these knobs is to provide > a generic solution to define OOM handling rules, but then the > question raises if the kernel is the best place for it. > > I really doubt that an interface with so many knobs has any chances > to be merged. > This is why I attempted to overload oom.policy and oom.evaluate_as_group: I could not think of a reasonable usecase where a subtree would be used to account for cgroup usage but not use a cgroup aware policy itself. You've objected to that, where memory.oom_policy == "tree" implied cgroup awareness in my patchset, so I've separated that out. > IMO, there should be a compromise between the simplicity (basically, > the number of tunables and possible values) and functionality > of the interface. You nacked my previous version, and unfortunately > I don't have anything better so far. > If you do not agree with the overloading and have a preference for single value tunables, then all three tunables are needed. This functionality could be represented as two or one tunable if they are not single value, but from the oom.group discussion you preferred single values. I assume you'd also object to adding and removing files based on oom.policy since oom.evaluate_as_group and oom.adj is only needed for oom.policy of "cgroup" or "priority", and they do not need to exist for the default oom.policy of "process".
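[Again purely illustrative: memory.oom.policy was proposed in this thread but never merged, so the file below does not exist in any released kernel, and the /david and /roman cgroup paths come straight from the example above. A sketch of how the per-subtree policies being argued for might be configured, were such an interface ever adopted.]

/*
 * Hypothetical interface: memory.oom.policy is a proposed file that was
 * never merged.  On current kernels this program simply reports failure.
 */
#include <stdio.h>

static int set_policy(const char *cgroup, const char *policy)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/%s/memory.oom.policy", cgroup);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", policy);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Delegated subtree keeps the traditional per-process victim selection. */
    set_policy("david", "process");
    /* Delegated subtree opts into cgroup-aware victim selection. */
    set_policy("roman", "cgroup");
    return 0;
}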
On Tue 07-08-18 15:34:58, David Rientjes wrote: > On Mon, 6 Aug 2018, Roman Gushchin wrote: > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > that the usage of the entire subtree should be compared as a single > > > entity with other cgroups. That is necessary for user subtrees but may > > > not be necessary for top-level cgroups depending on how you structure your > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > and you are correct it can be different than oom.group. > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > me to say :) > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > not cgroup aware, for subtrees so that our users who have clear > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > this case? This basically means that if memcg is selected as target, > > the process inside will be selected using traditional per-process approach. > > > > No, that would overload the policy and mechanism. We want the ability to > consider user-controlled subtrees as a single entity for comparison with > other user subtrees to select which subtree to target. This does not > imply that users want their entire subtree oom killed. Yeah, that's why oom.group == 0, no? Anyway, can we separate this discussion from the current series please? We are getting more and more tangent. Or do you still see the current state to be not mergeable?
On Wed, 8 Aug 2018, Michal Hocko wrote: > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > > that the usage of the entire subtree should be compared as a single > > > > entity with other cgroups. That is necessary for user subtrees but may > > > > not be necessary for top-level cgroups depending on how you structure your > > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > > and you are correct it can be different than oom.group. > > > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > > me to say :) > > > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > > not cgroup aware, for subtrees so that our users who have clear > > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > > this case? This basically means that if memcg is selected as target, > > > the process inside will be selected using traditional per-process approach. > > > > > > > No, that would overload the policy and mechanism. We want the ability to > > consider user-controlled subtrees as a single entity for comparison with > > other user subtrees to select which subtree to target. This does not > > imply that users want their entire subtree oom killed. > > Yeah, that's why oom.group == 0, no? > > Anyway, can we separate this discussion from the current series please? > We are getting more and more tangent. > > Or do you still see the current state to be not mergeable? I've said three times in this series that I am fine with it. Roman and I are discussing the API for making forward progress with the cgroup aware oom killer itself. When he responds, he can change the subject line if that would be helpful to you.
On Thu 09-08-18 13:10:10, David Rientjes wrote: > On Wed, 8 Aug 2018, Michal Hocko wrote: > > > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > > > that the usage of the entire subtree should be compared as a single > > > > > entity with other cgroups. That is necessary for user subtrees but may > > > > > not be necessary for top-level cgroups depending on how you structure your > > > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > > > and you are correct it can be different than oom.group. > > > > > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > > > me to say :) > > > > > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > > > not cgroup aware, for subtrees so that our users who have clear > > > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > > > this case? This basically means that if memcg is selected as target, > > > > the process inside will be selected using traditional per-process approach. > > > > > > > > > > No, that would overload the policy and mechanism. We want the ability to > > > consider user-controlled subtrees as a single entity for comparison with > > > other user subtrees to select which subtree to target. This does not > > > imply that users want their entire subtree oom killed. > > > > Yeah, that's why oom.group == 0, no? > > > > Anyway, can we separate this discussion from the current series please? > > We are getting more and more tangent. > > > > Or do you still see the current state to be not mergeable? > > I've said three times in this series that I am fine with it. OK, that wasn't really clear to me because I haven't see any explicit ack from you (well except for the trivial helper patch). So I was not sure. > Roman and I > are discussing the API for making forward progress with the cgroup aware > oom killer itself. When he responds, he can change the subject line if > that would be helpful to you. I do not insist of course but it would be easier to follow if that discussion was separate.
Roman, have you had time to go through this? On Tue, 7 Aug 2018, David Rientjes wrote: > On Mon, 6 Aug 2018, Roman Gushchin wrote: > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > that the usage of the entire subtree should be compared as a single > > > entity with other cgroups. That is necessary for user subtrees but may > > > not be necessary for top-level cgroups depending on how you structure your > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > and you are correct it can be different than oom.group. > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > me to say :) > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > not cgroup aware, for subtrees so that our users who have clear > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > this case? This basically means that if memcg is selected as target, > > the process inside will be selected using traditional per-process approach. > > > > No, that would overload the policy and mechanism. We want the ability to > consider user-controlled subtrees as a single entity for comparison with > other user subtrees to select which subtree to target. This does not > imply that users want their entire subtree oom killed. > > > > So we need to define the policy for a subtree that is oom, and I suggest > > > we do that as a characteristic of the cgroup that is oom ("process" vs > > > "cgroup", and process would be the default to preserve what currently > > > happens in a user subtree). > > > > I'm not entirely convinced here. > > I do agree, that some sub-tree may have a well tuned oom_score_adj, > > and it's preferable to keep the current behavior. > > > > At the same time I don't like the idea to look at the policy of the OOMing > > cgroup. Why exceeding of one limit should be handled different to exceeding > > of another? This seems to be a property of workload, not a limit. > > > > The limit is the property of the mem cgroup, so it's logical that the > policy when reaching that limit is a property of the same mem cgroup. > Using the user-controlled subtree example, if we have /david and /roman, > we can define our own policies on oom, we are not restricted to cgroup > aware selection on the entire hierarchy. /david/oom.policy can be > "process" so that I haven't regressed with earlier kernels, and > /roman/oom.policy can be "cgroup" to target the largest cgroup in your > subtree. > > Something needs to be oom killed when a mem cgroup at any level in the > hierarchy is reached and reclaim has failed. What to do when that limit > is reached is a property of that cgroup. > > > > Now, as users who rely on process selection are well aware, we have > > > oom_score_adj to influence the decision of which process to oom kill. If > > > our oom subtree is cgroup aware, we should have the ability to likewise > > > influence that decision. For example, we have high priority applications > > > that run at the top-level that use a lot of memory and strictly oom > > > killing them in all scenarios because they use a lot of memory isn't > > > appropriate. We need to be able to adjust the comparison of a cgroup (or > > > subtree) when compared to other cgroups. 
> > > > > > I've also suggested, but did not implement in my patchset because I was > > > trying to define the API and find common ground first, that we have a need > > > for priority based selection. In other words, define the priority of a > > > subtree regardless of cgroup usage. > > > > > > So with these four things, we have > > > > > > - an "oom.policy" tunable to define "cgroup" or "process" for that > > > subtree (and plans for "priority" in the future), > > > > > > - your "oom.evaluate_as_group" tunable to account the usage of the > > > subtree as the cgroup's own usage for comparison with others, > > > > > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > > > to protect important applications and bias against unimportant > > > applications. > > > > > > This adds several tunables, which I didn't like, so I tried to overload > > > oom.policy and oom.evaluate_as_group. When I referred to separating out > > > the subtree usage accounting into a separate tunable, that is what I have > > > referenced above. > > > > IMO, merging multiple tunables into one doesn't make it saner. > > The real question how to make a reasonable interface with fever tunables. > > > > The reason behind introducing all these knobs is to provide > > a generic solution to define OOM handling rules, but then the > > question raises if the kernel is the best place for it. > > > > I really doubt that an interface with so many knobs has any chances > > to be merged. > > > > This is why I attempted to overload oom.policy and oom.evaluate_as_group: > I could not think of a reasonable usecase where a subtree would be used to > account for cgroup usage but not use a cgroup aware policy itself. You've > objected to that, where memory.oom_policy == "tree" implied cgroup > awareness in my patchset, so I've separated that out. > > > IMO, there should be a compromise between the simplicity (basically, > > the number of tunables and possible values) and functionality > > of the interface. You nacked my previous version, and unfortunately > > I don't have anything better so far. > > > > If you do not agree with the overloading and have a preference for single > value tunables, then all three tunables are needed. This functionality > could be represented as two or one tunable if they are not single value, > but from the oom.group discussion you preferred single values. > > I assume you'd also object to adding and removing files based on > oom.policy since oom.evaluate_as_group and oom.adj is only needed for > oom.policy of "cgroup" or "priority", and they do not need to exist for > the default oom.policy of "process". >
On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote: > Roman, have you had time to go through this? Hm, I thought we've finished this part of discussion, no? Anyway, let me repeat my position: I don't like the interface you've proposed in that follow-up patchset, and I explained why. If you've a new proposal, please, rebase it to the current mm tree, and we can discuss it separately. Alternatively, we can discuss the interface first (without the implementation), but, please, make a new thread with a fresh description of a proposed interface. Thanks! > > > On Tue, 7 Aug 2018, David Rientjes wrote: > > > On Mon, 6 Aug 2018, Roman Gushchin wrote: > > > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > > that the usage of the entire subtree should be compared as a single > > > > entity with other cgroups. That is necessary for user subtrees but may > > > > not be necessary for top-level cgroups depending on how you structure your > > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > > and you are correct it can be different than oom.group. > > > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > > me to say :) > > > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > > not cgroup aware, for subtrees so that our users who have clear > > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > > this case? This basically means that if memcg is selected as target, > > > the process inside will be selected using traditional per-process approach. > > > > > > > No, that would overload the policy and mechanism. We want the ability to > > consider user-controlled subtrees as a single entity for comparison with > > other user subtrees to select which subtree to target. This does not > > imply that users want their entire subtree oom killed. > > > > > > So we need to define the policy for a subtree that is oom, and I suggest > > > > we do that as a characteristic of the cgroup that is oom ("process" vs > > > > "cgroup", and process would be the default to preserve what currently > > > > happens in a user subtree). > > > > > > I'm not entirely convinced here. > > > I do agree, that some sub-tree may have a well tuned oom_score_adj, > > > and it's preferable to keep the current behavior. > > > > > > At the same time I don't like the idea to look at the policy of the OOMing > > > cgroup. Why exceeding of one limit should be handled different to exceeding > > > of another? This seems to be a property of workload, not a limit. > > > > > > > The limit is the property of the mem cgroup, so it's logical that the > > policy when reaching that limit is a property of the same mem cgroup. > > Using the user-controlled subtree example, if we have /david and /roman, > > we can define our own policies on oom, we are not restricted to cgroup > > aware selection on the entire hierarchy. /david/oom.policy can be > > "process" so that I haven't regressed with earlier kernels, and > > /roman/oom.policy can be "cgroup" to target the largest cgroup in your > > subtree. > > > > Something needs to be oom killed when a mem cgroup at any level in the > > hierarchy is reached and reclaim has failed. What to do when that limit > > is reached is a property of that cgroup. 
> > > > > > Now, as users who rely on process selection are well aware, we have > > > > oom_score_adj to influence the decision of which process to oom kill. If > > > > our oom subtree is cgroup aware, we should have the ability to likewise > > > > influence that decision. For example, we have high priority applications > > > > that run at the top-level that use a lot of memory and strictly oom > > > > killing them in all scenarios because they use a lot of memory isn't > > > > appropriate. We need to be able to adjust the comparison of a cgroup (or > > > > subtree) when compared to other cgroups. > > > > > > > > I've also suggested, but did not implement in my patchset because I was > > > > trying to define the API and find common ground first, that we have a need > > > > for priority based selection. In other words, define the priority of a > > > > subtree regardless of cgroup usage. > > > > > > > > So with these four things, we have > > > > > > > > - an "oom.policy" tunable to define "cgroup" or "process" for that > > > > subtree (and plans for "priority" in the future), > > > > > > > > - your "oom.evaluate_as_group" tunable to account the usage of the > > > > subtree as the cgroup's own usage for comparison with others, > > > > > > > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > > > > to protect important applications and bias against unimportant > > > > applications. > > > > > > > > This adds several tunables, which I didn't like, so I tried to overload > > > > oom.policy and oom.evaluate_as_group. When I referred to separating out > > > > the subtree usage accounting into a separate tunable, that is what I have > > > > referenced above. > > > > > > IMO, merging multiple tunables into one doesn't make it saner. > > > The real question how to make a reasonable interface with fever tunables. > > > > > > The reason behind introducing all these knobs is to provide > > > a generic solution to define OOM handling rules, but then the > > > question raises if the kernel is the best place for it. > > > > > > I really doubt that an interface with so many knobs has any chances > > > to be merged. > > > > > > > This is why I attempted to overload oom.policy and oom.evaluate_as_group: > > I could not think of a reasonable usecase where a subtree would be used to > > account for cgroup usage but not use a cgroup aware policy itself. You've > > objected to that, where memory.oom_policy == "tree" implied cgroup > > awareness in my patchset, so I've separated that out. > > > > > IMO, there should be a compromise between the simplicity (basically, > > > the number of tunables and possible values) and functionality > > > of the interface. You nacked my previous version, and unfortunately > > > I don't have anything better so far. > > > > > > > If you do not agree with the overloading and have a preference for single > > value tunables, then all three tunables are needed. This functionality > > could be represented as two or one tunable if they are not single value, > > but from the oom.group discussion you preferred single values. > > > > I assume you'd also object to adding and removing files based on > > oom.policy since oom.evaluate_as_group and oom.adj is only needed for > > oom.policy of "cgroup" or "priority", and they do not need to exist for > > the default oom.policy of "process". > >