Message ID: 20180730180100.25079-1-guro@fb.com (mailing list archive)
Series: introduce memory.oom.group
On Mon, 30 Jul 2018, Roman Gushchin wrote: > This is a tiny implementation of cgroup-aware OOM killer, > which adds an ability to kill a cgroup as a single unit > and so guarantee the integrity of the workload. > > Although it has only a limited functionality in comparison > to what now resides in the mm tree (it doesn't change > the victim task selection algorithm, doesn't look > at memory stas on cgroup level, etc), it's also much > simpler and more straightforward. So, hopefully, we can > avoid having long debates here, as we had with the full > implementation. > > As it doesn't prevent any futher development, > and implements an useful and complete feature, > it looks as a sane way forward. > > This patchset is against Linus's tree to avoid conflicts > with the cgroup-aware OOM killer patchset in the mm tree. > > Two first patches are already in the mm tree. > The first one ("mm: introduce mem_cgroup_put() helper") > is totally fine, and the second's commit message has to be > changed to reflect that it's not a part of old patchset > anymore. > What's the plan with the cgroup aware oom killer? It has been sitting in the -mm tree for ages with no clear path to being merged. Are you suggesting this patchset as a preliminary series so the cgroup aware oom killer should be removed from the -mm tree and this should be merged instead? If so, what is the plan going forward for the cgroup aware oom killer? Are you planning on reviewing the patchset to fix the cgroup aware oom killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has been waiting for feedback since March?
On Mon, Jul 30, 2018 at 06:49:31PM -0700, David Rientjes wrote: > On Mon, 30 Jul 2018, Roman Gushchin wrote: > > > This is a tiny implementation of cgroup-aware OOM killer, > > which adds an ability to kill a cgroup as a single unit > > and so guarantee the integrity of the workload. > > > > Although it has only a limited functionality in comparison > > to what now resides in the mm tree (it doesn't change > > the victim task selection algorithm, doesn't look > > at memory stas on cgroup level, etc), it's also much > > simpler and more straightforward. So, hopefully, we can > > avoid having long debates here, as we had with the full > > implementation. > > > > As it doesn't prevent any futher development, > > and implements an useful and complete feature, > > it looks as a sane way forward. > > > > This patchset is against Linus's tree to avoid conflicts > > with the cgroup-aware OOM killer patchset in the mm tree. > > > > Two first patches are already in the mm tree. > > The first one ("mm: introduce mem_cgroup_put() helper") > > is totally fine, and the second's commit message has to be > > changed to reflect that it's not a part of old patchset > > anymore. > > What's the plan with the cgroup aware oom killer? It has been sitting in > the -mm tree for ages with no clear path to being merged. > > Are you suggesting this patchset as a preliminary series so the cgroup > aware oom killer should be removed from the -mm tree and this should be > merged instead? If so, what is the plan going forward for the cgroup > aware oom killer? > > Are you planning on reviewing the patchset to fix the cgroup aware oom > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > been waiting for feedback since March? I would say it's been waiting for feedback since March because every person that could give meaningful feedback on it, incl. the memcg and cgroup maintainers, is happy with Roman's current patches in -mm, is not convinced by the arguments you have made over months of discussions, and has serious reservations about the configurable OOM policies you propose as follow-up fix to Roman's patches. This pared-down version of the patches proposed here is to accomodate you and table discussions about policy for now. But your patchset is highly unlikely to gain any sort of traction in the future. Also I don't think there is any controversy here over what the way forward should be. Roman's patches should have been merged months ago.
On Mon, Jul 30, 2018 at 06:49:31PM -0700, David Rientjes wrote: > On Mon, 30 Jul 2018, Roman Gushchin wrote: > > > This is a tiny implementation of cgroup-aware OOM killer, > > which adds an ability to kill a cgroup as a single unit > > and so guarantee the integrity of the workload. > > > > Although it has only a limited functionality in comparison > > to what now resides in the mm tree (it doesn't change > > the victim task selection algorithm, doesn't look > > at memory stas on cgroup level, etc), it's also much > > simpler and more straightforward. So, hopefully, we can > > avoid having long debates here, as we had with the full > > implementation. > > > > As it doesn't prevent any futher development, > > and implements an useful and complete feature, > > it looks as a sane way forward. > > > > This patchset is against Linus's tree to avoid conflicts > > with the cgroup-aware OOM killer patchset in the mm tree. > > > > Two first patches are already in the mm tree. > > The first one ("mm: introduce mem_cgroup_put() helper") > > is totally fine, and the second's commit message has to be > > changed to reflect that it's not a part of old patchset > > anymore. > > > > What's the plan with the cgroup aware oom killer? It has been sitting in > the -mm tree for ages with no clear path to being merged. It's because your nack, isn't it? Everybody else seem to be fine with it. > > Are you suggesting this patchset as a preliminary series so the cgroup > aware oom killer should be removed from the -mm tree and this should be > merged instead? If so, what is the plan going forward for the cgroup > aware oom killer? Answered below. > > Are you planning on reviewing the patchset to fix the cgroup aware oom > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > been waiting for feedback since March? > I already did. As I said, I find the proposed oom_policy interface confusing. I'm not sure I understand why some memcg OOMs should be handled by memcg-aware OOMs, while other by the traditional per-process logic; and why this should be set on the OOMing memcg. IMO this adds nothing but confusion. If it's just a way to get rid of mount option, it doesn't look nice to me (neither I'm fan of the mount option). If you need an option to evaluate a cgroup as a whole, but kill only one task inside (the ability we've discussed before), let's make it clear. It's possible with the new memory.oom.group. Despite mentioning the lack of priority tuning in the list of problems, you do not propose anything. I agree it's hard, but why mentioning then? Patches which adjust root memory cgroup accounting and NUMA handling should be handled separately, they are really not about the interface. I've nothing against them. Again, I don't like the proposed interface, it doesn't feel clear. I think, this is the reason, why your patchset didn't collect any acks since March. I'm not blocking any progress here, it's not on me. Anyway, at this point I really think that this patch (memory.oom.group) is a reasonable way forward. It implements a useful and complete feature, doesn't block any further development and has a clean interface. So, you can build memory.oom.policy on top of it. Does this sound good?
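[As a concrete reference for the knob under discussion: a minimal user-space sketch of enabling memory.oom.group. This is not part of the patchset; it assumes cgroup2 is mounted at /sys/fs/cgroup, a kernel carrying this series, and an example cgroup named "workload" (the path is illustrative only).]

/*
 * Minimal sketch, not from the patchset: opt a cgroup into group OOM
 * kills by writing "1" to its memory.oom.group file.  Assumes cgroup2
 * is mounted at /sys/fs/cgroup and the "workload" cgroup exists.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/sys/fs/cgroup/workload/memory.oom.group";
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return EXIT_FAILURE;
    }

    /*
     * "1": if the OOM killer selects a task in this cgroup, the whole
     * cgroup is killed as a single unit, preserving workload integrity.
     * "0" (the default) keeps the traditional single-task kill.
     */
    if (fputs("1\n", f) == EOF)
        perror(path);

    fclose(f);
    return 0;
}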
On Tue, 31 Jul 2018, Roman Gushchin wrote: > > What's the plan with the cgroup aware oom killer? It has been sitting in > > the -mm tree for ages with no clear path to being merged. > > It's because your nack, isn't it? > Everybody else seem to be fine with it. > If they are fine with it, I'm not sure they have tested it :) Killing entire cgroups needlessly for mempolicy oom kills that will not free memory on target nodes is the first regression they may notice. It also unnecessarily uses oom_score_adj settings only when attached to the root mem cgroup. That may be fine in very specialized usecases but your bash shell being considered equal to a 96GB cgroup isn't very useful. These are all fixed in my follow-up patch series which you say you have reviewed later in this email. > > Are you planning on reviewing the patchset to fix the cgroup aware oom > > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > > been waiting for feedback since March? > > > > I already did. > As I said, I find the proposed oom_policy interface confusing. > I'm not sure I understand why some memcg OOMs should be handled > by memcg-aware OOMs, while other by the traditional per-process > logic; and why this should be set on the OOMing memcg. > IMO this adds nothing but confusion. > If your entire review was the email to a single patch, I misinterpreted that as the entire review not being done, sorry. I volunteered to separate out the logic to determine if a cgroup should be considered on its own (kill the largest cgroup on the system) or whether to consider subtree usage as well into its own tunable. I haven't received an answer, yet, but it's a trivial patch on top of my series if you prefer. Just let me know so we can make progress. > it doesn't look nice to me (neither I'm fan of the mount option). > If you need an option to evaluate a cgroup as a whole, but kill > only one task inside (the ability we've discussed before), > let's make it clear. It's possible with the new memory.oom.group. > The purpose is for subtrees delegated to users so that they can continue to expect the same process being oom killed, with oom_score_adj respected, even though the ancestor oom policy is for cgroup aware targeting. It is perfectly legitimate, and necessary, for a user who controls their own subtree to prefer killing of the single largest process as it has always been done. Secondary to that is their ability to influence the decision with oom_score_adj, which they lose without my patches. > Patches which adjust root memory cgroup accounting and NUMA > handling should be handled separately, they are really not > about the interface. I've nothing against them. > That's good to know, it would be helpful if you would ack the patches that you are not objecting to. Your feedback about the overloading of "cgroup" and "tree" is well received and I can easily separate that into a tunable as I said. I do not know of any user that would want to specify "tree" without having cgroup aware behavior, however. If you would prefer this, please let me know! > Anyway, at this point I really think that this patch (memory.oom.group) > is a reasonable way forward. It implements a useful and complete feature, > doesn't block any further development and has a clean interface. > So, you can build memory.oom.policy on top of it. > Does this sound good? > I have no objection to this series, of course. The functionality of group oom was unchanged in my series. 
I'd very much appreciate a review of my patchset, though, so the cgroup-aware policy can be merged as well.
On Wed, Aug 01, 2018 at 02:51:25PM -0700, David Rientjes wrote: > On Tue, 31 Jul 2018, Roman Gushchin wrote: > > > > What's the plan with the cgroup aware oom killer? It has been sitting in > > > the -mm tree for ages with no clear path to being merged. > > > > It's because your nack, isn't it? > > Everybody else seem to be fine with it. > > > > If they are fine with it, I'm not sure they have tested it :) Killing > entire cgroups needlessly for mempolicy oom kills that will not free > memory on target nodes is the first regression they may notice. It also > unnecessarily uses oom_score_adj settings only when attached to the root > mem cgroup. That may be fine in very specialized usecases but your bash > shell being considered equal to a 96GB cgroup isn't very useful. These > are all fixed in my follow-up patch series which you say you have reviewed > later in this email. > > > > Are you planning on reviewing the patchset to fix the cgroup aware oom > > > killer at https://marc.info/?l=linux-kernel&m=153152325411865 which has > > > been waiting for feedback since March? > > > > > > > I already did. > > As I said, I find the proposed oom_policy interface confusing. > > I'm not sure I understand why some memcg OOMs should be handled > > by memcg-aware OOMs, while other by the traditional per-process > > logic; and why this should be set on the OOMing memcg. > > IMO this adds nothing but confusion. > > > > If your entire review was the email to a single patch, I misinterpreted > that as the entire review not being done, sorry. I volunteered to > separate out the logic to determine if a cgroup should be considered on > its own (kill the largest cgroup on the system) or whether to consider > subtree usage as well into its own tunable. I haven't received an > answer, yet, but it's a trivial patch on top of my series if you prefer. > Just let me know so we can make progress. > > > it doesn't look nice to me (neither I'm fan of the mount option). > > If you need an option to evaluate a cgroup as a whole, but kill > > only one task inside (the ability we've discussed before), > > let's make it clear. It's possible with the new memory.oom.group. > > > > The purpose is for subtrees delegated to users so that they can continue > to expect the same process being oom killed, with oom_score_adj > respected, even though the ancestor oom policy is for cgroup aware > targeting. It is perfectly legitimate, and necessary, for a user who > controls their own subtree to prefer killing of the single largest process > as it has always been done. Secondary to that is their ability to > influence the decision with oom_score_adj, which they lose without my > patches. > > > Patches which adjust root memory cgroup accounting and NUMA > > handling should be handled separately, they are really not > > about the interface. I've nothing against them. > > > > That's good to know, it would be helpful if you would ack the patches that > you are not objecting to. Your feedback about the overloading of "cgroup" > and "tree" is well received and I can easily separate that into a tunable > as I said. I do not know of any user that would want to specify "tree" > without having cgroup aware behavior, however. If you would prefer this, > please let me know! > > > Anyway, at this point I really think that this patch (memory.oom.group) > > is a reasonable way forward. It implements a useful and complete feature, > > doesn't block any further development and has a clean interface. > > So, you can build memory.oom.policy on top of it. 
> > Does this sound good? > > > > I have no objection to this series, of course. The functionality of group oom was unchanged in my series. I'd very much appreciate a review of my patchset, though, so the cgroup-aware policy can be merged as well. >

Ok, I think that's what we'll do here:

1) drop the current cgroup-aware OOM killer implementation from the mm tree
2) land memory.oom.group to the mm tree (your ack will be appreciated)
3) discuss and, hopefully, agree on memory.oom.policy interface
4) land memory.oom.policy

Basically, with oom.group separated everything we need is another boolean knob, which means that the memcg should be evaluated together. Am I right? If so, the main problem to solve is how to handle the following case:

    A
   / \        B/memory.oom.evaluate_as_a_group* = 1
  B   C       C/memory.oom.evaluate_as_a_group* = 0
 / \
D   E

* I do not propose to use this name, just for example.

In this case you have to compare tasks in C with cgroup B. And this is what I'd like to avoid. Maybe it should be enforced on A's level? I don't think it should be linked to the OOMing group, as in your patchset.

I would really prefer to discuss the interface first, without going into code and implementation details. It's not because I do not appreciate your work, only because it's hard to think about the interface when there are two big patchsets on top of each other.

Thank you!
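[To make the objection above concrete: a small, self-contained user-space simulation, not kernel code, of what a naive implementation would do in that hierarchy. B, evaluated as a group, contributes one aggregate score, while C's tasks are scored individually, so unlike entities end up in the same comparison. All names and numbers are invented for the example.]

/*
 * Illustrative simulation only: cgroup B is scored as a whole subtree
 * while C's tasks are scored individually, so the victim comparison
 * mixes unlike entities.  Usage values are made-up MiB figures.
 */
#include <stdio.h>

struct candidate {
    const char *name;
    long usage_mib;        /* charged memory, in MiB */
    int is_whole_cgroup;   /* 1 = aggregated subtree, 0 = single task */
};

int main(void)
{
    struct candidate candidates[] = {
        { "cgroup B (B + D + E aggregated)", 3072, 1 },
        { "task 1 in C",                     1024, 0 },
        { "task 2 in C",                     2048, 0 },
    };
    int n = sizeof(candidates) / sizeof(candidates[0]);
    int victim = 0;

    for (int i = 1; i < n; i++)
        if (candidates[i].usage_mib > candidates[victim].usage_mib)
            victim = i;

    printf("selected victim: %s (%ld MiB, %s)\n",
           candidates[victim].name, candidates[victim].usage_mib,
           candidates[victim].is_whole_cgroup ?
           "an entire subtree" : "a single task");
    return 0;
}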
On Wed 01-08-18 14:51:25, David Rientjes wrote: > On Tue, 31 Jul 2018, Roman Gushchin wrote: > > > > What's the plan with the cgroup aware oom killer? It has been sitting in > > > the -mm tree for ages with no clear path to being merged. > > > > It's because your nack, isn't it? > > Everybody else seem to be fine with it. > > > > If they are fine with it, I'm not sure they have tested it :) Killing > entire cgroups needlessly for mempolicy oom kills that will not free > memory on target nodes is the first regression they may notice. I do not remember you would be mentioning this previously. Anyway the older implementation has considered the nodemask in memcg_oom_badness. You are right that a cpuset allocation could needlessly select a memcg with small or no memory from the target nodemask which is something I could have noticed during the review. If only I didn't have to spend all my energy to go through repetitive arguments of yours. Anyway this would be quite trivial to resolve in the same function by checking node_isset(node, current->mems_allowed). Thanks for your productive feedback again. Skipping the rest which is yet again repeating same arguments and it doesn't add anything new to the table.
On Wed, 1 Aug 2018, Roman Gushchin wrote: > Ok, I think that what we'll do here: > 1) drop the current cgroup-aware OOM killer implementation from the mm tree > 2) land memory.oom.group to the mm tree (your ack will be appreciated) > 3) discuss and, hopefully, agree on memory.oom.policy interface > 4) land memory.oom.policy > Yes, I'm fine proceeding this way, there's a clear separation between the policy and mechanism and they can be introduced independent of each other. As I said in my patchset, we can also introduce policies independent of each other and I have no objection to your design that addresses your specific usecase, with your own policy decisions, with the added caveat that we do so in a way that respects other usecases. Specifically, I would ask that the following be respected: - Subtrees delegated to users can still operate as they do today with per-process selection (largest, or influenced by oom_score_adj) so their victim selection is not changed out from under them. This requires the entire hierarchy is not locked into a specific policy, and also that a subtree is not locked in a specific policy. In other words, if an oom condition occurs in a user-controlled subtree they have the ability to get the same selection criteria as they do today. - Policies are implemented in a way that has an extensible API so that we do not unnecessarily limit or prohibit ourselves from making changes in the future or from extending the functionality by introducing other policy choices that are needed in the future. I hope that I'm not being unrealistic in assuming that you're fine with these since it can still preserve your goals. > Basically, with oom.group separated everything we need is another > boolean knob, which means that the memcg should be evaluated together. In a cgroup-aware oom killer world, yes, we need the ability to specify that the usage of the entire subtree should be compared as a single entity with other cgroups. That is necessary for user subtrees but may not be necessary for top-level cgroups depending on how you structure your unified cgroup hierarchy. So it needs to be configurable, as you suggest, and you are correct it can be different than oom.group. That's not the only thing we need though, as I'm sure you were expecting me to say :) We need the ability to preserve existing behavior, i.e. process based and not cgroup aware, for subtrees so that our users who have clear expectations and tune their oom_score_adj accordingly based on how the oom killer has always chosen processes for oom kill do not suddenly regress. So we need to define the policy for a subtree that is oom, and I suggest we do that as a characteristic of the cgroup that is oom ("process" vs "cgroup", and process would be the default to preserve what currently happens in a user subtree). Now, as users who rely on process selection are well aware, we have oom_score_adj to influence the decision of which process to oom kill. If our oom subtree is cgroup aware, we should have the ability to likewise influence that decision. For example, we have high priority applications that run at the top-level that use a lot of memory and strictly oom killing them in all scenarios because they use a lot of memory isn't appropriate. We need to be able to adjust the comparison of a cgroup (or subtree) when compared to other cgroups. I've also suggested, but did not implement in my patchset because I was trying to define the API and find common ground first, that we have a need for priority based selection. 
In other words, define the priority of a subtree regardless of cgroup usage.

So with these four things, we have

 - an "oom.policy" tunable to define "cgroup" or "process" for that subtree (and plans for "priority" in the future),

 - your "oom.evaluate_as_group" tunable to account the usage of the subtree as the cgroup's own usage for comparison with others,

 - an "oom.adj" to adjust the usage of the cgroup (local or subtree) to protect important applications and bias against unimportant applications.

This adds several tunables, which I didn't like, so I tried to overload oom.policy and oom.evaluate_as_group. When I referred to separating out the subtree usage accounting into a separate tunable, that is what I have referenced above.

So when a cgroup is oom, oom.policy defines the selection. The cgroup here could be root for when the system is oom. If "process", nothing else matters: we iterate and find the largest process (modulo oom_score_adj) and kill it. We then look at oom.group and determine if additional processes should be oom killed.

If "cgroup", we determine the local usage of each cgroup in the subtree. If oom.evaluate_as_group is enabled for a cgroup, we add the usage from each cgroup in the subtree to that cgroup. We then add oom.adj, which can be positive or negative, for the cgroup's overall score. Each cgroup then has a score that can be compared fairly to one another and the oom kill can occur. We then look at oom.group and determine if additional processes should be oom killed.

With plans for an oom.policy of "priority", I would define that priority in oom.adj. Here, oom.evaluate_as_group can still be useful, which is great. If smaller priorities mean higher preference for oom kill, we compare the oom.adj of all direct children and iterate the smallest. If oom.evaluate_as_group is set, the smallest oom.adj from the subtree is used.

This is how I envisioned the functionality of the cgroup aware oom killer when I wrote my patchset and would be happy to hear your input or suggestions on it.
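[None of the knobs described above (oom.policy, oom.evaluate_as_group, oom.adj) exist in the kernel; they are proposals under discussion in this thread. Purely as a reading aid, the selection flow sketched in the preceding paragraphs could be simulated in user space roughly as follows, with made-up sibling cgroups and usage numbers.]

/*
 * Reading aid only: a user-space simulation of the proposed selection
 * flow under oom.policy = "cgroup".  The oom.evaluate_as_group and
 * oom.adj knobs are proposals from this discussion, not an existing
 * kernel interface, and every value below is invented for the example.
 */
#include <stdio.h>

struct cgroup_candidate {
    const char *name;
    long local_mib;        /* MiB charged directly to this cgroup */
    long subtree_mib;      /* local usage plus all descendants */
    int evaluate_as_group; /* proposed oom.evaluate_as_group */
    long adj_mib;          /* proposed oom.adj bias, may be negative */
};

static long oom_score(const struct cgroup_candidate *c)
{
    long score = c->evaluate_as_group ? c->subtree_mib : c->local_mib;

    return score + c->adj_mib;   /* bias important/unimportant subtrees */
}

int main(void)
{
    /* Direct children of the OOMing cgroup. */
    struct cgroup_candidate children[] = {
        { "david", 2048,  6144, 0, 0 },
        { "roman", 1024, 24576, 1, -8192 },
    };
    int n = sizeof(children) / sizeof(children[0]);
    int victim = 0;

    for (int i = 1; i < n; i++)
        if (oom_score(&children[i]) > oom_score(&children[victim]))
            victim = i;

    /*
     * The chosen cgroup's own oom.group (the knob this series adds)
     * would then decide whether one task or the whole group is killed.
     */
    printf("victim cgroup: %s (score %ld MiB)\n",
           children[victim].name, oom_score(&children[victim]));
    return 0;
}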
On Mon, Aug 06, 2018 at 02:34:06PM -0700, David Rientjes wrote: > On Wed, 1 Aug 2018, Roman Gushchin wrote: > > > Ok, I think that what we'll do here: > > 1) drop the current cgroup-aware OOM killer implementation from the mm tree > > 2) land memory.oom.group to the mm tree (your ack will be appreciated) > > 3) discuss and, hopefully, agree on memory.oom.policy interface > > 4) land memory.oom.policy > > > > Yes, I'm fine proceeding this way, there's a clear separation between the > policy and mechanism and they can be introduced independent of each other. > As I said in my patchset, we can also introduce policies independent of > each other and I have no objection to your design that addresses your > specific usecase, with your own policy decisions, with the added caveat > that we do so in a way that respects other usecases. > > Specifically, I would ask that the following be respected: > > - Subtrees delegated to users can still operate as they do today with > per-process selection (largest, or influenced by oom_score_adj) so > their victim selection is not changed out from under them. This > requires the entire hierarchy is not locked into a specific policy, > and also that a subtree is not locked in a specific policy. In other > words, if an oom condition occurs in a user-controlled subtree they > have the ability to get the same selection criteria as they do today. > > - Policies are implemented in a way that has an extensible API so that > we do not unnecessarily limit or prohibit ourselves from making changes > in the future or from extending the functionality by introducing other > policy choices that are needed in the future. > > I hope that I'm not being unrealistic in assuming that you're fine with > these since it can still preserve your goals. > > > Basically, with oom.group separated everything we need is another > > boolean knob, which means that the memcg should be evaluated together. > > In a cgroup-aware oom killer world, yes, we need the ability to specify > that the usage of the entire subtree should be compared as a single > entity with other cgroups. That is necessary for user subtrees but may > not be necessary for top-level cgroups depending on how you structure your > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > and you are correct it can be different than oom.group. > > That's not the only thing we need though, as I'm sure you were expecting > me to say :) > > We need the ability to preserve existing behavior, i.e. process based and > not cgroup aware, for subtrees so that our users who have clear > expectations and tune their oom_score_adj accordingly based on how the oom > killer has always chosen processes for oom kill do not suddenly regress. Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing this case? This basically means that if memcg is selected as target, the process inside will be selected using traditional per-process approach. > So we need to define the policy for a subtree that is oom, and I suggest > we do that as a characteristic of the cgroup that is oom ("process" vs > "cgroup", and process would be the default to preserve what currently > happens in a user subtree). I'm not entirely convinced here. I do agree, that some sub-tree may have a well tuned oom_score_adj, and it's preferable to keep the current behavior. At the same time I don't like the idea to look at the policy of the OOMing cgroup. Why exceeding of one limit should be handled different to exceeding of another? 
This seems to be a property of workload, not a limit. > > Now, as users who rely on process selection are well aware, we have > oom_score_adj to influence the decision of which process to oom kill. If > our oom subtree is cgroup aware, we should have the ability to likewise > influence that decision. For example, we have high priority applications > that run at the top-level that use a lot of memory and strictly oom > killing them in all scenarios because they use a lot of memory isn't > appropriate. We need to be able to adjust the comparison of a cgroup (or > subtree) when compared to other cgroups. > > I've also suggested, but did not implement in my patchset because I was > trying to define the API and find common ground first, that we have a need > for priority based selection. In other words, define the priority of a > subtree regardless of cgroup usage. > > So with these four things, we have > > - an "oom.policy" tunable to define "cgroup" or "process" for that > subtree (and plans for "priority" in the future), > > - your "oom.evaluate_as_group" tunable to account the usage of the > subtree as the cgroup's own usage for comparison with others, > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > to protect important applications and bias against unimportant > applications. > > This adds several tunables, which I didn't like, so I tried to overload > oom.policy and oom.evaluate_as_group. When I referred to separating out > the subtree usage accounting into a separate tunable, that is what I have > referenced above. IMO, merging multiple tunables into one doesn't make it saner. The real question is how to make a reasonable interface with fewer tunables. The reason behind introducing all these knobs is to provide a generic solution to define OOM handling rules, but then the question arises whether the kernel is the best place for it. I really doubt that an interface with so many knobs has any chances to be merged. IMO, there should be a compromise between the simplicity (basically, the number of tunables and possible values) and functionality of the interface. You nacked my previous version, and unfortunately I don't have anything better so far. Thanks!
On Mon, 6 Aug 2018, Roman Gushchin wrote: > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > that the usage of the entire subtree should be compared as a single > > entity with other cgroups. That is necessary for user subtrees but may > > not be necessary for top-level cgroups depending on how you structure your > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > and you are correct it can be different than oom.group. > > > > That's not the only thing we need though, as I'm sure you were expecting > > me to say :) > > > > We need the ability to preserve existing behavior, i.e. process based and > > not cgroup aware, for subtrees so that our users who have clear > > expectations and tune their oom_score_adj accordingly based on how the oom > > killer has always chosen processes for oom kill do not suddenly regress. > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > this case? This basically means that if memcg is selected as target, > the process inside will be selected using traditional per-process approach. > No, that would overload the policy and mechanism. We want the ability to consider user-controlled subtrees as a single entity for comparison with other user subtrees to select which subtree to target. This does not imply that users want their entire subtree oom killed. > > So we need to define the policy for a subtree that is oom, and I suggest > > we do that as a characteristic of the cgroup that is oom ("process" vs > > "cgroup", and process would be the default to preserve what currently > > happens in a user subtree). > > I'm not entirely convinced here. > I do agree, that some sub-tree may have a well tuned oom_score_adj, > and it's preferable to keep the current behavior. > > At the same time I don't like the idea to look at the policy of the OOMing > cgroup. Why exceeding of one limit should be handled different to exceeding > of another? This seems to be a property of workload, not a limit. > The limit is the property of the mem cgroup, so it's logical that the policy when reaching that limit is a property of the same mem cgroup. Using the user-controlled subtree example, if we have /david and /roman, we can define our own policies on oom, we are not restricted to cgroup aware selection on the entire hierarchy. /david/oom.policy can be "process" so that I haven't regressed with earlier kernels, and /roman/oom.policy can be "cgroup" to target the largest cgroup in your subtree. Something needs to be oom killed when a mem cgroup at any level in the hierarchy is reached and reclaim has failed. What to do when that limit is reached is a property of that cgroup. > > Now, as users who rely on process selection are well aware, we have > > oom_score_adj to influence the decision of which process to oom kill. If > > our oom subtree is cgroup aware, we should have the ability to likewise > > influence that decision. For example, we have high priority applications > > that run at the top-level that use a lot of memory and strictly oom > > killing them in all scenarios because they use a lot of memory isn't > > appropriate. We need to be able to adjust the comparison of a cgroup (or > > subtree) when compared to other cgroups. > > > > I've also suggested, but did not implement in my patchset because I was > > trying to define the API and find common ground first, that we have a need > > for priority based selection. In other words, define the priority of a > > subtree regardless of cgroup usage. 
> > > > So with these four things, we have > > > > - an "oom.policy" tunable to define "cgroup" or "process" for that > > subtree (and plans for "priority" in the future), > > > > - your "oom.evaluate_as_group" tunable to account the usage of the > > subtree as the cgroup's own usage for comparison with others, > > > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > > to protect important applications and bias against unimportant > > applications. > > > > This adds several tunables, which I didn't like, so I tried to overload > > oom.policy and oom.evaluate_as_group. When I referred to separating out > > the subtree usage accounting into a separate tunable, that is what I have > > referenced above. > > IMO, merging multiple tunables into one doesn't make it saner. > The real question how to make a reasonable interface with fever tunables. > > The reason behind introducing all these knobs is to provide > a generic solution to define OOM handling rules, but then the > question raises if the kernel is the best place for it. > > I really doubt that an interface with so many knobs has any chances > to be merged. > This is why I attempted to overload oom.policy and oom.evaluate_as_group: I could not think of a reasonable usecase where a subtree would be used to account for cgroup usage but not use a cgroup aware policy itself. You've objected to that, where memory.oom_policy == "tree" implied cgroup awareness in my patchset, so I've separated that out. > IMO, there should be a compromise between the simplicity (basically, > the number of tunables and possible values) and functionality > of the interface. You nacked my previous version, and unfortunately > I don't have anything better so far. > If you do not agree with the overloading and have a preference for single value tunables, then all three tunables are needed. This functionality could be represented as two or one tunable if they are not single value, but from the oom.group discussion you preferred single values. I assume you'd also object to adding and removing files based on oom.policy since oom.evaluate_as_group and oom.adj is only needed for oom.policy of "cgroup" or "priority", and they do not need to exist for the default oom.policy of "process".
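[Again purely illustrative: memory.oom.policy was proposed in this thread but never merged, so the file below does not exist in any released kernel, and the /david and /roman cgroup paths come straight from the example above. A sketch of how the per-subtree policies being argued for might be configured, were such an interface ever adopted.]

/*
 * Hypothetical interface: memory.oom.policy is a proposed file that was
 * never merged.  On current kernels this program simply reports failure.
 */
#include <stdio.h>

static int set_policy(const char *cgroup, const char *policy)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/%s/memory.oom.policy", cgroup);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", policy);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Delegated subtree keeps the traditional per-process victim selection. */
    set_policy("david", "process");
    /* Delegated subtree opts into cgroup-aware victim selection. */
    set_policy("roman", "cgroup");
    return 0;
}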
On Tue 07-08-18 15:34:58, David Rientjes wrote: > On Mon, 6 Aug 2018, Roman Gushchin wrote: > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > that the usage of the entire subtree should be compared as a single > > > entity with other cgroups. That is necessary for user subtrees but may > > > not be necessary for top-level cgroups depending on how you structure your > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > and you are correct it can be different than oom.group. > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > me to say :) > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > not cgroup aware, for subtrees so that our users who have clear > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > this case? This basically means that if memcg is selected as target, > > the process inside will be selected using traditional per-process approach. > > > > No, that would overload the policy and mechanism. We want the ability to > consider user-controlled subtrees as a single entity for comparison with > other user subtrees to select which subtree to target. This does not > imply that users want their entire subtree oom killed. Yeah, that's why oom.group == 0, no? Anyway, can we separate this discussion from the current series please? We are getting more and more tangent. Or do you still see the current state to be not mergeable?
On Wed, 8 Aug 2018, Michal Hocko wrote: > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > > that the usage of the entire subtree should be compared as a single > > > > entity with other cgroups. That is necessary for user subtrees but may > > > > not be necessary for top-level cgroups depending on how you structure your > > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > > and you are correct it can be different than oom.group. > > > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > > me to say :) > > > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > > not cgroup aware, for subtrees so that our users who have clear > > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > > this case? This basically means that if memcg is selected as target, > > > the process inside will be selected using traditional per-process approach. > > > > > > > No, that would overload the policy and mechanism. We want the ability to > > consider user-controlled subtrees as a single entity for comparison with > > other user subtrees to select which subtree to target. This does not > > imply that users want their entire subtree oom killed. > > Yeah, that's why oom.group == 0, no? > > Anyway, can we separate this discussion from the current series please? > We are getting more and more tangent. > > Or do you still see the current state to be not mergeable? I've said three times in this series that I am fine with it. Roman and I are discussing the API for making forward progress with the cgroup aware oom killer itself. When he responds, he can change the subject line if that would be helpful to you.
On Thu 09-08-18 13:10:10, David Rientjes wrote: > On Wed, 8 Aug 2018, Michal Hocko wrote: > > > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > > > that the usage of the entire subtree should be compared as a single > > > > > entity with other cgroups. That is necessary for user subtrees but may > > > > > not be necessary for top-level cgroups depending on how you structure your > > > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > > > and you are correct it can be different than oom.group. > > > > > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > > > me to say :) > > > > > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > > > not cgroup aware, for subtrees so that our users who have clear > > > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > > > this case? This basically means that if memcg is selected as target, > > > > the process inside will be selected using traditional per-process approach. > > > > > > > > > > No, that would overload the policy and mechanism. We want the ability to > > > consider user-controlled subtrees as a single entity for comparison with > > > other user subtrees to select which subtree to target. This does not > > > imply that users want their entire subtree oom killed. > > > > Yeah, that's why oom.group == 0, no? > > > > Anyway, can we separate this discussion from the current series please? > > We are getting more and more tangent. > > > > Or do you still see the current state to be not mergeable? > > I've said three times in this series that I am fine with it. OK, that wasn't really clear to me because I haven't see any explicit ack from you (well except for the trivial helper patch). So I was not sure. > Roman and I > are discussing the API for making forward progress with the cgroup aware > oom killer itself. When he responds, he can change the subject line if > that would be helpful to you. I do not insist of course but it would be easier to follow if that discussion was separate.
Roman, have you had time to go through this? On Tue, 7 Aug 2018, David Rientjes wrote: > On Mon, 6 Aug 2018, Roman Gushchin wrote: > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > that the usage of the entire subtree should be compared as a single > > > entity with other cgroups. That is necessary for user subtrees but may > > > not be necessary for top-level cgroups depending on how you structure your > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > and you are correct it can be different than oom.group. > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > me to say :) > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > not cgroup aware, for subtrees so that our users who have clear > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > this case? This basically means that if memcg is selected as target, > > the process inside will be selected using traditional per-process approach. > > > > No, that would overload the policy and mechanism. We want the ability to > consider user-controlled subtrees as a single entity for comparison with > other user subtrees to select which subtree to target. This does not > imply that users want their entire subtree oom killed. > > > > So we need to define the policy for a subtree that is oom, and I suggest > > > we do that as a characteristic of the cgroup that is oom ("process" vs > > > "cgroup", and process would be the default to preserve what currently > > > happens in a user subtree). > > > > I'm not entirely convinced here. > > I do agree, that some sub-tree may have a well tuned oom_score_adj, > > and it's preferable to keep the current behavior. > > > > At the same time I don't like the idea to look at the policy of the OOMing > > cgroup. Why exceeding of one limit should be handled different to exceeding > > of another? This seems to be a property of workload, not a limit. > > > > The limit is the property of the mem cgroup, so it's logical that the > policy when reaching that limit is a property of the same mem cgroup. > Using the user-controlled subtree example, if we have /david and /roman, > we can define our own policies on oom, we are not restricted to cgroup > aware selection on the entire hierarchy. /david/oom.policy can be > "process" so that I haven't regressed with earlier kernels, and > /roman/oom.policy can be "cgroup" to target the largest cgroup in your > subtree. > > Something needs to be oom killed when a mem cgroup at any level in the > hierarchy is reached and reclaim has failed. What to do when that limit > is reached is a property of that cgroup. > > > > Now, as users who rely on process selection are well aware, we have > > > oom_score_adj to influence the decision of which process to oom kill. If > > > our oom subtree is cgroup aware, we should have the ability to likewise > > > influence that decision. For example, we have high priority applications > > > that run at the top-level that use a lot of memory and strictly oom > > > killing them in all scenarios because they use a lot of memory isn't > > > appropriate. We need to be able to adjust the comparison of a cgroup (or > > > subtree) when compared to other cgroups. 
> > > > > > I've also suggested, but did not implement in my patchset because I was > > > trying to define the API and find common ground first, that we have a need > > > for priority based selection. In other words, define the priority of a > > > subtree regardless of cgroup usage. > > > > > > So with these four things, we have > > > > > > - an "oom.policy" tunable to define "cgroup" or "process" for that > > > subtree (and plans for "priority" in the future), > > > > > > - your "oom.evaluate_as_group" tunable to account the usage of the > > > subtree as the cgroup's own usage for comparison with others, > > > > > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > > > to protect important applications and bias against unimportant > > > applications. > > > > > > This adds several tunables, which I didn't like, so I tried to overload > > > oom.policy and oom.evaluate_as_group. When I referred to separating out > > > the subtree usage accounting into a separate tunable, that is what I have > > > referenced above. > > > > IMO, merging multiple tunables into one doesn't make it saner. > > The real question how to make a reasonable interface with fever tunables. > > > > The reason behind introducing all these knobs is to provide > > a generic solution to define OOM handling rules, but then the > > question raises if the kernel is the best place for it. > > > > I really doubt that an interface with so many knobs has any chances > > to be merged. > > > > This is why I attempted to overload oom.policy and oom.evaluate_as_group: > I could not think of a reasonable usecase where a subtree would be used to > account for cgroup usage but not use a cgroup aware policy itself. You've > objected to that, where memory.oom_policy == "tree" implied cgroup > awareness in my patchset, so I've separated that out. > > > IMO, there should be a compromise between the simplicity (basically, > > the number of tunables and possible values) and functionality > > of the interface. You nacked my previous version, and unfortunately > > I don't have anything better so far. > > > > If you do not agree with the overloading and have a preference for single > value tunables, then all three tunables are needed. This functionality > could be represented as two or one tunable if they are not single value, > but from the oom.group discussion you preferred single values. > > I assume you'd also object to adding and removing files based on > oom.policy since oom.evaluate_as_group and oom.adj is only needed for > oom.policy of "cgroup" or "priority", and they do not need to exist for > the default oom.policy of "process". >
On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote: > Roman, have you had time to go through this? Hm, I thought we've finished this part of discussion, no? Anyway, let me repeat my position: I don't like the interface you've proposed in that follow-up patchset, and I explained why. If you've a new proposal, please, rebase it to the current mm tree, and we can discuss it separately. Alternatively, we can discuss the interface first (without the implementation), but, please, make a new thread with a fresh description of a proposed interface. Thanks! > > > On Tue, 7 Aug 2018, David Rientjes wrote: > > > On Mon, 6 Aug 2018, Roman Gushchin wrote: > > > > > > In a cgroup-aware oom killer world, yes, we need the ability to specify > > > > that the usage of the entire subtree should be compared as a single > > > > entity with other cgroups. That is necessary for user subtrees but may > > > > not be necessary for top-level cgroups depending on how you structure your > > > > unified cgroup hierarchy. So it needs to be configurable, as you suggest, > > > > and you are correct it can be different than oom.group. > > > > > > > > That's not the only thing we need though, as I'm sure you were expecting > > > > me to say :) > > > > > > > > We need the ability to preserve existing behavior, i.e. process based and > > > > not cgroup aware, for subtrees so that our users who have clear > > > > expectations and tune their oom_score_adj accordingly based on how the oom > > > > killer has always chosen processes for oom kill do not suddenly regress. > > > > > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing > > > this case? This basically means that if memcg is selected as target, > > > the process inside will be selected using traditional per-process approach. > > > > > > > No, that would overload the policy and mechanism. We want the ability to > > consider user-controlled subtrees as a single entity for comparison with > > other user subtrees to select which subtree to target. This does not > > imply that users want their entire subtree oom killed. > > > > > > So we need to define the policy for a subtree that is oom, and I suggest > > > > we do that as a characteristic of the cgroup that is oom ("process" vs > > > > "cgroup", and process would be the default to preserve what currently > > > > happens in a user subtree). > > > > > > I'm not entirely convinced here. > > > I do agree, that some sub-tree may have a well tuned oom_score_adj, > > > and it's preferable to keep the current behavior. > > > > > > At the same time I don't like the idea to look at the policy of the OOMing > > > cgroup. Why exceeding of one limit should be handled different to exceeding > > > of another? This seems to be a property of workload, not a limit. > > > > > > > The limit is the property of the mem cgroup, so it's logical that the > > policy when reaching that limit is a property of the same mem cgroup. > > Using the user-controlled subtree example, if we have /david and /roman, > > we can define our own policies on oom, we are not restricted to cgroup > > aware selection on the entire hierarchy. /david/oom.policy can be > > "process" so that I haven't regressed with earlier kernels, and > > /roman/oom.policy can be "cgroup" to target the largest cgroup in your > > subtree. > > > > Something needs to be oom killed when a mem cgroup at any level in the > > hierarchy is reached and reclaim has failed. What to do when that limit > > is reached is a property of that cgroup. 
> > > > > > Now, as users who rely on process selection are well aware, we have > > > > oom_score_adj to influence the decision of which process to oom kill. If > > > > our oom subtree is cgroup aware, we should have the ability to likewise > > > > influence that decision. For example, we have high priority applications > > > > that run at the top-level that use a lot of memory and strictly oom > > > > killing them in all scenarios because they use a lot of memory isn't > > > > appropriate. We need to be able to adjust the comparison of a cgroup (or > > > > subtree) when compared to other cgroups. > > > > > > > > I've also suggested, but did not implement in my patchset because I was > > > > trying to define the API and find common ground first, that we have a need > > > > for priority based selection. In other words, define the priority of a > > > > subtree regardless of cgroup usage. > > > > > > > > So with these four things, we have > > > > > > > > - an "oom.policy" tunable to define "cgroup" or "process" for that > > > > subtree (and plans for "priority" in the future), > > > > > > > > - your "oom.evaluate_as_group" tunable to account the usage of the > > > > subtree as the cgroup's own usage for comparison with others, > > > > > > > > - an "oom.adj" to adjust the usage of the cgroup (local or subtree) > > > > to protect important applications and bias against unimportant > > > > applications. > > > > > > > > This adds several tunables, which I didn't like, so I tried to overload > > > > oom.policy and oom.evaluate_as_group. When I referred to separating out > > > > the subtree usage accounting into a separate tunable, that is what I have > > > > referenced above. > > > > > > IMO, merging multiple tunables into one doesn't make it saner. > > > The real question how to make a reasonable interface with fever tunables. > > > > > > The reason behind introducing all these knobs is to provide > > > a generic solution to define OOM handling rules, but then the > > > question raises if the kernel is the best place for it. > > > > > > I really doubt that an interface with so many knobs has any chances > > > to be merged. > > > > > > > This is why I attempted to overload oom.policy and oom.evaluate_as_group: > > I could not think of a reasonable usecase where a subtree would be used to > > account for cgroup usage but not use a cgroup aware policy itself. You've > > objected to that, where memory.oom_policy == "tree" implied cgroup > > awareness in my patchset, so I've separated that out. > > > > > IMO, there should be a compromise between the simplicity (basically, > > > the number of tunables and possible values) and functionality > > > of the interface. You nacked my previous version, and unfortunately > > > I don't have anything better so far. > > > > > > > If you do not agree with the overloading and have a preference for single > > value tunables, then all three tunables are needed. This functionality > > could be represented as two or one tunable if they are not single value, > > but from the oom.group discussion you preferred single values. > > > > I assume you'd also object to adding and removing files based on > > oom.policy since oom.evaluate_as_group and oom.adj is only needed for > > oom.policy of "cgroup" or "priority", and they do not need to exist for > > the default oom.policy of "process". > >