Message ID | alpine.DEB.2.21.2003101454580.142656@chino.kir.corp.google.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm, oom: make a last minute check to prevent unnecessary memcg oom kills |
On Tue 10-03-20 14:55:50, David Rientjes wrote:
> Killing a user process as a result of hitting memcg limits is a serious
> decision that is unfortunately needed only when no forward progress in
> reclaiming memory can be made.
>
> Deciding the appropriate oom victim can take a sufficient amount of time
> that allows another process that is exiting to actually uncharge to the
> same memcg hierarchy and prevent unnecessarily killing user processes.
>
> An example is to prevent *multiple* unnecessary oom kills on a system
> with two cores where the oom kill occurs when there is an abundance of
> free memory available:
>
> Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
> <immediately after>
> repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
> Call Trace:
>  dump_stack+0x78/0xb6
>  dump_header+0x55/0x240
>  oom_kill_process+0xc5/0x170
>  out_of_memory+0x305/0x4a0
>  try_charge+0x77b/0xac0
>  mem_cgroup_try_charge+0x10a/0x220
>  mem_cgroup_try_charge_delay+0x1e/0x40
>  handle_mm_fault+0xdf2/0x15f0
>  do_user_addr_fault+0x21f/0x420
>  async_page_fault+0x2f/0x40
> memory: usage 61336kB, limit 102400kB, failcnt 74
>
> Notice the second memcg oom kill shows usage is >40MB below its limit of
> 100MB but a process is still unnecessarily killed because the decision has
> already been made to oom kill by calling out_of_memory() before the
> initial victim had a chance to uncharge its memory.

Could you be more specific about the actual workload please?

> Make a last minute check to determine if an oom kill is really needed to
> prevent unnecessary oom killing.

I really see no reason why the memcg oom should behave differently from
the global case. In both cases there will be a point of no return.
Wherever it is done it will be racy, and the oom victim selection will
play the role of the race window. There is simply no way around that
without making the whole thing completely synchronous. This all looks
like a micro-optimization, and I would really like to see a relevant
real-world use case presented before new special casing is added.
On Tue, 10 Mar 2020, Michal Hocko wrote:

> On Tue 10-03-20 14:55:50, David Rientjes wrote:
> > Notice the second memcg oom kill shows usage is >40MB below its limit of
> > 100MB but a process is still unnecessarily killed because the decision has
> > already been made to oom kill by calling out_of_memory() before the
> > initial victim had a chance to uncharge its memory.
>
> Could you be more specific about the actual workload please?

Robert, could you elaborate on the user-visible effects of this issue that
caused it to initially get reported?

> > Make a last minute check to determine if an oom kill is really needed to
> > prevent unnecessary oom killing.
>
> I really see no reason why the memcg oom should behave differently from
> the global case. In both cases there will be a point of no return.
> Wherever it is done it will be racy, and the oom victim selection will
> play the role of the race window. There is simply no way around that
> without making the whole thing completely synchronous. This all looks
> like a micro-optimization, and I would really like to see a relevant
> real-world use case presented before new special casing is added.

The patch certainly prevents unnecessary oom kills when there is a pending
victim that uncharges its memory between invoking the oom killer and
finding MMF_OOM_SKIP in the list of eligible tasks, and it's much more
common on systems with limited cpu cores.

Adding support for the global case is more difficult because we rely on
multiple heuristics, not only watermarks, to determine whether we can
allocate. It would likely require reusing a lot of the logic from the page
allocator (alloc flags, watermark checks, mempolicy awareness, cpusets) to
make it work reliably and not miss a corner case where we don't actually
end up oom killing anything at all. Memcg charging, on the other hand, is
much simpler, as this patch shows: it is only about the number of pages to
charge, and avoiding unnecessarily oom killing a user process is of
paramount importance to that user.
Basically: it becomes very difficult for a cloud provider to say "look,
your process was oom killed and it shows usage is 60MB in a memcg limited
to 100MB" :)
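For illustration, the "margin" being discussed here is simply the gap between a memcg's current usage and its limit, expressed in pages. The following is only a minimal userspace sketch of that arithmetic, not the kernel's mem_cgroup_margin(); it assumes a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory and takes a hypothetical cgroup path as argv[1]:

/*
 * Print usage, limit and the remaining "margin" (in pages) of a v1 memcg.
 * Sketch only: paths and the argv[1] cgroup name are assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static unsigned long long read_ull(const char *cg, const char *file)
{
	char path[512];
	unsigned long long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/memory/%s/%s", cg, file);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		exit(1);
	}
	if (fscanf(f, "%llu", &val) != 1)
		exit(1);
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	const char *cg = argc > 1 ? argv[1] : "";
	unsigned long long usage = read_ull(cg, "memory.usage_in_bytes");
	unsigned long long limit = read_ull(cg, "memory.limit_in_bytes");
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long long margin_pages =
		usage < limit ? (limit - usage) / page_size : 0;

	printf("usage=%llukB limit=%llukB margin=%llu pages\n",
	       usage / 1024, limit / 1024, margin_pages);
	return 0;
}

In the quoted log above, usage of 61336kB against a 102400kB limit corresponds to a margin of roughly 10,000 pages, while an order-0 charge needs only 1 << 0 = 1 page, which is exactly the situation where the patch skips the kill.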
On Tue 10-03-20 15:54:44, David Rientjes wrote:
> On Tue, 10 Mar 2020, Michal Hocko wrote:
> > Could you be more specific about the actual workload please?
>
> Robert, could you elaborate on the user-visible effects of this issue that
> caused it to initially get reported?

Yes please, real life use cases are important when adding hacks like this
one, and we should have clear data to support that the check actually
helps (in how many instances, etc.).
On 2020/03/11 7:54, David Rientjes wrote:
> The patch certainly prevents unnecessary oom kills when there is a pending
> victim that uncharges its memory between invoking the oom killer and
> finding MMF_OOM_SKIP in the list of eligible tasks, and it's much more
> common on systems with limited cpu cores.

I think that it is dump_header() which currently spends much time (due to
synchronous printing), enough to make "the second memcg oom kill shows usage
is >40MB below its limit of 100MB" happen. Shouldn't we call dump_header()
and then do the last check, and end with a "but did not kill anybody" message?
On Wed, 11 Mar 2020, Tetsuo Handa wrote:

> > The patch certainly prevents unnecessary oom kills when there is a pending
> > victim that uncharges its memory between invoking the oom killer and
> > finding MMF_OOM_SKIP in the list of eligible tasks, and it's much more
> > common on systems with limited cpu cores.
>
> I think that it is dump_header() which currently spends much time (due to
> synchronous printing), enough to make "the second memcg oom kill shows usage
> is >40MB below its limit of 100MB" happen. Shouldn't we call dump_header()
> and then do the last check, and end with a "but did not kill anybody" message?

Lol, I actually did that for internal testing as well :)  I didn't like how
it spammed the kernel log and then basically said "just kidding, nothing was
oom killed."  But if you think this would be helpful I can propose it as v2.
---
 include/linux/memcontrol.h |  7 +++++++
 mm/memcontrol.c            |  2 +-
 mm/oom_kill.c              | 14 +++++++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -445,6 +445,8 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
@@ -945,6 +947,11 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1286,7 +1286,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
 	unsigned long margin = 0;
 	unsigned long count;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -934,7 +934,6 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-#undef K
 
 /*
  * Kill provided task unless it's secured by setting
@@ -982,6 +981,18 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
 
+	/* One last check: do we *really* need to kill? */
+	if (is_memcg_oom(oc)) {
+		unsigned long margin = mem_cgroup_margin(oc->memcg);
+
+		if (unlikely(margin >= (1 << oc->order))) {
+			put_task_struct(victim);
+			pr_info("Suppressed oom kill, %lukB of memory can be charged\n",
+				K(margin));
+			return;
+		}
+	}
+
 	__oom_kill_process(victim, message);
 
 	/*
@@ -994,6 +1005,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 		mem_cgroup_put(oom_group);
 	}
 }
+#undef K
 
 /*
  * Determines whether the kernel must panic because of the panic_on_oom sysctl.
On Wed 11-03-20 09:39:01, Michal Hocko wrote:
> On Tue 10-03-20 15:54:44, David Rientjes wrote:
> > Robert, could you elaborate on the user-visible effects of this issue that
> > caused it to initially get reported?
>
> Yes please, real life use cases are important when adding hacks like this
> one, and we should have clear data to support that the check actually
> helps (in how many instances, etc.).

Friendly ping.
On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@google.com> wrote:
>
> Robert, could you elaborate on the user-visible effects of this issue that
> caused it to initially get reported?

Ami (now cc'ed) knows more, but here is my understanding. The use case
involves a Docker container running multiple processes. The container has a
memory limit set. The container contains two long-lived, important processes
p1 and p2, and some arbitrary, dynamic number of usually ephemeral processes
p3,...,pn. These processes are structured in a hierarchy that looks like
p1->p2->[p3,...,pn]; p1 is the parent of p2, and p2 is the parent of all of
the ephemeral processes p3,...,pn.

Since p1 and p2 are long-lived and important, the user does not want p1 and
p2 to be oom-killed. However, p3,...,pn are expected to use a lot of memory,
and it's ok for those processes to be oom-killed.

If the user sets oom_score_adj on p1 and p2 to make them very unlikely to be
oom-killed, p3,...,pn will inherit the oom_score_adj value, which is bad.
Additionally, setting oom_score_adj on p3,...,pn is tricky, since processes
in the Docker container (specifically p1 and p2) don't have permission to set
oom_score_adj on p3,...,pn. The ephemeral nature of p3,...,pn also makes
setting oom_score_adj on them tricky after they launch.

So, the user hopes that when one of p3,...,pn triggers an oom condition in
the Docker container, the oom killer will almost always kill processes from
p3,...,pn (and not kill p1 or p2, which are both important and unlikely to
trigger an oom condition). The issue of more processes being killed than are
strictly necessary results in p1 or p2 being killed much more frequently when
one of p3,...,pn triggers an oom condition, and p1 or p2 being killed is very
disruptive for the user (my understanding is that p1 or p2 going down with
high frequency results in significant unhealthiness in the user's service).

The change proposed here has not been run in a production system, so I don't
think anyone has data that conclusively demonstrates that this change will
solve the user's problem. But, from observations made in their production
system, the user is confident that addressing this aggressive oom killing
will solve their problem, and we have data showing that this change
considerably reduces the frequency of aggressive oom killing (from 61/100 oom
killing events down to 0/100 with this change).

Hope this gives a bit more context.

Thanks,
-Robert
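To make the inheritance point concrete: oom_score_adj is copied across fork(), so a value used to shield p1/p2 is automatically carried into every child they spawn unless something resets it afterwards. A small, self-contained sketch of that behavior (the -500 value is an arbitrary example, and writing a negative value requires CAP_SYS_RESOURCE):

/*
 * Demonstrate oom_score_adj inheritance: the parent lowers its own
 * value, and the forked child reads the same value without ever
 * touching /proc itself.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static void write_self_adj(const char *val)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f || fputs(val, f) == EOF)
		perror("oom_score_adj");
	if (f)
		fclose(f);
}

static void print_self_adj(const char *who)
{
	char buf[16] = "";
	FILE *f = fopen("/proc/self/oom_score_adj", "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s (pid %d) oom_score_adj: %s", who, getpid(), buf);
	if (f)
		fclose(f);
}

int main(void)
{
	write_self_adj("-500");		/* protect the "parent" (p1/p2 role) */
	print_self_adj("parent");
	fflush(stdout);

	if (fork() == 0) {
		/* the child inherits -500, like the ephemeral p3,...,pn */
		print_self_adj("child");
		_exit(0);
	}
	wait(NULL);
	return 0;
}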
On Tue, Mar 17, 2020 at 11:26 AM Robert Kolchmeyer <rkolchmeyer@google.com> wrote:
>
> On Tue, Mar 10, 2020 at 3:54 PM David Rientjes <rientjes@google.com> wrote:
> >
> > Robert, could you elaborate on the user-visible effects of this issue that
> > caused it to initially get reported?
>
> Ami (now cc'ed) knows more, but here is my understanding.

Robert's description of the mechanics we observed is accurate.

We discovered this regression in the oom-killer's behavior when attempting to
upgrade our system. The fraction of the system that went unhealthy due to
this issue was approximately equal to the _sum_ of all other causes of
unhealth, which are many and varied, but each of which contributes only a
small amount of unhealth. This issue forced a rollback to the previous
kernel, where we ~never see this behavior, returning our unhealth levels to
the previous background levels.

Cheers,
-a
On Tue 17-03-20 11:25:52, Robert Kolchmeyer wrote:
> Ami (now cc'ed) knows more, but here is my understanding. The use case
> involves a Docker container running multiple processes. The container has a
> memory limit set. The container contains two long-lived, important processes
> p1 and p2, and some arbitrary, dynamic number of usually ephemeral processes
> p3,...,pn.
> [...]
> If the user sets oom_score_adj on p1 and p2 to make them very unlikely to be
> oom-killed, p3,...,pn will inherit the oom_score_adj value, which is bad.
> Additionally, setting oom_score_adj on p3,...,pn is tricky, since processes
> in the Docker container (specifically p1 and p2) don't have permission to set
> oom_score_adj on p3,...,pn. The ephemeral nature of p3,...,pn also makes
> setting oom_score_adj on them tricky after they launch.

Thanks for the clarification.

> So, the user hopes that when one of p3,...,pn triggers an oom condition in
> the Docker container, the oom killer will almost always kill processes from
> p3,...,pn (and not kill p1 or p2, which are both important and unlikely to
> trigger an oom condition). The issue of more processes being killed than are
> strictly necessary results in p1 or p2 being killed much more frequently when
> one of p3,...,pn triggers an oom condition, and p1 or p2 being killed is very
> disruptive for the user (my understanding is that p1 or p2 going down with
> high frequency results in significant unhealthiness in the user's service).

Do you have any logs showing this condition? I am interested because, from
your description, it seems like p1/p2 shouldn't usually be the ones that
trigger the oom, right? That suggests it should mostly be p3,...,pn that are
in the kernel triggering the oom, and therefore they shouldn't vanish.
On Tue 17-03-20 12:00:45, Ami Fischman wrote:
> Robert's description of the mechanics we observed is accurate.
>
> We discovered this regression in the oom-killer's behavior when attempting to
> upgrade our system. The fraction of the system that went unhealthy due to
> this issue was approximately equal to the _sum_ of all other causes of
> unhealth, which are many and varied, but each of which contributes only a
> small amount of unhealth. This issue forced a rollback to the previous
> kernel, where we ~never see this behavior, returning our unhealth levels to
> the previous background levels.

Could you be more specific on the good vs. bad kernel versions? Because I do
not remember any oom changes that would affect the time-to-check vs.
time-to-kill race. The timing might be slightly different in each kernel
version, of course.
On Wed, Mar 18, 2020 at 2:57 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> Could you be more specific on the good vs. bad kernel versions? Because I do
> not remember any oom changes that would affect the time-to-check vs.
> time-to-kill race. The timing might be slightly different in each kernel
> version, of course.

The original upgrade attempt included a large window of kernel versions:
4.14.137 to 4.19.91. In attempting to narrow down the failure, we found that
in tests of 10 runs we went from 0/10 failures to 1/10 failures with the
update from
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/74fab24be8994bb5bb8d1aa8828f50e16bb38346
(based on 4.19.60) to
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/6e0fef1b46bb91c196be56365d9af72e52bb4675
(also based on 4.19.60), and then we went from 1/10 failures to 9/10 failures
with the upgrade to
https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a33dffa8e5c47b877e4daece938a81e3cc810b90
(which I believe is based on 4.19.72).

(This was all before we had the minimal repro yielding Robert's 61/100 ->
0/100 stat in his previous email.)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -445,6 +445,8 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
@@ -945,6 +947,11 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1286,7 +1286,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
 	unsigned long margin = 0;
 	unsigned long count;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -972,9 +972,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(victim);
 
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, victim);
-
 	/*
 	 * Do we need to kill the entire memory cgroup?
 	 * Or even one of the ancestor memory cgroups?
@@ -982,6 +979,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
 
+	if (is_memcg_oom(oc)) {
+		cond_resched();
+
+		/* One last check: do we *really* need to kill? */
+		if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
+			put_task_struct(victim);
+			return;
+		}
+	}
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, victim);
+
 	__oom_kill_process(victim, message);
 
 	/*
Killing a user process as a result of hitting memcg limits is a serious
decision that is unfortunately needed only when no forward progress in
reclaiming memory can be made.

Deciding the appropriate oom victim can take a sufficient amount of time
that allows another process that is exiting to actually uncharge to the
same memcg hierarchy and prevent unnecessarily killing user processes.

An example is to prevent *multiple* unnecessary oom kills on a system
with two cores where the oom kill occurs when there is an abundance of
free memory available:

Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
<immediately after>
repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
Call Trace:
 dump_stack+0x78/0xb6
 dump_header+0x55/0x240
 oom_kill_process+0xc5/0x170
 out_of_memory+0x305/0x4a0
 try_charge+0x77b/0xac0
 mem_cgroup_try_charge+0x10a/0x220
 mem_cgroup_try_charge_delay+0x1e/0x40
 handle_mm_fault+0xdf2/0x15f0
 do_user_addr_fault+0x21f/0x420
 async_page_fault+0x2f/0x40
memory: usage 61336kB, limit 102400kB, failcnt 74

Notice the second memcg oom kill shows usage is >40MB below its limit of
100MB but a process is still unnecessarily killed because the decision has
already been made to oom kill by calling out_of_memory() before the
initial victim had a chance to uncharge its memory.

Make a last minute check to determine if an oom kill is really needed to
prevent unnecessary oom killing.

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: David Rientjes <rientjes@google.com>
---
 include/linux/memcontrol.h |  7 +++++++
 mm/memcontrol.c            |  2 +-
 mm/oom_kill.c              | 16 +++++++++++++---
 3 files changed, 21 insertions(+), 4 deletions(-)
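The thread does not include the source of the "repro" program shown in the log above. The following is only a guess at its shape: each instance faults in roughly 40MB of anonymous memory, so running a few instances inside one memcg with a ~100MB limit makes one instance exit (and uncharge) at about the same time another pushes the group over its limit, which is the window the last-minute margin check targets.

/*
 * Hypothetical reproducer sketch: fault in ~40MB of anonymous memory
 * inside a limited memcg, hold it briefly, then exit and uncharge.
 */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define ALLOC_SIZE	(40UL << 20)	/* ~40MB, as in the quoted log */

int main(void)
{
	char *buf;
	unsigned long off;
	long page_size = sysconf(_SC_PAGESIZE);

	buf = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* touch every page so the memcg actually charges it */
	for (off = 0; off < ALLOC_SIZE; off += page_size)
		buf[off] = 1;

	sleep(1);	/* hold the memory briefly, then exit and uncharge */
	return 0;
}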