
[v2] mm, memcg: skip killing processes under memcg protection at first scan

Message ID 1566177486-2649-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived

Commit Message

Yafang Shao Aug. 19, 2019, 1:18 a.m. UTC
In the current memory.min design, the system will OOM instead of
reclaiming the reclaimable pages protected by memory.min when it is
short of free memory. Under this condition, the OOM killer may kill
the processes in the memcg protected by memory.min. This behavior is
very weird.
In order to make it more reasonable, I make some changes in the OOM
killer. In this patch, the OOM killer does a two-round scan. It skips
the processes under memcg protection at the first scan, and if it
can't kill any process it rescans all the processes.

Regarding the overhead this change may introduce, I don't think it
will be a problem, because it only happens when the system is under
memory pressure and the OOM killer can't find any proper victim that
is not under memcg protection.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
---
 include/linux/memcontrol.h |  6 ++++++
 mm/memcontrol.c            | 16 ++++++++++++++++
 mm/oom_kill.c              | 23 +++++++++++++++++++++--
 3 files changed, 43 insertions(+), 2 deletions(-)
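The two-round scan described in the commit message can be sketched roughly as follows. This is a simplified userspace model, not the actual code from the patch: `struct task`, `is_protected`, and `select_oom_victim()` are hypothetical stand-ins for `task_struct`, the patch's `task_under_memcg_protection()` check, and the victim-selection loop in `mm/oom_kill.c`.

```c
#include <stddef.h>

/* Simplified stand-in for a candidate task; "is_protected" models the
 * hypothetical task_under_memcg_protection() check from the patch. */
struct task {
    const char *name;
    long badness;        /* analogous to oom_badness() points */
    int is_protected;    /* task's memcg is under memory.min protection */
};

/* Round 1 skips memcg-protected tasks; if no victim was found,
 * round 2 rescans every task. Returns the victim's index, or -1. */
int select_oom_victim(const struct task *tasks, size_t n)
{
    int victim = -1;
    long best = -1;

    for (int round = 0; round < 2 && victim < 0; round++) {
        for (size_t i = 0; i < n; i++) {
            if (round == 0 && tasks[i].is_protected)
                continue;       /* honored only in the first scan */
            if (tasks[i].badness > best) {
                best = tasks[i].badness;
                victim = (int)i;
            }
        }
    }
    return victim;
}
```

If every task is protected, the first scan finds nothing and the second scan falls back to normal selection, matching the "rescan all the processes" behavior described above.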

Comments

Roman Gushchin Aug. 19, 2019, 9:12 p.m. UTC | #1
On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> In the current memory.min design, the system is going to do OOM instead
> of reclaiming the reclaimable pages protected by memory.min if the
> system is lack of free memory. While under this condition, the OOM
> killer may kill the processes in the memcg protected by memory.min.
> This behavior is very weird.
> In order to make it more reasonable, I make some changes in the OOM
> killer. In this patch, the OOM killer will do two-round scan. It will
> skip the processes under memcg protection at the first scan, and if it
> can't kill any processes it will rescan all the processes.
> 
> Regarding the overhead this change may takes, I don't think it will be a
> problem because this only happens under system  memory pressure and
> the OOM killer can't find any proper victims which are not under memcg
> protection.

Hi Yafang!

The idea makes sense at first glance, but I'm worried
about mixing per-memcg and per-process characteristics.
It raises many questions:
1) if we do respect memory.min, why not memory.low too?
2) if the task is 200Gb large, does 10Mb memory protection make any
difference? if so, why would we respect it?
3) if it works for global OOMs, why not memcg-level OOMs?
4) if the task is prioritized to be killed by OOM (via oom_score_adj),
why even small memory.protection prevents it completely?
5) if there are two tasks similar in size and both protected,
should we prefer one with the smaller protection?
etc.

Actually, I think that it makes more sense to build a completely
cgroup-aware OOM killer, which would select the OOM victim by scanning
the memcg tree, not individual tasks. Then it could easily respect
memory.low/min in a reasonable way.
But I failed to reach upstream consensus on how it should work.
You can search for "memcg-aware OOM killer" in the lkml archive;
there was a ton of discussion and many many patchset versions.


The code itself can be simplified a bit too, but I think it's
not that important now.

Thanks!
Yafang Shao Aug. 20, 2019, 1:16 a.m. UTC | #2
On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > In the current memory.min design, the system is going to do OOM instead
> > of reclaiming the reclaimable pages protected by memory.min if the
> > system is lack of free memory. While under this condition, the OOM
> > killer may kill the processes in the memcg protected by memory.min.
> > This behavior is very weird.
> > In order to make it more reasonable, I make some changes in the OOM
> > killer. In this patch, the OOM killer will do two-round scan. It will
> > skip the processes under memcg protection at the first scan, and if it
> > can't kill any processes it will rescan all the processes.
> >
> > Regarding the overhead this change may takes, I don't think it will be a
> > problem because this only happens under system  memory pressure and
> > the OOM killer can't find any proper victims which are not under memcg
> > protection.
>
> Hi Yafang!
>
> The idea makes sense at the first glance, but actually I'm worried
> about mixing per-memcg and per-process characteristics.
> Actually, it raises many questions:
> 1) if we do respect memory.min, why not memory.low too?

memory.low is different from memory.min, as the OOM killer will not be
invoked when it is reached.
If memory.low should be considered as well, we can use
mem_cgroup_protected() to replace task_under_memcg_protection()
here.

> 2) if the task is 200Gb large, does 10Mb memory protection make any
> difference? if so, why would we respect it?

Same as above: only consider it when the protection is enabled.

> 3) if it works for global OOMs, why not memcg-level OOMs?

A memcg OOM happens when the memory limit is reached, nothing can be
reclaimed in the memcg, and processes in this memcg have to be killed.
That is different from global OOM, because the global OOM killer can
choose processes outside the memcg but the memcg OOM can't.

> 4) if the task is prioritized to be killed by OOM (via oom_score_adj),
> why even small memory.protection prevents it completely?

Could you please show me some examples of when we would set both
memory.min (meaning the processes in this memcg are very important)
and a higher oom_score_adj (meaning the processes in this memcg are
not important at all)?
Note that memory.min doesn't know which processes are important;
it only knows whether a process is in this memcg.

> 5) if there are two tasks similar in size and both protected,
> should we prefer one with the smaller protection?
> etc.

Same as the answer to 1).

>
> Actually, I think that it makes more sense to build a completely
> cgroup-aware OOM killer, which will select the OOM victim scanning
> the memcg tree, not individual tasks. And then it can easily respect
> memory.low/min in a reasonable way.


I haven't taken a close look at the memcg-aware OOM killer, but even
with a cgroup-aware OOM killer I think it still can't answer your
question 4).

> But I failed to reach the upstream consensus on how it should work.
> You can search for "memcg-aware OOM killer" in the lkml archive,
> there was a ton of discussions and many many patchset versions.
>

I will take a close look at it.  Thanks for your reference.

>
> The code itself can be simplified a bit too, but I think it's
> not that important now.
>
> Thanks!


Thanks
Yafang
Roman Gushchin Aug. 20, 2019, 1:39 a.m. UTC | #3
On Tue, Aug 20, 2019 at 09:16:01AM +0800, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > > In the current memory.min design, the system is going to do OOM instead
> > > of reclaiming the reclaimable pages protected by memory.min if the
> > > system is lack of free memory. While under this condition, the OOM
> > > killer may kill the processes in the memcg protected by memory.min.
> > > This behavior is very weird.
> > > In order to make it more reasonable, I make some changes in the OOM
> > > killer. In this patch, the OOM killer will do two-round scan. It will
> > > skip the processes under memcg protection at the first scan, and if it
> > > can't kill any processes it will rescan all the processes.
> > >
> > > Regarding the overhead this change may takes, I don't think it will be a
> > > problem because this only happens under system  memory pressure and
> > > the OOM killer can't find any proper victims which are not under memcg
> > > protection.
> >
> > Hi Yafang!
> >
> > The idea makes sense at the first glance, but actually I'm worried
> > about mixing per-memcg and per-process characteristics.
> > Actually, it raises many questions:
> > 1) if we do respect memory.min, why not memory.low too?
> 
> memroy.low is different with memory.min, as the OOM killer will not be
> invoked when it is reached.
> If memory.low should be considered as well, we can use
> mem_cgroup_protected() here to repclace task_under_memcg_protection()
> here.
> 
> > 2) if the task is 200Gb large, does 10Mb memory protection make any
> > difference? if so, why would we respect it?
> 
> Same with above, only consider it when the proctecion is enabled.

Right, but memory.min is a number, not a boolean flag. It defines
how much memory is protected. You're using it as an on-off knob,
which is sub-optimal from my point of view.

> 
> > 3) if it works for global OOMs, why not memcg-level OOMs?
> 
> memcg OOM is when the memory limit is reached and it can't find
> something to relcaim in the memcg and have to kill processes in this
> memcg.
> That is different with global OOM, because the global OOM can chose
> processes outside the memcg but the memcg OOM can't.

Imagine the following hierarchy:
     /
     |
     A         memory.max = 10G, memory.min = 2G
    / \
   B   C       B: memory.min = 1G, C: memory.min = 0

Say you have a memcg OOM in A; why isn't B's memory.min respected?
How is it different from the system-wide OOM?
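The hierarchy question above can be made concrete with a sketch. This is a hypothetical, heavily simplified model of hierarchical memory.min (the real kernel, via mem_cgroup_protected(), also distributes a parent's protection proportionally among its children); here a cgroup's effective protection is simply capped by every ancestor's min up to the reclaim root.

```c
#include <stddef.h>

/* Hypothetical model: a memcg's effective memory.min is capped by each
 * ancestor's min on the path up to the reclaim root. This ignores the
 * proportional distribution the kernel actually performs. */
struct memcg {
    const struct memcg *parent;
    long min;                    /* memory.min, in bytes */
};

/* Effective min of cg when reclaim/OOM originates at root
 * (root == NULL models a global, system-wide OOM). */
long effective_min(const struct memcg *cg, const struct memcg *root)
{
    long e = cg->min;

    for (const struct memcg *p = cg->parent; p && p != root; p = p->parent)
        if (p->min < e)
            e = p->min;
    return e;
}
```

With A.min = 2G and B.min = 1G as in the example, B keeps its 1G of effective protection whether the OOM is scoped to A or system-wide, which is the point of the question: the memcg-level case is not fundamentally different.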

> 
> > 4) if the task is prioritized to be killed by OOM (via oom_score_adj),
> > why even small memory.protection prevents it completely?
> 
> Would you pls. show me some examples that when we will set both
> memory.min(meaning the porcesses in this memcg is very important) and
> higher oom_score_adj(meaning the porcesses in this memcg is not
> improtant at all) ?
> Note that the memory.min don't know which processes is important,
> while it only knows is if this process in this memcg.

For instance, to prefer a specific process to be killed in case
of memcg OOM.
Also, memory.min may be used mostly to preserve the pagecache,
while an OOM kill indicates nothing but an anon memory leak.
In that case, it makes no sense to protect the leaking task.

> 
> > 5) if there are two tasks similar in size and both protected,
> > should we prefer one with the smaller protection?
> > etc.
> 
> Same with the answer in 1).

So the problem is not that your patch is incorrect (or the idea is bad),
but that you're defining a new policy, which will be impossible or very
hard to change later (like any other policy).

So it's important to define it very well. Using the memory.min
number as a binary flag for selecting tasks seems a bit limited.
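The objection that the patch turns a quantity into a boolean can be illustrated with a small sketch. This is purely an illustration of the design space with hypothetical helpers, not code from the patch or the kernel:

```c
/* Illustration only: contrast using memory.min as a boolean with a
 * proportional discount. "usage" and "min" model a memcg's current
 * usage and its memory.min; "badness" models oom_badness() points. */

/* The patch's policy: any nonzero protection exempts the task from
 * the first scan, whether it covers 10MB or the whole workload. */
int skip_in_first_scan(long min)
{
    return min > 0;
}

/* An alternative: scale badness down by the fraction of the memcg's
 * usage that is actually protected, so a tiny memory.min barely
 * shields a huge task. (Assumes values small enough that
 * badness * min does not overflow a long.) */
long discounted_badness(long badness, long usage, long min)
{
    if (usage <= 0)
        return badness;
    if (min >= usage)
        return 0;               /* usage fully protected */
    return badness - badness * min / usage;
}
```

Under the proportional variant, a 200GB task with 10MB of protection keeps essentially all of its badness, addressing question 2) above, while a fully protected memcg still shields its tasks completely.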


Thanks!
Yafang Shao Aug. 20, 2019, 2:01 a.m. UTC | #4
On Tue, Aug 20, 2019 at 9:40 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Aug 20, 2019 at 09:16:01AM +0800, Yafang Shao wrote:
> > On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > > > In the current memory.min design, the system is going to do OOM instead
> > > > of reclaiming the reclaimable pages protected by memory.min if the
> > > > system is lack of free memory. While under this condition, the OOM
> > > > killer may kill the processes in the memcg protected by memory.min.
> > > > This behavior is very weird.
> > > > In order to make it more reasonable, I make some changes in the OOM
> > > > killer. In this patch, the OOM killer will do two-round scan. It will
> > > > skip the processes under memcg protection at the first scan, and if it
> > > > can't kill any processes it will rescan all the processes.
> > > >
> > > > Regarding the overhead this change may takes, I don't think it will be a
> > > > problem because this only happens under system  memory pressure and
> > > > the OOM killer can't find any proper victims which are not under memcg
> > > > protection.
> > >
> > > Hi Yafang!
> > >
> > > The idea makes sense at the first glance, but actually I'm worried
> > > about mixing per-memcg and per-process characteristics.
> > > Actually, it raises many questions:
> > > 1) if we do respect memory.min, why not memory.low too?
> >
> > memroy.low is different with memory.min, as the OOM killer will not be
> > invoked when it is reached.
> > If memory.low should be considered as well, we can use
> > mem_cgroup_protected() here to repclace task_under_memcg_protection()
> > here.
> >
> > > 2) if the task is 200Gb large, does 10Mb memory protection make any
> > > difference? if so, why would we respect it?
> >
> > Same with above, only consider it when the proctecion is enabled.
>
> Right, but memory.min is a number, not a boolean flag. It defines
> how much memory is protected. You're using it as an on-off knob,
> which is sub-optimal from my point of view.
>

I mean using mem_cgroup_protected(), the same way memory.min is
implemented in the global reclaim path.

> >
> > > 3) if it works for global OOMs, why not memcg-level OOMs?
> >
> > memcg OOM is when the memory limit is reached and it can't find
> > something to relcaim in the memcg and have to kill processes in this
> > memcg.
> > That is different with global OOM, because the global OOM can chose
> > processes outside the memcg but the memcg OOM can't.
>
> Imagine the following hierarchy:
>      /
>      |
>      A         memory.max = 10G, memory.min = 2G
>     / \
>    B   C       memory.min = 1G, memory.min = 0
>
> Say, you have memcg OOM in A, why B's memory min is not respected?
> How it's different to the system-wide OOM?
>

Ah, this should be considered as well. Thanks for pointing that out.

> >
> > > 4) if the task is prioritized to be killed by OOM (via oom_score_adj),
> > > why even small memory.protection prevents it completely?
> >
> > Would you pls. show me some examples that when we will set both
> > memory.min(meaning the porcesses in this memcg is very important) and
> > higher oom_score_adj(meaning the porcesses in this memcg is not
> > improtant at all) ?
> > Note that the memory.min don't know which processes is important,
> > while it only knows is if this process in this memcg.
>
> For instance, to prefer a specific process to be killed in case
> of memcg OOM.
> Also, memory.min can be used mostly to preserve the pagecache,
> and an OOM kill means nothing but some anon memory leak.
> In this case, it makes no sense to protect the leaked task.
>

But actually what memory.min protects is the memory usage, not the
pagecache specifically;
e.g. if the anon memory is higher than memory.min, then memory.min
can't protect file memory when swap is off.

Even if there is no anon memory leak, the OOM killer can also be
invoked due to excess use of memory.
Plus, memory.min can also protect leaked anon memory in the
current implementation.

> >
> > > 5) if there are two tasks similar in size and both protected,
> > > should we prefer one with the smaller protection?
> > > etc.
> >
> > Same with the answer in 1).
>
> So the problem is not that your patch is incorrect (or the idea is bad),
> but you're defining a new policy, which will be impossible or very hard
> to change further (as any other policy).
>
> So it's important to define it very well. Using the memory.min
> number as a binary flag for selecting tasks seems a bit limited.
>
>
> Thanks!
Yafang Shao Aug. 20, 2019, 2:40 a.m. UTC | #5
On Tue, Aug 20, 2019 at 10:01 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Aug 20, 2019 at 9:40 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Tue, Aug 20, 2019 at 09:16:01AM +0800, Yafang Shao wrote:
> > > On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > > > > In the current memory.min design, the system is going to do OOM instead
> > > > > of reclaiming the reclaimable pages protected by memory.min if the
> > > > > system is lack of free memory. While under this condition, the OOM
> > > > > killer may kill the processes in the memcg protected by memory.min.
> > > > > This behavior is very weird.
> > > > > In order to make it more reasonable, I make some changes in the OOM
> > > > > killer. In this patch, the OOM killer will do two-round scan. It will
> > > > > skip the processes under memcg protection at the first scan, and if it
> > > > > can't kill any processes it will rescan all the processes.
> > > > >
> > > > > Regarding the overhead this change may takes, I don't think it will be a
> > > > > problem because this only happens under system  memory pressure and
> > > > > the OOM killer can't find any proper victims which are not under memcg
> > > > > protection.
> > > >
> > > > Hi Yafang!
> > > >
> > > > The idea makes sense at the first glance, but actually I'm worried
> > > > about mixing per-memcg and per-process characteristics.
> > > > Actually, it raises many questions:
> > > > 1) if we do respect memory.min, why not memory.low too?
> > >
> > > memroy.low is different with memory.min, as the OOM killer will not be
> > > invoked when it is reached.
> > > If memory.low should be considered as well, we can use
> > > mem_cgroup_protected() here to repclace task_under_memcg_protection()
> > > here.
> > >
> > > > 2) if the task is 200Gb large, does 10Mb memory protection make any
> > > > difference? if so, why would we respect it?
> > >
> > > Same with above, only consider it when the proctecion is enabled.
> >
> > Right, but memory.min is a number, not a boolean flag. It defines
> > how much memory is protected. You're using it as an on-off knob,
> > which is sub-optimal from my point of view.
> >
>
> I mean using mem_cgroup_protected(), sam with memory.min is
> implementad in the global reclaim path.
>
> > >
> > > > 3) if it works for global OOMs, why not memcg-level OOMs?
> > >
> > > memcg OOM is when the memory limit is reached and it can't find
> > > something to relcaim in the memcg and have to kill processes in this
> > > memcg.
> > > That is different with global OOM, because the global OOM can chose
> > > processes outside the memcg but the memcg OOM can't.
> >
> > Imagine the following hierarchy:
> >      /
> >      |
> >      A         memory.max = 10G, memory.min = 2G
> >     / \
> >    B   C       memory.min = 1G, memory.min = 0
> >
> > Say, you have memcg OOM in A, why B's memory min is not respected?
> > How it's different to the system-wide OOM?
> >
>
> Ah, this should be considered as well. Thanks for pointing out.
>
> > >
> > > > 4) if the task is prioritized to be killed by OOM (via oom_score_adj),
> > > > why even small memory.protection prevents it completely?
> > >
> > > Would you pls. show me some examples that when we will set both
> > > memory.min(meaning the porcesses in this memcg is very important) and
> > > higher oom_score_adj(meaning the porcesses in this memcg is not
> > > improtant at all) ?
> > > Note that the memory.min don't know which processes is important,
> > > while it only knows is if this process in this memcg.
> >
> > For instance, to prefer a specific process to be killed in case
> > of memcg OOM.
> > Also, memory.min can be used mostly to preserve the pagecache,
> > and an OOM kill means nothing but some anon memory leak.
> > In this case, it makes no sense to protect the leaked task.
> >
>
> But actually what memory.min protected is the memory usage, instead of
> pagecache,
> e.g. if the anon memory is higher than memory.min, then memroy.min
> can't protect file memory when swap is off.
>
> Even there is no anon memory leak, the OOM killer can also be invoked
> due to excess use of memroy.
> Plus, the memory.min can also protect the leaked anon memroy in
> current implementation.
>

BTW, if two different memcgs open the same file, the memcg
protection will not work if one memcg is protected while the other
is not.
But that may be a rare case.

> > >
> > > > 5) if there are two tasks similar in size and both protected,
> > > > should we prefer one with the smaller protection?
> > > > etc.
> > >
> > > Same with the answer in 1).
> >
> > So the problem is not that your patch is incorrect (or the idea is bad),
> > but you're defining a new policy, which will be impossible or very hard
> > to change further (as any other policy).
> >
> > So it's important to define it very well. Using the memory.min
> > number as a binary flag for selecting tasks seems a bit limited.
> >
> >
> > Thanks!
Michal Hocko Aug. 20, 2019, 6:40 a.m. UTC | #6
On Tue 20-08-19 09:16:01, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> >
> > On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > > In the current memory.min design, the system is going to do OOM instead
> > > of reclaiming the reclaimable pages protected by memory.min if the
> > > system is lack of free memory. While under this condition, the OOM
> > > killer may kill the processes in the memcg protected by memory.min.
> > > This behavior is very weird.
> > > In order to make it more reasonable, I make some changes in the OOM
> > > killer. In this patch, the OOM killer will do two-round scan. It will
> > > skip the processes under memcg protection at the first scan, and if it
> > > can't kill any processes it will rescan all the processes.
> > >
> > > Regarding the overhead this change may takes, I don't think it will be a
> > > problem because this only happens under system  memory pressure and
> > > the OOM killer can't find any proper victims which are not under memcg
> > > protection.
> >
> > Hi Yafang!
> >
> > The idea makes sense at the first glance, but actually I'm worried
> > about mixing per-memcg and per-process characteristics.
> > Actually, it raises many questions:
> > 1) if we do respect memory.min, why not memory.low too?
> 
> memroy.low is different with memory.min, as the OOM killer will not be
> invoked when it is reached.

Responded in the other email thread (please do not post two versions of
the patch on the same day, because it makes the conversation too
scattered and confusing).

Think of min limit protection as some sort of a more intelligent mlock.
It protects from the regular memory reclaim and it can lead to an OOM
situation (be it global or memcg), but by no means does it prevent the
system from killing the workload if there is a need. Those two
decisions are simply orthogonal IMHO. The latter is an emergency action
while the former helps guarantee the runtime behavior of the workload.

To be completely fair, the OOM killer is a sort of memory reclaim as
well, so strictly speaking both mlock and memcg min protection could be
considered, but from any practical aspect I can think of I simply do
not see a strong usecase that would justify a more complex OOM
behavior. People will simply be confused that the selection is less
deterministic.
Yafang Shao Aug. 20, 2019, 7:15 a.m. UTC | #7
On Tue, Aug 20, 2019 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 20-08-19 09:16:01, Yafang Shao wrote:
> > On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> > >
> > > On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > > > In the current memory.min design, the system is going to do OOM instead
> > > > of reclaiming the reclaimable pages protected by memory.min if the
> > > > system is lack of free memory. While under this condition, the OOM
> > > > killer may kill the processes in the memcg protected by memory.min.
> > > > This behavior is very weird.
> > > > In order to make it more reasonable, I make some changes in the OOM
> > > > killer. In this patch, the OOM killer will do two-round scan. It will
> > > > skip the processes under memcg protection at the first scan, and if it
> > > > can't kill any processes it will rescan all the processes.
> > > >
> > > > Regarding the overhead this change may takes, I don't think it will be a
> > > > problem because this only happens under system  memory pressure and
> > > > the OOM killer can't find any proper victims which are not under memcg
> > > > protection.
> > >
> > > Hi Yafang!
> > >
> > > The idea makes sense at the first glance, but actually I'm worried
> > > about mixing per-memcg and per-process characteristics.
> > > Actually, it raises many questions:
> > > 1) if we do respect memory.min, why not memory.low too?
> >
> > memroy.low is different with memory.min, as the OOM killer will not be
> > invoked when it is reached.
>
> Responded in other email thread (please do not post two versions of the
> patch on the same day because it makes conversation too scattered and
> confusing).
>
(This is a time zone issue :-) )

> Think of min limit protection as some sort of a more inteligent mlock.

From my perspective, it is a less intelligent mlock, because what it
protects may be garbage memory.
As I said before, what it protects is the memory usage, rather than a
specific file memory or anon memory or something else.

Its advantage is that it is easy to use.

> It protects from the regular memory reclaim and it can lead to the OOM
> situation (be it global or memcg) but by no means it doesn't prevent
> from the system to kill the workload if there is a need. Those two
> decisions are simply orthogonal IMHO. The later is a an emergency action
> while the former is to help guanratee a runtime behavior of the workload.
>

If it could also handle the OOM kind of memory reclaim, it would be more intelligent.

> To be completely fair, the OOM killer is a sort of the memory reclaim as
> well so strictly speaking both mlock and memcg min protection could be
> considered but from any practical aspect I can think of I simply do not
> see a strong usecase that would justify a more complex oom behavior.
> People will be simply confused that the selection is less deterministic
> and therefore more confusing.
> --

So what about adjusting oom_score_adj automatically when we set
memory.min or mlock?

Thanks
Yafang
Michal Hocko Aug. 20, 2019, 7:27 a.m. UTC | #8
On Tue 20-08-19 15:15:54, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 20-08-19 09:16:01, Yafang Shao wrote:
> > > On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> > > >
> > > > On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > > > > In the current memory.min design, the system is going to do OOM instead
> > > > > of reclaiming the reclaimable pages protected by memory.min if the
> > > > > system is lack of free memory. While under this condition, the OOM
> > > > > killer may kill the processes in the memcg protected by memory.min.
> > > > > This behavior is very weird.
> > > > > In order to make it more reasonable, I make some changes in the OOM
> > > > > killer. In this patch, the OOM killer will do two-round scan. It will
> > > > > skip the processes under memcg protection at the first scan, and if it
> > > > > can't kill any processes it will rescan all the processes.
> > > > >
> > > > > Regarding the overhead this change may takes, I don't think it will be a
> > > > > problem because this only happens under system  memory pressure and
> > > > > the OOM killer can't find any proper victims which are not under memcg
> > > > > protection.
> > > >
> > > > Hi Yafang!
> > > >
> > > > The idea makes sense at the first glance, but actually I'm worried
> > > > about mixing per-memcg and per-process characteristics.
> > > > Actually, it raises many questions:
> > > > 1) if we do respect memory.min, why not memory.low too?
> > >
> > > memroy.low is different with memory.min, as the OOM killer will not be
> > > invoked when it is reached.
> >
> > Responded in other email thread (please do not post two versions of the
> > patch on the same day because it makes conversation too scattered and
> > confusing).
> >
> (This is an issue about time zone :-) )

Normally we wait a few days until the feedback on a particular patch
has settled before a new version is posted.

> > Think of min limit protection as some sort of a more inteligent mlock.
> 
> Per my perspective, it is a less inteligent mlock, because what it
> protected may be a garbage memory.
> As I said before, what it protected is the memroy usage, rather than a
> specified file memory or anon memory or somethin else.
> 
> The advantage of it is easy to use.
> 
> > It protects from the regular memory reclaim and it can lead to the OOM
> > situation (be it global or memcg) but by no means it doesn't prevent
> > from the system to kill the workload if there is a need. Those two
> > decisions are simply orthogonal IMHO. The later is a an emergency action
> > while the former is to help guanratee a runtime behavior of the workload.
> >
> 
> If it can handle OOM memory reclaim, it will be more inteligent.

Can we get back to an actual usecase please?
 
> > To be completely fair, the OOM killer is a sort of the memory reclaim as
> > well so strictly speaking both mlock and memcg min protection could be
> > considered but from any practical aspect I can think of I simply do not
> > see a strong usecase that would justify a more complex oom behavior.
> > People will be simply confused that the selection is less deterministic
> > and therefore more confusing.
> > --
> 
> So what about ajusting the oom_socore_adj automatically when we set
> memory.min or mlock ?

oom_score_adj is a _user_ tuning. The kernel has no business in
auto-tuning it. It should just consume the value.
Yafang Shao Aug. 20, 2019, 7:49 a.m. UTC | #9
On Tue, Aug 20, 2019 at 3:27 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 20-08-19 15:15:54, Yafang Shao wrote:
> > On Tue, Aug 20, 2019 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 20-08-19 09:16:01, Yafang Shao wrote:
> > > > On Tue, Aug 20, 2019 at 5:12 AM Roman Gushchin <guro@fb.com> wrote:
> > > > >
> > > > > On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
> > > > > > In the current memory.min design, the system is going to do OOM instead
> > > > > > of reclaiming the reclaimable pages protected by memory.min if the
> > > > > > system is lack of free memory. While under this condition, the OOM
> > > > > > killer may kill the processes in the memcg protected by memory.min.
> > > > > > This behavior is very weird.
> > > > > > In order to make it more reasonable, I make some changes in the OOM
> > > > > > killer. In this patch, the OOM killer will do two-round scan. It will
> > > > > > skip the processes under memcg protection at the first scan, and if it
> > > > > > can't kill any processes it will rescan all the processes.
> > > > > >
> > > > > > Regarding the overhead this change may takes, I don't think it will be a
> > > > > > problem because this only happens under system  memory pressure and
> > > > > > the OOM killer can't find any proper victims which are not under memcg
> > > > > > protection.
> > > > >
> > > > > Hi Yafang!
> > > > >
> > > > > The idea makes sense at the first glance, but actually I'm worried
> > > > > about mixing per-memcg and per-process characteristics.
> > > > > Actually, it raises many questions:
> > > > > 1) if we do respect memory.min, why not memory.low too?
> > > >
> > > > memroy.low is different with memory.min, as the OOM killer will not be
> > > > invoked when it is reached.
> > >
> > > Responded in other email thread (please do not post two versions of the
> > > patch on the same day because it makes conversation too scattered and
> > > confusing).
> > >
> > (This is an issue about time zone :-) )
>
> Normally we wait a few days until feedback on the particular patch is
> settled before a new version is posted.
>
> > > Think of min limit protection as some sort of a more intelligent mlock.
> >
> > From my perspective, it is a less intelligent mlock, because what it
> > protects may be garbage memory.
> > As I said before, what it protects is the memory usage, rather than
> > specific file memory or anon memory or something else.
> >
> > Its advantage is that it is easy to use.
> >
> > > It protects from regular memory reclaim and it can lead to an OOM
> > > situation (be it global or memcg), but by no means does it prevent
> > > the system from killing the workload if there is a need. Those two
> > > decisions are simply orthogonal IMHO. The latter is an emergency action
> > > while the former is to help guarantee the runtime behavior of the workload.
> > >
> >
> > If it could also handle OOM memory reclaim, it would be more intelligent.
>
> Can we get back to an actual usecase please?
>

No real usecase.
What concerns us is that it can lead to more OOMs but can't protect
itself in OOM; that behavior seems a little weird.
Setting oom_score_adj is another choice, but there's no memcg-level
oom_score_adj.
memory.min is memcg-level, while oom_score_adj is process-level; that
is weird as well.

> > > To be completely fair, the OOM killer is a sort of memory reclaim as
> > > well, so strictly speaking both mlock and memcg min protection could be
> > > considered, but from any practical aspect I can think of I simply do not
> > > see a strong usecase that would justify a more complex OOM behavior.
> > > People will simply be confused that the selection is less deterministic
> > > and therefore more confusing.
> > > --
> >
> > So what about adjusting the oom_score_adj automatically when we set
> > memory.min or mlock?
>
> oom_score_adj is a _user_ tuning. The kernel has no business in
> auto-tuning it. It should just consume the value.
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 20, 2019, 8:34 a.m. UTC | #10
On Tue 20-08-19 15:49:20, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 3:27 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> 
> No real usecase.
> What concerns us is that it can lead to more OOMs but can't protect
> itself in OOM; that behavior seems a little weird.

This is a natural side effect of protecting memory from reclaim.
Read: an mlock kind of protection. Weird? I dunno. Unexpected? No!

> Setting oom_score_adj is another choice, but there's no memcg-level
> oom_score_adj.
> memory.min is memcg-level, while oom_score_adj is process-level; that
> is weird as well.

OOM is a per-process operation. Sure, we have the group kill option, but
the selection is still per-process.

Without any clear usecase in sight I do not think it makes sense to
pursue this further.

Thanks!
Yafang Shao Aug. 20, 2019, 8:55 a.m. UTC | #11
On Tue, Aug 20, 2019 at 4:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 20-08-19 15:49:20, Yafang Shao wrote:
[...]
> OOM is a per-process operation. Sure, we have the group kill option, but
> the selection is still per-process.
>
> Without any clear usecase in sight I do not think it makes sense to
> pursue this further.
>

As there's a memory.oom.group option to kill all processes in a memcg,
why not introduce a memcg-level memory.oom.score_adj?
Then we can set different scores for different memcgs.
Because we always deploy lots of containers on a single host, when OOM
occurs it would be better to kill the low priority containers (those
with a higher memory.oom.score_adj) first.


Thanks
Yafang
Michal Hocko Aug. 20, 2019, 9:17 a.m. UTC | #12
On Tue 20-08-19 16:55:12, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 4:34 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> 
> > As there's a memory.oom.group option to kill all processes in a memcg,
> > why not introduce a memcg-level memory.oom.score_adj?

Because the OOM selection is process-based, as already mentioned. There
was a long discussion about memcg-based OOM victim selection last year,
but no consensus was reached.

> > Then we can set different scores for different memcgs.
> > Because we always deploy lots of containers on a single host, when OOM
> > occurs it would be better to kill the low priority containers (those
> > with a higher memory.oom.score_adj) first.

How would you define a low priority container with score_adj?
Yafang Shao Aug. 20, 2019, 9:26 a.m. UTC | #13
On Tue, Aug 20, 2019 at 5:17 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 20-08-19 16:55:12, Yafang Shao wrote:
[...]
> How would you define a low priority container with score_adj?
>

For example, Container-A is high priority and Container-B is low priority.
When an OOM occurs we prefer to kill all processes in Container-B
and prevent Container-A from being killed.
So we set memory.oom.score_adj to -1000 for Container-A and to +1000
for Container-B, with memory.oom.group set on both containers.
When we set memory.oom.score_adj on a container, this value is applied
to the oom_score_adj of all processes belonging to that container.
If a new process is created, its oom_score_adj will be initialized
from this memory.oom.score_adj as well.

Thanks
Yafang
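For illustration, the propagation semantics proposed above could be modeled like this. Note that memory.oom.score_adj is a hypothetical knob discussed in this thread, not an existing kernel interface; the classes and names below are made up for the sketch:

```python
# Toy model of the *hypothetical* memcg-level memory.oom.score_adj knob
# proposed above. It only illustrates the intended semantics: writing the
# memcg knob retroactively updates existing members, and newly attached
# processes inherit the memcg-level value.

class Process:
    def __init__(self, pid, oom_score_adj=0):
        self.pid = pid
        self.oom_score_adj = oom_score_adj

class Memcg:
    def __init__(self, name):
        self.name = name
        self.oom_score_adj = 0      # the hypothetical memory.oom.score_adj
        self.processes = []

    def set_oom_score_adj(self, value):
        # Writing the memcg-level knob updates every current member.
        self.oom_score_adj = value
        for p in self.processes:
            p.oom_score_adj = value

    def attach(self, pid):
        # New or attached processes inherit the memcg-level value.
        p = Process(pid, self.oom_score_adj)
        self.processes.append(p)
        return p

high = Memcg("container-a")      # high priority container
low = Memcg("container-b")       # low priority container
high.attach(100)
high.set_oom_score_adj(-1000)    # protect the high priority container
low.set_oom_score_adj(1000)      # prefer killing the low priority one
child = high.attach(101)         # created later: inherits -1000
```

This is the per-container on/off pattern Michal objects to in the next message: with many containers, a single adj value per memcg quickly stops expressing a meaningful ordering.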
Michal Hocko Aug. 20, 2019, 10:40 a.m. UTC | #14
On Tue 20-08-19 17:26:49, Yafang Shao wrote:
> On Tue, Aug 20, 2019 at 5:17 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > As there's a memory.oom.group option to select killing all processes
> > > in a memcg, why not introduce a memcg level memcg.oom.score_adj?
> >
> > Because the oom selection is process based as already mentioned. There
> > was a long discussion about memcg based oom victim selection last year
> > but no consensus has been achieved.
> >
> > > Then we can set different scores to different memcgs.
> > > Because we always deploy lots of containers on a single host, when OOM
> > > occurs it will better to prefer killing the low priority containers
> > > (with higher memcg.oom.score_adj) first.
> >
> > How would you define low priority container with score_adj?
> >
> 
> For example, Container-A is high priority and Container-B is low priority.
> When an OOM occurs we prefer to kill all processes in Container-B
> and prevent Container-A from being killed.
> So we set memory.oom.score_adj to -1000 for Container-A and to +1000
> for Container-B, with memory.oom.group set on both containers.
> When we set memory.oom.score_adj on a container, this value is applied
> to the oom_score_adj of all processes belonging to that container.

I hope you can see that this on/off mechanism doesn't scale and thus is
a dubious interface. Just think of multiple containers.
Roman Gushchin Aug. 20, 2019, 9:39 p.m. UTC | #15
On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
[...]

Also, on second thought, what your patch really does is basically
guarantee that no process from a memory cgroup with memory.min set
will ever be killed (unless there are no other processes). In most
cases (at least on our setups) it basically makes such processes
immune to the OOM killer (similar to oom_score_adj set to -1000).

This is far too strong a side effect of setting memory.min,
so I don't think the idea is acceptable at all.

Thanks!
Yafang Shao Aug. 21, 2019, 1 a.m. UTC | #16
On Wed, Aug 21, 2019 at 5:39 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Sun, Aug 18, 2019 at 09:18:06PM -0400, Yafang Shao wrote:
[...]
>
> Also, on second thought, what your patch really does is basically
> guarantee that no process from a memory cgroup with memory.min set
> will ever be killed (unless there are no other processes). In most
> cases (at least on our setups) it basically makes such processes
> immune to the OOM killer (similar to oom_score_adj set to -1000).
>

Actually it is between -999 and -1000.
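A toy model of the patch's two-round scan shows why the effective protection sits between -999 and -1000. This is a sketch of the selection logic only, not the actual mm/oom_kill.c code, and the badness numbers are made up:

```python
# Sketch of the two-round OOM victim scan discussed in this thread.
# memcg-protected tasks are skipped on the first pass but remain killable
# on the second -- unlike oom_score_adj == -1000, which is an absolute
# veto in both rounds. Hence "between -999 and -1000".

OOM_SCORE_ADJ_MIN = -1000

def select_victim(tasks):
    """tasks: list of (name, badness_points, oom_score_adj, memcg_protected)."""
    for skip_protected in (True, False):        # the two rounds
        best, best_points = None, -1
        for name, points, adj, protected in tasks:
            if adj == OOM_SCORE_ADJ_MIN:        # hard immunity: never killed
                continue
            if skip_protected and protected:    # first-round skip only
                continue
            if points > best_points:
                best, best_points = name, points
        if best is not None:                    # found a victim, no rescan
            return best
    return None

victims = [("db", 900, 0, True), ("web", 300, 0, False)]
# First round skips the protected "db", so the smaller "web" is picked:
assert select_victim(victims) == "web"
# When only protected tasks remain, the second round still kills one:
assert select_victim([("db", 900, 0, True)]) == "db"
# oom_score_adj == -1000 stays an absolute veto in both rounds:
assert select_victim([("db", 900, OOM_SCORE_ADJ_MIN, True)]) is None
```

The second assertion is exactly Roman's point: the protection is near-absolute as long as any unprotected process exists, but unlike -1000 it can still be overridden.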

> This is far too strong a side effect of setting memory.min,
> so I don't think the idea is acceptable at all.
>

More possible OOMs is also a strong side effect (and it prevents us
from using it).
Leaving all the other work to userspace is not proper.
We should improve it.

Thanks
Yafang
Michal Hocko Aug. 21, 2019, 6:44 a.m. UTC | #17
On Wed 21-08-19 09:00:39, Yafang Shao wrote:
[...]
> More possible OOMs is also a strong side effect (and it prevent us
> from using it).

So why don't you use the low limit if the guarantee side of the min
limit is too strong for you?
Yafang Shao Aug. 21, 2019, 7:26 a.m. UTC | #18
On Wed, Aug 21, 2019 at 2:44 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 21-08-19 09:00:39, Yafang Shao wrote:
> [...]
> > More possible OOMs is also a strong side effect (and it prevent us
> > from using it).
>
> So why don't you use low limit if the guarantee side of min limit is too
> strong for you?

Well, I don't know what the best practice for memory.min is.
In our plan, we want to use it to protect the top priority containers
(e.g. set memory.min to the same value as the memory limit), which may
be latency sensitive. Using memory.min may sometimes decrease refaults.
If we set it too low, it may be useless, because what memory.min is
protecting is not specified. And if there are some bursts of anon
memory allocation in the memcg, memory.min may not be able to protect
any file memory.

I would appreciate it if you could show me some best practices for it.

Thanks
Yafang
Michal Hocko Aug. 21, 2019, 8:05 a.m. UTC | #19
On Wed 21-08-19 15:26:56, Yafang Shao wrote:
> On Wed, Aug 21, 2019 at 2:44 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 21-08-19 09:00:39, Yafang Shao wrote:
> > [...]
> > > More possible OOMs is also a strong side effect (and it prevent us
> > > from using it).
> >
> > So why don't you use low limit if the guarantee side of min limit is too
> > strong for you?
> 
> Well, I don't know what the best-practice of memory.min is.

It is really a workload reclaim protection. Say you have a memory
consumer whose performance characteristics would be noticeably disrupted
by any memory reclaim, which would then lead to SLA disruption. This is
a strong requirement/QoS feature and as such comes with its demands on
configuration.

> In our plan, we want to use it to protect the top priority containers
> (e.g. set memory.min to the same value as the memory limit), which may
> be latency sensitive. Using memory.min may sometimes decrease refaults.
> If we set it too low, it may be useless, because what memory.min is
> protecting is not specified. And if there are some bursts of anon
> memory allocation in the memcg, memory.min may not be able to protect
> any file memory.

I am still not seeing why you are considering guarantee (memory.min)
rather than best practice (memory.low) here?
Yafang Shao Aug. 21, 2019, 8:15 a.m. UTC | #20
On Wed, Aug 21, 2019 at 4:05 PM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
>
> I am still not seeing why you are considering guarantee (memory.min)
> rather than best practice (memory.low) here?

Let me show some examples.
Suppose we have three containers with different priorities: high
priority, medium priority and low priority.
Then we set memory.low for these containers as below,
high priority: memory.low same as memory.max
medium priority: memory.low is 50% of memory.max
low priority: memory.low is 0

When all reclaimable pages without protection are reclaimed, the
reclaimer begins to reclaim the protected pages, but unfortunately it
doesn't know which pages belong to the high priority container and
which pages belong to the medium priority container. So the reclaimer
may reclaim from the high priority container first, without touching
the medium priority container at all.

Thanks
Yafang
Michal Hocko Aug. 21, 2019, 8:34 a.m. UTC | #21
On Wed 21-08-19 16:15:54, Yafang Shao wrote:
> On Wed, Aug 21, 2019 at 4:05 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> >
> > I am still not seeing why you are considering guarantee (memory.min)
> > rather than best practice (memory.low) here?
> 
> Let me show some examples for you.
> Suppose we have three containers with different priorities, which are
> high priority, medium priority and low priority.
> Then we set memory.low for these containers as below:
> high priority: memory.low same as memory.max
> medium priority: memory.low is 50% of memory.max
> low priority: memory.low is 0
> 
> When all reclaimable pages without protection are reclaimed, the
> reclaimer begins to reclaim the protected pages, but unfortunately it
> doesn't know which pages belong to the high priority container and
> which belong to the medium priority container. So the reclaimer may
> reclaim the high priority container first, without reclaiming the
> medium priority container at all.

Hmm, it is hard to comment on this configuration without knowing the
overall consumption of all three. In any case, reclaiming all of the
reclaimable memory means that you have actually reclaimed the whole of
the low container and half of the medium one before you even start
hitting the high priority one. When there are only low priority
protected containers, they will get reclaimed proportionally to their
size.
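
The proportional behaviour described here can be modelled with a small
sketch (a toy model in Python; the container names and page counts are
hypothetical, and this is not the kernel's actual
mem_cgroup_protected()/get_scan_count() logic):

```python
# Toy model of how much memory the reclaimer can take from each
# container before it has to start breaching memory.low protection.

def reclaimable(containers):
    """Pages reclaimable without breaching each container's memory.low."""
    return {name: max(0, c["usage"] - c["low"])
            for name, c in containers.items()}

containers = {
    # usage and memory.low in pages (hypothetical numbers)
    "high":   {"usage": 1000, "low": 1000},  # memory.low == memory.max
    "medium": {"usage": 1000, "low": 500},   # memory.low == 50% of max
    "low":    {"usage": 1000, "low": 0},     # unprotected
}

unprotected = reclaimable(containers)
print(unprotected)  # {'high': 0, 'medium': 500, 'low': 1000}

# Only after these 1500 unprotected pages are gone does reclaim start
# eating into protected memory -- i.e. all of "low" and half of
# "medium" are reclaimed before "high" is touched at all.
```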
Yafang Shao Aug. 21, 2019, 8:46 a.m. UTC | #22
On Wed, Aug 21, 2019 at 4:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 21-08-19 16:15:54, Yafang Shao wrote:
> > On Wed, Aug 21, 2019 at 4:05 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 21-08-19 15:26:56, Yafang Shao wrote:
> > > > On Wed, Aug 21, 2019 at 2:44 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Wed 21-08-19 09:00:39, Yafang Shao wrote:
> > > > > [...]
> > > > > > More possible OOMs is also a strong side effect (and it prevents us
> > > > > > from using it).
> > > > >
> > > > > So why don't you use low limit if the guarantee side of min limit is too
> > > > > strong for you?
> > > >
> > > > Well, I don't know what the best practice for memory.min is.
> > >
> > > It is really a workload reclaim protection. Say you have a memory
> > > consumer which performance characteristics would be noticeably disrupted
> > > by any memory reclaim which then would lead to SLA disruption. This is a
> > > strong requirement/QoS feature and as such comes with its demand on
> > > configuration.
> > >
> > > > In our plan, we want to use it to protect the top priority containers
> > > > (e.g. set memory.min the same as the memory limit), which may be
> > > > latency sensitive. Using memory.min may sometimes decrease refaults.
> > > > If we set it too low, it may be useless, because what memory.min is
> > > > protecting is not specified. And if there are some burst anon memory
> > > > allocations in this memcg, memory.min may not protect any file
> > > > memory.
> > >
> > > I am still not seeing why you are considering guarantee (memory.min)
> > > rather than best practice (memory.low) here?
> >
> > Let me show some examples for you.
> > Suppose we have three containers with different priorities, which are
> > high priority, medium priority and low priority.
> > Then we set memory.low for these containers as below:
> > high priority: memory.low same as memory.max
> > medium priority: memory.low is 50% of memory.max
> > low priority: memory.low is 0
> >
> > When all reclaimable pages without protection are reclaimed, the
> > reclaimer begins to reclaim the protected pages, but unfortunately it
> > doesn't know which pages belong to the high priority container and
> > which belong to the medium priority container. So the reclaimer may
> > reclaim the high priority container first, without reclaiming the
> > medium priority container at all.
>
> Hmm, it is hard to comment on this configuration without knowing the
> overall consumption of all three. In any case, reclaiming all of the
> reclaimable memory means that you have actually reclaimed the whole of
> the low container and half of the medium one before you even start
> hitting the high priority one. When there are only low priority
> protected containers, they will get reclaimed proportionally to their
> size.

Right.
I think a priority-based reclaimer (different priorities get different
proportional scan counts) would be a better fit, while memory.low is
not easy to put into practice in this situation.
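
Such a priority-based reclaimer does not exist in the kernel; as a
rough sketch of the idea only (the weights, names and budget-splitting
interface here are all made up):

```python
# Sketch: split a reclaim scan budget across containers in proportion
# to a per-priority weight, so lower-priority containers are scanned
# (and thus reclaimed) harder. Purely illustrative; the kernel's
# get_scan_count() works differently.

def split_scan_budget(budget, weights):
    """weights: {name: weight}; a higher weight means scanned more."""
    total = sum(weights.values())
    return {name: budget * w // total for name, w in weights.items()}

# Hypothetical weights: the low priority container is scanned 4x as
# hard as the high priority one for the same reclaim pressure.
shares = split_scan_budget(800, {"high": 1, "medium": 3, "low": 4})
print(shares)  # {'high': 100, 'medium': 300, 'low': 400}
```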

Thanks
Yafang

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44c4146..534fe92 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -337,6 +337,7 @@  static inline bool mem_cgroup_disabled(void)
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
+bool task_under_memcg_protection(struct task_struct *p);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -813,6 +814,11 @@  static inline enum mem_cgroup_protection mem_cgroup_protected(
 	return MEMCG_PROT_NONE;
 }
 
+static inline bool task_under_memcg_protection(struct task_struct *p)
+{
+	return false;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
 					struct mem_cgroup **memcgp,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cdbb7a8..8df3d88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6030,6 +6030,22 @@  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 		return MEMCG_PROT_NONE;
 }
 
+bool task_under_memcg_protection(struct task_struct *p)
+{
+	struct mem_cgroup *memcg;
+	bool protected;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(p);
+	if (memcg != root_mem_cgroup && memcg->memory.min)
+		protected = true;
+	else
+		protected = false;
+	rcu_read_unlock();
+
+	return protected;
+}
+
 /**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a..e9bdad3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,12 +367,31 @@  static void select_bad_process(struct oom_control *oc)
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
 	else {
+		bool memcg_check = false;
+		bool memcg_skip = false;
 		struct task_struct *p;
+		int selected = 0;
 
 		rcu_read_lock();
-		for_each_process(p)
-			if (oom_evaluate_task(p, oc))
+retry:
+		for_each_process(p) {
+			if (!memcg_check && task_under_memcg_protection(p)) {
+				memcg_skip = true;
+				continue;
+			}
+			selected = oom_evaluate_task(p, oc);
+			if (selected)
 				break;
+		}
+
+		if (!selected) {
+			if (memcg_skip) {
+				if (!oc->chosen || oc->chosen == (void *)-1UL) {
+					memcg_check = true;
+					goto retry;
+				}
+			}
+		}
 		rcu_read_unlock();
 	}
 }
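
For reference, the two-round selection that the select_bad_process()
hunk above implements behaves roughly like the following sketch (a toy
Python model; the task tuples and badness scores are hypothetical, and
the real code walks the task list under RCU using oom_evaluate_task()):

```python
# Toy model of the patch's two-round victim selection: skip tasks in
# memory.min-protected memcgs on the first pass, and rescan everything
# only if no victim was found. Illustrative only, not kernel code.

def select_victim(tasks):
    """tasks: list of (name, badness_score, protected_by_memcg_min)."""
    for check_protected in (False, True):
        chosen, chosen_score = None, -1
        for name, score, protected in tasks:
            if not check_protected and protected:
                continue          # first pass: skip protected tasks
            if score > chosen_score:
                chosen, chosen_score = name, score
        if chosen is not None:
            return chosen         # found a victim; no second pass
    return None                   # no killable task at all

tasks = [("a", 50, True), ("b", 30, False), ("c", 90, True)]
print(select_victim(tasks))              # prints "b": protected a/c skipped
print(select_victim([("a", 50, True)]))  # prints "a": second-pass fallback
```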