mm, memcg: fix wrong mem cgroup protection

Message ID	20200423061629.24185-1-laoar.shao@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=ftPA=6H=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org DF5E1206B9 From: Yafang Shao <laoar.shao@gmail.com> To: akpm@linux-foundation.org, mhocko@kernel.org, vdavydov.dev@gmail.com Cc: linux-mm@kvack.org, Yafang Shao <laoar.shao@gmail.com>, Chris Down <chris@chrisdown.name>, Roman Gushchin <guro@fb.com>, stable@vger.kernel.org Subject: [PATCH] mm, memcg: fix wrong mem cgroup protection Date: Thu, 23 Apr 2020 02:16:29 -0400 Message-Id: <20200423061629.24185-1-laoar.shao@gmail.com> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm, memcg: fix wrong mem cgroup protection \| expand mm, memcg: fix wrong mem cgroup protection

Yafang Shao April 23, 2020, 6:16 a.m. UTC

This patch is an improvement of a previous version[1], as the previous
version is not easy to understand.
This issue persists in the newest kernel, I have to resend the fix. As
the implementation is changed, I drop Roman's ack from the previous
version.

Here's the explanation of this issue.
memory.{low,min} won't take effect if the to-be-reclaimed memcg is the
sc->target_mem_cgroup, that can also be proved by the implementation in
mem_cgroup_protected(), see bellow,
	mem_cgroup_protected
		if (memcg == root) [2]
			return MEMCG_PROT_NONE;

But this rule is ignored in mem_cgroup_protection(), which will read
memory.{emin, elow} as the protection whatever the memcg is.

How would this issue happen?
Because in mem_cgroup_protected() we forget to clear the
memory.{emin, elow} if the memcg is target_mem_cgroup [2].

An example to illustrate this issue.
   root_mem_cgroup
         /
        A   memory.max: 1024M
            memory.min: 512M
            memory.current: 800M ('current' must be greater than 'min')
Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A.
Then kswapd stops.
As a result of it, the memory values of A will be,
   root_mem_cgroup
         /
        A   memory.max: 1024M
            memory.min: 512M
            memory.current: 512M (approximately)
            memory.emin: 512M

Then a new workload starts to run in memcg A, and it will trigger memcg
relcaim in A soon. As memcg A is the target_mem_cgroup of this
reclaimer, so it return directly without touching memory.{emin, elow}.[2]
The memory values of A will be,
   root_mem_cgroup
         /
        A   memory.max: 1024M
            memory.min: 512M
            memory.current: 1024M (approximately)
            memory.emin: 512M
Then this memory.emin will be used in mem_cgroup_protection() to get the
scan count, which is obvoiusly a wrong scan count.

[1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/

Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
Cc: Chris Down <chris@chrisdown.name>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 13 +++++++++++--
 mm/vmscan.c                |  4 ++--
 2 files changed, 13 insertions(+), 4 deletions(-)

Chris Down April 23, 2020, 3:33 p.m. UTC | #1

Hi Yafang,

I'm afraid I'm just as confused as Michal was about the intent of this patch.

Can you please be more concise and clear about the practical ramifications and 
demonstrate some pathological behaviour? I really can't visualise what's wrong 
here from your explanation, and if I can't understand it as the person who 
wrote this code, I am not surprised others are also confused :-)

Or maybe Roman can try to explain, since he acked the previous patch? At least 
to me, the emin/elow behaviour seems fairly non-trivial to reason about right 
now.

Thanks.

Roman Gushchin April 23, 2020, 9:06 p.m. UTC | #2

On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> This patch is an improvement of a previous version[1], as the previous
> version is not easy to understand.
> This issue persists in the newest kernel, I have to resend the fix. As
> the implementation is changed, I drop Roman's ack from the previous
> version.
> 
> Here's the explanation of this issue.
> memory.{low,min} won't take effect if the to-be-reclaimed memcg is the
> sc->target_mem_cgroup, that can also be proved by the implementation in
> mem_cgroup_protected(), see bellow,
> 	mem_cgroup_protected
> 		if (memcg == root) [2]
> 			return MEMCG_PROT_NONE;
> 
> But this rule is ignored in mem_cgroup_protection(), which will read
> memory.{emin, elow} as the protection whatever the memcg is.
> 
> How would this issue happen?
> Because in mem_cgroup_protected() we forget to clear the
> memory.{emin, elow} if the memcg is target_mem_cgroup [2].
> 
> An example to illustrate this issue.
>    root_mem_cgroup
>          /
>         A   memory.max: 1024M
>             memory.min: 512M
>             memory.current: 800M ('current' must be greater than 'min')
> Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A.
> Then kswapd stops.
> As a result of it, the memory values of A will be,
>    root_mem_cgroup
>          /
>         A   memory.max: 1024M
>             memory.min: 512M
>             memory.current: 512M (approximately)
>             memory.emin: 512M
> 
> Then a new workload starts to run in memcg A, and it will trigger memcg
> relcaim in A soon. As memcg A is the target_mem_cgroup of this
> reclaimer, so it return directly without touching memory.{emin, elow}.[2]
> The memory values of A will be,
>    root_mem_cgroup
>          /
>         A   memory.max: 1024M
>             memory.min: 512M
>             memory.current: 1024M (approximately)
>             memory.emin: 512M
> Then this memory.emin will be used in mem_cgroup_protection() to get the
> scan count, which is obvoiusly a wrong scan count.
> 
> [1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/
> 
> Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> Cc: Chris Down <chris@chrisdown.name>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/memcontrol.h | 13 +++++++++++--
>  mm/vmscan.c                |  4 ++--
>  2 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d275c72c4f8e..114cfe06bf60 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -344,12 +344,20 @@ static inline bool mem_cgroup_disabled(void)
>  	return !cgroup_subsys_enabled(memory_cgrp_subsys);
>  }
>  
> -static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
> +						  struct mem_cgroup *memcg,
>  						  bool in_low_reclaim)

I'd rename "root" to "target", maybe it will make the whole thing more clear.

I'll think a bit more about it, but at the first glance the patch looks sane to me.

Thanks!

Roman Gushchin April 23, 2020, 9:13 p.m. UTC | #3

On Thu, Apr 23, 2020 at 04:33:23PM +0100, Chris Down wrote:
> Hi Yafang,
> 
> I'm afraid I'm just as confused as Michal was about the intent of this patch.
> 
> Can you please be more concise and clear about the practical ramifications
> and demonstrate some pathological behaviour? I really can't visualise what's
> wrong here from your explanation, and if I can't understand it as the person
> who wrote this code, I am not surprised others are also confused :-)
> 
> Or maybe Roman can try to explain, since he acked the previous patch? At
> least to me, the emin/elow behaviour seems fairly non-trivial to reason
> about right now.

Hi Chris!

So the thing is that emin/elow cached values are shared between global and
targeted (caused by memory.max) reclaim. It's racy by design, but in general
it should work ok, because in the end we'll reclaim or not approximately
the same amount of memory.

In the case which Yafang described, the emin value calculated in the process
of the global reclaim leads to a slowdown of the targeted reclaim. It's not
a tragedy, but not perfect too. It seems that the proposed patch makes it better,
and as now I don't see any bad consequences.

Thanks!

Yafang Shao April 24, 2020, 12:29 a.m. UTC | #4

On Fri, Apr 24, 2020 at 5:06 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > This patch is an improvement of a previous version[1], as the previous
> > version is not easy to understand.
> > This issue persists in the newest kernel, I have to resend the fix. As
> > the implementation is changed, I drop Roman's ack from the previous
> > version.
> >
> > Here's the explanation of this issue.
> > memory.{low,min} won't take effect if the to-be-reclaimed memcg is the
> > sc->target_mem_cgroup, that can also be proved by the implementation in
> > mem_cgroup_protected(), see bellow,
> >       mem_cgroup_protected
> >               if (memcg == root) [2]
> >                       return MEMCG_PROT_NONE;
> >
> > But this rule is ignored in mem_cgroup_protection(), which will read
> > memory.{emin, elow} as the protection whatever the memcg is.
> >
> > How would this issue happen?
> > Because in mem_cgroup_protected() we forget to clear the
> > memory.{emin, elow} if the memcg is target_mem_cgroup [2].
> >
> > An example to illustrate this issue.
> >    root_mem_cgroup
> >          /
> >         A   memory.max: 1024M
> >             memory.min: 512M
> >             memory.current: 800M ('current' must be greater than 'min')
> > Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A.
> > Then kswapd stops.
> > As a result of it, the memory values of A will be,
> >    root_mem_cgroup
> >          /
> >         A   memory.max: 1024M
> >             memory.min: 512M
> >             memory.current: 512M (approximately)
> >             memory.emin: 512M
> >
> > Then a new workload starts to run in memcg A, and it will trigger memcg
> > relcaim in A soon. As memcg A is the target_mem_cgroup of this
> > reclaimer, so it return directly without touching memory.{emin, elow}.[2]
> > The memory values of A will be,
> >    root_mem_cgroup
> >          /
> >         A   memory.max: 1024M
> >             memory.min: 512M
> >             memory.current: 1024M (approximately)
> >             memory.emin: 512M
> > Then this memory.emin will be used in mem_cgroup_protection() to get the
> > scan count, which is obvoiusly a wrong scan count.
> >
> > [1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/
> >
> > Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> > Cc: Chris Down <chris@chrisdown.name>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/memcontrol.h | 13 +++++++++++--
> >  mm/vmscan.c                |  4 ++--
> >  2 files changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index d275c72c4f8e..114cfe06bf60 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -344,12 +344,20 @@ static inline bool mem_cgroup_disabled(void)
> >       return !cgroup_subsys_enabled(memory_cgrp_subsys);
> >  }
> >
> > -static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *root,
> > +                                               struct mem_cgroup *memcg,
> >                                                 bool in_low_reclaim)
>
> I'd rename "root" to "target", maybe it will make the whole thing more clear.
>

That would make it better.  I will change it.

> I'll think a bit more about it, but at the first glance the patch looks sane to me.
>
> Thanks!

Yafang Shao April 24, 2020, 12:32 a.m. UTC | #5

On Fri, Apr 24, 2020 at 5:13 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Thu, Apr 23, 2020 at 04:33:23PM +0100, Chris Down wrote:
> > Hi Yafang,
> >
> > I'm afraid I'm just as confused as Michal was about the intent of this patch.
> >
> > Can you please be more concise and clear about the practical ramifications
> > and demonstrate some pathological behaviour? I really can't visualise what's
> > wrong here from your explanation, and if I can't understand it as the person
> > who wrote this code, I am not surprised others are also confused :-)
> >
> > Or maybe Roman can try to explain, since he acked the previous patch? At
> > least to me, the emin/elow behaviour seems fairly non-trivial to reason
> > about right now.
>
> Hi Chris!
>
> So the thing is that emin/elow cached values are shared between global and
> targeted (caused by memory.max) reclaim. It's racy by design, but in general
> it should work ok, because in the end we'll reclaim or not approximately
> the same amount of memory.
>
> In the case which Yafang described, the emin value calculated in the process
> of the global reclaim leads to a slowdown of the targeted reclaim. It's not
> a tragedy, but not perfect too. It seems that the proposed patch makes it better,
> and as now I don't see any bad consequences.
>

Thanks for your explanation. Your explanation make the issue more clear.
If you don't mind, I will copy some of your comments to improve the commit log.

Yafang Shao April 24, 2020, 12:49 a.m. UTC | #6

On Thu, Apr 23, 2020 at 11:33 PM Chris Down <chris@chrisdown.name> wrote:
>
> Hi Yafang,
>
> I'm afraid I'm just as confused as Michal was about the intent of this patch.
>
> Can you please be more concise and clear about the practical ramifications and
> demonstrate some pathological behaviour? I really can't visualise what's wrong
> here from your explanation, and if I can't understand it as the person who
> wrote this code, I am not surprised others are also confused :-)
>

Hi Chris,

If the author can't understand deeply in the code worte by
himself/herself, I think the author should do more test on his/her
patches.
Regarding the issue in this case, my understanding is you know the
benefit of proportional reclaim, but I'm wondering that do you know
the loss if the proportional is not correct ?
I don't mean to affend you, while I just try to explain how the
community should cooperate.

> Or maybe Roman can try to explain, since he acked the previous patch? At least
> to me, the emin/elow behaviour seems fairly non-trivial to reason about right
> now.
>

Michal Hocko April 24, 2020, 10:40 a.m. UTC | #7

On Thu 23-04-20 14:13:19, Roman Gushchin wrote:
> On Thu, Apr 23, 2020 at 04:33:23PM +0100, Chris Down wrote:
> > Hi Yafang,
> > 
> > I'm afraid I'm just as confused as Michal was about the intent of this patch.
> > 
> > Can you please be more concise and clear about the practical ramifications
> > and demonstrate some pathological behaviour? I really can't visualise what's
> > wrong here from your explanation, and if I can't understand it as the person
> > who wrote this code, I am not surprised others are also confused :-)
> > 
> > Or maybe Roman can try to explain, since he acked the previous patch? At
> > least to me, the emin/elow behaviour seems fairly non-trivial to reason
> > about right now.
> 
> Hi Chris!
> 
> So the thing is that emin/elow cached values are shared between global and
> targeted (caused by memory.max) reclaim. It's racy by design, but in general
> it should work ok, because in the end we'll reclaim or not approximately
> the same amount of memory.
> 
> In the case which Yafang described, the emin value calculated in the process
> of the global reclaim leads to a slowdown of the targeted reclaim. It's not
> a tragedy, but not perfect too. It seems that the proposed patch makes it better,
> and as now I don't see any bad consequences.

Do we have any means to quantify the effect?

I do understand the racy nature of the effective protection values. We
do update them in mem_cgroup_protected and that handles the
reclaim_target == memcg case already. So why do we care later on in
mem_cgroup_protection? And why don't we care about any other concurrent
reclaimers which have a different reclaim memcg target? This just
doesn't make any sense.

Either we do care about races because they are harmful and then we care
for all possible case or we don't and this patch doesn't really any big
value. Or I still miss the point.

Yafang Shao April 24, 2020, 10:57 a.m. UTC | #8

On Fri, Apr 24, 2020 at 6:40 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 23-04-20 14:13:19, Roman Gushchin wrote:
> > On Thu, Apr 23, 2020 at 04:33:23PM +0100, Chris Down wrote:
> > > Hi Yafang,
> > >
> > > I'm afraid I'm just as confused as Michal was about the intent of this patch.
> > >
> > > Can you please be more concise and clear about the practical ramifications
> > > and demonstrate some pathological behaviour? I really can't visualise what's
> > > wrong here from your explanation, and if I can't understand it as the person
> > > who wrote this code, I am not surprised others are also confused :-)
> > >
> > > Or maybe Roman can try to explain, since he acked the previous patch? At
> > > least to me, the emin/elow behaviour seems fairly non-trivial to reason
> > > about right now.
> >
> > Hi Chris!
> >
> > So the thing is that emin/elow cached values are shared between global and
> > targeted (caused by memory.max) reclaim. It's racy by design, but in general
> > it should work ok, because in the end we'll reclaim or not approximately
> > the same amount of memory.
> >
> > In the case which Yafang described, the emin value calculated in the process
> > of the global reclaim leads to a slowdown of the targeted reclaim. It's not
> > a tragedy, but not perfect too. It seems that the proposed patch makes it better,
> > and as now I don't see any bad consequences.
>
> Do we have any means to quantify the effect?
>
> I do understand the racy nature of the effective protection values. We
> do update them in mem_cgroup_protected and that handles the
> reclaim_target == memcg case already. So why do we care later on in
> mem_cgroup_protection? And why don't we care about any other concurrent
> reclaimers which have a different reclaim memcg target? This just
> doesn't make any sense.
>

No, you missed the point.
The issue pointed by me isn't related with racy.  Roman also explained
that the racy is not the point.

> Either we do care about races because they are harmful and then we care
> for all possible case or we don't and this patch doesn't really any big
> value. Or I still miss the point.
>

The real point is memcg relcaim will get a wrong value from
mem_cgroup_protection() after memory.emin is set.
Suppose target memcg has memory.current is 2G and memory.min is 2G, in
memcg reclaim, the scan count will be
(SWAP_CLUSTER_MAX>>sc->priority), rather than (lruvec_size >>
sc->priority). That's a slowdown, and that's explained by Roman as
well.

Chris Down April 24, 2020, 12:18 p.m. UTC | #9

Yafang Shao writes:
>If the author can't understand deeply in the code worte by
>himself/herself, I think the author should do more test on his/her
>patches.
>Regarding the issue in this case, my understanding is you know the
>benefit of proportional reclaim, but I'm wondering that do you know
>the loss if the proportional is not correct ?
>I don't mean to affend you, while I just try to explain how the
>community should cooperate.

I'm pretty sure that since multiple people on mm list have already expressed 
confusion at this patch, this isn't a question of testing, but of lack of 
clarity in usage :-)

Promoting "testing" as a panacea for this issue misses a significant part of 
the real problem: that the intended semantics and room for allowed races is 
currently unclear, which is why there is a general sense of confusion around 
your proposed patch and what it solves. If more testing would help, then the 
benefit of your patch should be patently obvious -- but it isn't.

Yafang Shao April 24, 2020, 12:44 p.m. UTC | #10

On Fri, Apr 24, 2020 at 8:18 PM Chris Down <chris@chrisdown.name> wrote:
>
> Yafang Shao writes:
> >If the author can't understand deeply in the code worte by
> >himself/herself, I think the author should do more test on his/her
> >patches.
> >Regarding the issue in this case, my understanding is you know the
> >benefit of proportional reclaim, but I'm wondering that do you know
> >the loss if the proportional is not correct ?
> >I don't mean to affend you, while I just try to explain how the
> >community should cooperate.
>
> I'm pretty sure that since multiple people on mm list have already expressed
> confusion at this patch, this isn't a question of testing, but of lack of
> clarity in usage :-)
>
> Promoting "testing" as a panacea for this issue misses a significant part of
> the real problem: that the intended semantics and room for allowed races is
> currently unclear, which is why there is a general sense of confusion around
> your proposed patch and what it solves. If more testing would help, then the
> benefit of your patch should be patently obvious -- but it isn't.

I have shown a testcase in my commit log.
Bellow is the result without my patch,

[  601.811428] vmscan: protection 1048576 memcg /foo target memcg /foo
[  601.811429] vmscan:
[  601.811429] vmscan: protection 1048576 memcg /foo target memcg /foo
[  601.811430] vmscan:
[  601.811430] vmscan: protection 1048576 memcg /foo target memcg /foo
[  601.811431] vmscan:
[  602.452791] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452795] vmscan:
[  602.452796] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452805] vmscan:
[  602.452805] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452806] vmscan:
[  602.452807] vmscan: protection 1048576 memcg /foo target memcg /foo
[  602.452808] vmscan:


Here's patch to print the above info.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b06868fc4926..7525665d7cec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2344,10 +2344,18 @@ static void get_scan_count(struct lruvec
*lruvec, struct scan_control *sc,
                unsigned long lruvec_size;
                unsigned long scan;
                unsigned long protection;
+               struct mem_cgroup *target = sc->target_mem_cgroup;

                lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
                protection = mem_cgroup_protection(memcg,
                                                   sc->memcg_low_reclaim);
+               if (memcg && memcg != root_mem_cgroup && target) {
+                       pr_info("protection %lu memcg ", protection);
+                       pr_cont_cgroup_path(memcg->css.cgroup);
+                       pr_cont(" target memcg ");
+                       pr_cont_cgroup_path(target->css.cgroup);
+                       pr_info("\n");
+               }

                if (protection) {


So my question is that do you think the protection in these log is okay ?
Can you answer me ?

Hint: what should protection be if memcg is the target memcg ?

Chris Down April 24, 2020, 1:05 p.m. UTC | #11

I'm not debating whether your test case is correct or not, or whether the 
numbers are correct or not. The point is that 3 out of 4 people from mm list 
who have looked at this patch have no idea what it's trying to do, why it's 
important, or why these numbers *should* necessarily be wrong.

It's good that you have provided a way to demonstrate the effects of your 
changes, but one can't solve cognitive gaps with testing. If one could, then 
things like TDD would be effective on complex projects -- but they aren't[0].

We're really getting off in the weeds here, though. I just want to make it 
clear to you that if you think "more testing" is a solution to kernel ailments 
like this, you're going to be disappointed.

0: http://people.brunel.ac.uk/~csstmms/FucciEtAl_ESEM2016.pdf

Yafang Shao April 24, 2020, 1:10 p.m. UTC | #12

On Fri, Apr 24, 2020 at 9:05 PM Chris Down <chris@chrisdown.name> wrote:
>
> I'm not debating whether your test case is correct or not, or whether the
> numbers are correct or not. The point is that 3 out of 4 people from mm list
> who have looked at this patch have no idea what it's trying to do, why it's
> important, or why these numbers *should* necessarily be wrong.
>
> It's good that you have provided a way to demonstrate the effects of your
> changes, but one can't solve cognitive gaps with testing. If one could, then
> things like TDD would be effective on complex projects -- but they aren't[0].
>
> We're really getting off in the weeds here, though. I just want to make it
> clear to you that if you think "more testing" is a solution to kernel ailments
> like this, you're going to be disappointed.
>
> 0: http://people.brunel.ac.uk/~csstmms/FucciEtAl_ESEM2016.pdf


Hi  Chris,

Pls. answer my question directly - what should protection be if memcg
is the target memcg ?
If you can't answer my quesiont, I suggest to revist your patch.

Johannes Weiner April 24, 2020, 1:14 p.m. UTC | #13

On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> This patch is an improvement of a previous version[1], as the previous
> version is not easy to understand.
> This issue persists in the newest kernel, I have to resend the fix. As
> the implementation is changed, I drop Roman's ack from the previous
> version.

Now that I understand the problem, I much prefer the previous version.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 745697906ce3..2bf91ae1e640 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,

 	if (!root)
 		root = root_mem_cgroup;
-	if (memcg == root)
+	if (memcg == root) {
+		/*
+		 * The cgroup is the reclaim root in this reclaim
+		 * cycle, and therefore not protected. But it may have
+		 * stale effective protection values from previous
+		 * cycles in which it was not the reclaim root - for
+		 * example, global reclaim followed by limit reclaim.
+		 * Reset these values for mem_cgroup_protection().
+		 */
+		memcg->memory.emin = 0;
+		memcg->memory.elow = 0;
 		return MEMCG_PROT_NONE;
+	}

 	usage = page_counter_read(&memcg->memory);
 	if (!usage)

> Here's the explanation of this issue.
> memory.{low,min} won't take effect if the to-be-reclaimed memcg is the
> sc->target_mem_cgroup, that can also be proved by the implementation in
> mem_cgroup_protected(), see bellow,
> 	mem_cgroup_protected
> 		if (memcg == root) [2]
> 			return MEMCG_PROT_NONE;
> 
> But this rule is ignored in mem_cgroup_protection(), which will read
> memory.{emin, elow} as the protection whatever the memcg is.
> 
> How would this issue happen?
> Because in mem_cgroup_protected() we forget to clear the
> memory.{emin, elow} if the memcg is target_mem_cgroup [2].
> 
> An example to illustrate this issue.
>    root_mem_cgroup
>          /
>         A   memory.max: 1024M
>             memory.min: 512M
>             memory.current: 800M ('current' must be greater than 'min')
> Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A.
> Then kswapd stops.
> As a result of it, the memory values of A will be,
>    root_mem_cgroup
>          /
>         A   memory.max: 1024M
>             memory.min: 512M
>             memory.current: 512M (approximately)
>             memory.emin: 512M
> 
> Then a new workload starts to run in memcg A, and it will trigger memcg
> relcaim in A soon. As memcg A is the target_mem_cgroup of this
> reclaimer, so it return directly without touching memory.{emin, elow}.[2]
> The memory values of A will be,
>    root_mem_cgroup
>          /
>         A   memory.max: 1024M
>             memory.min: 512M
>             memory.current: 1024M (approximately)
>             memory.emin: 512M
> Then this memory.emin will be used in mem_cgroup_protection() to get the
> scan count, which is obvoiusly a wrong scan count.
> 
> [1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/
> 
> Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> Cc: Chris Down <chris@chrisdown.name>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

As others have noted, it's fairly hard to understand the problem from
the above changelog. How about the following:

A cgroup can have both memory protection and a memory limit to isolate
it from its siblings in both directions - for example, to prevent it
from being shrunk below 2G under high pressure from outside, but also
from growing beyond 4G under low pressure.

9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
implemented proportional scan pressure so that multiple siblings in
excess of their protection settings don't get reclaimed equally but
instead in accordance to their unprotected portion.

During limit reclaim, this proportionality shouldn't apply of course:
there is no competition, all pressure is from within the cgroup and
should be applied as such. Reclaim should operate at full efficiency.

However, mem_cgroup_protected() never expected anybody to look at the
effective protection values when it indicated that the cgroup is above
its protection. As a result, a query during limit reclaim may return
stale protection values that were calculated by a previous reclaim
cycle in which the cgroup did have siblings.

When this happens, reclaim is unnecessarily hesitant and potentially
slow to meet the desired limit. In theory this could lead to premature
OOM kills, although it's not obvious this has occurred in practice.

Johannes Weiner April 24, 2020, 1:44 p.m. UTC | #14

On Fri, Apr 24, 2020 at 09:14:52AM -0400, Johannes Weiner wrote:
> However, mem_cgroup_protected() never expected anybody to look at the
> effective protection values when it indicated that the cgroup is above
> its protection. As a result, a query during limit reclaim may return
> stale protection values that were calculated by a previous reclaim
> cycle in which the cgroup did have siblings.

Btw, I think there is opportunity to make this a bit less error prone.

We have a mem_cgroup_protected() that returns yes or no, essentially,
but protection isn't a binary state anymore.

It's also been a bit iffy that it looks like a simple predicate
function, but it indeed needs to run procedurally for each cgroup in
order for the calculations throughout the tree to be correct.

It might be better to have a

	mem_cgroup_calculate_protection()

that runs for every cgroup we visit and sets up the internal state;
then have more self-explanatory query functions on top of that:

	mem_cgroup_below_min()
	mem_cgroup_below_low()
	mem_cgroup_protection()

What do you guys think?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0f502b5fca6..dbd3f75d39b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2615,14 +2615,15 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		unsigned long reclaimed;
 		unsigned long scanned;
 
-		switch (mem_cgroup_protected(target_memcg, memcg)) {
-		case MEMCG_PROT_MIN:
+		mem_cgroup_calculate_protection(target_memcg, memcg);
+
+		if (mem_cgroup_below_min(memcg)) {
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
 			 */
 			continue;
-		case MEMCG_PROT_LOW:
+		} else if (mem_cgroup_below_low(memcg)) {
 			/*
 			 * Soft protection.
 			 * Respect the protection only as long as
@@ -2634,16 +2635,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				continue;
 			}
 			memcg_memory_event(memcg, MEMCG_LOW);
-			break;
-		case MEMCG_PROT_NONE:
-			/*
-			 * All protection thresholds breached. We may
-			 * still choose to vary the scan pressure
-			 * applied based on by how much the cgroup in
-			 * question has exceeded its protection
-			 * thresholds (see get_scan_count).
-			 */
-			break;
 		}
 
 		reclaimed = sc->nr_reclaimed;

Michal Hocko April 24, 2020, 2:29 p.m. UTC | #15

On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > This patch is an improvement of a previous version[1], as the previous
> > version is not easy to understand.
> > This issue persists in the newest kernel, I have to resend the fix. As
> > the implementation is changed, I drop Roman's ack from the previous
> > version.
> 
> Now that I understand the problem, I much prefer the previous version.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 745697906ce3..2bf91ae1e640 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>  
>  	if (!root)
>  		root = root_mem_cgroup;
> -	if (memcg == root)
> +	if (memcg == root) {
> +		/*
> +		 * The cgroup is the reclaim root in this reclaim
> +		 * cycle, and therefore not protected. But it may have
> +		 * stale effective protection values from previous
> +		 * cycles in which it was not the reclaim root - for
> +		 * example, global reclaim followed by limit reclaim.
> +		 * Reset these values for mem_cgroup_protection().
> +		 */
> +		memcg->memory.emin = 0;
> +		memcg->memory.elow = 0;
>  		return MEMCG_PROT_NONE;
> +	}

Could you be more specific why you prefer this over the
mem_cgroup_protection which doesn't change the effective value?
Isn't it easier to simply ignore effective value for the reclaim roots?

[...]

> As others have noted, it's fairly hard to understand the problem from
> the above changelog. How about the following:
> 
> A cgroup can have both memory protection and a memory limit to isolate
> it from its siblings in both directions - for example, to prevent it
> from being shrunk below 2G under high pressure from outside, but also
> from growing beyond 4G under low pressure.
> 
> 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> implemented proportional scan pressure so that multiple siblings in
> excess of their protection settings don't get reclaimed equally but
> instead in accordance to their unprotected portion.
> 
> During limit reclaim, this proportionality shouldn't apply of course:
> there is no competition, all pressure is from within the cgroup and
> should be applied as such. Reclaim should operate at full efficiency.
> 
> However, mem_cgroup_protected() never expected anybody to look at the
> effective protection values when it indicated that the cgroup is above
> its protection. As a result, a query during limit reclaim may return
> stale protection values that were calculated by a previous reclaim
> cycle in which the cgroup did have siblings.

This is better. Thanks!

> When this happens, reclaim is unnecessarily hesitant and potentially
> slow to meet the desired limit. In theory this could lead to premature
> OOM kills, although it's not obvious this has occurred in practice.

I do not see how this would lead all the way to OOM killer but it
certainly can lead to unnecessary increase of the reclaim priority. The
smaller the difference between the reclaim target and protection the
more visible the effect would be. But if there are reclaimable pages
then the reclaim should see them sooner or later

Michal Hocko April 24, 2020, 2:33 p.m. UTC | #16

On Fri 24-04-20 09:44:38, Johannes Weiner wrote:
> On Fri, Apr 24, 2020 at 09:14:52AM -0400, Johannes Weiner wrote:
> > However, mem_cgroup_protected() never expected anybody to look at the
> > effective protection values when it indicated that the cgroup is above
> > its protection. As a result, a query during limit reclaim may return
> > stale protection values that were calculated by a previous reclaim
> > cycle in which the cgroup did have siblings.
> 
> Btw, I think there is opportunity to make this a bit less error prone.
> 
> We have a mem_cgroup_protected() that returns yes or no, essentially,
> but protection isn't a binary state anymore.
> 
> It's also been a bit iffy that it looks like a simple predicate
> function, but it indeed needs to run procedurally for each cgroup in
> order for the calculations throughout the tree to be correct.
> 
> It might be better to have a
> 
> 	mem_cgroup_calculate_protection()
> 
> that runs for every cgroup we visit and sets up the internal state;
> then have more self-explanatory query functions on top of that:
> 
> 	mem_cgroup_below_min()
> 	mem_cgroup_below_low()
> 	mem_cgroup_protection()
> 
> What do you guys think?

Fully agreed. I have to say that I have considered the predicate with
side effects really confusing.

Johannes Weiner April 24, 2020, 3:10 p.m. UTC | #17

On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > This patch is an improvement of a previous version[1], as the previous
> > > version is not easy to understand.
> > > This issue persists in the newest kernel, I have to resend the fix. As
> > > the implementation is changed, I drop Roman's ack from the previous
> > > version.
> > 
> > Now that I understand the problem, I much prefer the previous version.
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 745697906ce3..2bf91ae1e640 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> >  
> >  	if (!root)
> >  		root = root_mem_cgroup;
> > -	if (memcg == root)
> > +	if (memcg == root) {
> > +		/*
> > +		 * The cgroup is the reclaim root in this reclaim
> > +		 * cycle, and therefore not protected. But it may have
> > +		 * stale effective protection values from previous
> > +		 * cycles in which it was not the reclaim root - for
> > +		 * example, global reclaim followed by limit reclaim.
> > +		 * Reset these values for mem_cgroup_protection().
> > +		 */
> > +		memcg->memory.emin = 0;
> > +		memcg->memory.elow = 0;
> >  		return MEMCG_PROT_NONE;
> > +	}
> 
> Could you be more specific why you prefer this over the
> mem_cgroup_protection which doesn't change the effective value?
> Isn't it easier to simply ignore effective value for the reclaim roots?

Because now both mem_cgroup_protection() and mem_cgroup_protected()
have to know about the reclaim root semantics, instead of just the one
central place.

And the query function has to know additional rules about when the
emin/elow values are uptodate or it could silently be looking at stale
data, which isn't very robust.

"The effective protection values are uptodate after calling
mem_cgroup_protected() inside the reclaim cycle - UNLESS the group
you're looking at happens to be..."

It's much easier to make the rule: The values are uptodate after you
called mem_cgroup_protected().

Or mem_cgroup_calculate_protection(), if we go with that later.

> > As others have noted, it's fairly hard to understand the problem from
> > the above changelog. How about the following:
> > 
> > A cgroup can have both memory protection and a memory limit to isolate
> > it from its siblings in both directions - for example, to prevent it
> > from being shrunk below 2G under high pressure from outside, but also
> > from growing beyond 4G under low pressure.
> > 
> > 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> > implemented proportional scan pressure so that multiple siblings in
> > excess of their protection settings don't get reclaimed equally but
> > instead in accordance to their unprotected portion.
> > 
> > During limit reclaim, this proportionality shouldn't apply of course:
> > there is no competition, all pressure is from within the cgroup and
> > should be applied as such. Reclaim should operate at full efficiency.
> > 
> > However, mem_cgroup_protected() never expected anybody to look at the
> > effective protection values when it indicated that the cgroup is above
> > its protection. As a result, a query during limit reclaim may return
> > stale protection values that were calculated by a previous reclaim
> > cycle in which the cgroup did have siblings.
> 
> This is better. Thanks!
> 
> > When this happens, reclaim is unnecessarily hesitant and potentially
> > slow to meet the desired limit. In theory this could lead to premature
> > OOM kills, although it's not obvious this has occurred in practice.
> 
> I do not see how this would lead all the way to OOM killer but it
> certainly can lead to unnecessary increase of the reclaim priority. The
> smaller the difference between the reclaim target and protection the
> more visible the effect would be. But if there are reclaimable pages
> then the reclaim should see them sooner or later

It would be a pretty extreme case, but not impossible AFAICS, because
OOM is just a sampled state, not deterministic.

If memory.max is 64G and memory.low is 64G minus one page, this bug
could cause limit reclaim to look at no more than SWAP_CLUSTER_MAX
pages at priority 0. It's possible it wouldn't get through the full
64G worth of memory before giving up and declaring OOM.

Not that that would be a sensical configuration... My point is that
OOM is defined as "I've looked at X pages and found nothing" and this
bug can significantly lower X.

Yafang Shao April 24, 2020, 4 p.m. UTC | #18

On Fri, Apr 24, 2020 at 9:14 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > This patch is an improvement of a previous version[1], as the previous
> > version is not easy to understand.
> > This issue persists in the newest kernel, I have to resend the fix. As
> > the implementation is changed, I drop Roman's ack from the previous
> > version.
>
> Now that I understand the problem, I much prefer the previous version.
>

Great news that this issue is understood by one more reviewer.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 745697906ce3..2bf91ae1e640 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
>
>         if (!root)
>                 root = root_mem_cgroup;
> -       if (memcg == root)
> +       if (memcg == root) {
> +               /*
> +                * The cgroup is the reclaim root in this reclaim
> +                * cycle, and therefore not protected. But it may have
> +                * stale effective protection values from previous
> +                * cycles in which it was not the reclaim root - for
> +                * example, global reclaim followed by limit reclaim.
> +                * Reset these values for mem_cgroup_protection().
> +                */
> +               memcg->memory.emin = 0;
> +               memcg->memory.elow = 0;
>                 return MEMCG_PROT_NONE;
> +       }
>
>         usage = page_counter_read(&memcg->memory);
>         if (!usage)
>
> > Here's the explanation of this issue.
> > memory.{low,min} won't take effect if the to-be-reclaimed memcg is the
> > sc->target_mem_cgroup, that can also be proved by the implementation in
> > mem_cgroup_protected(), see bellow,
> >       mem_cgroup_protected
> >               if (memcg == root) [2]
> >                       return MEMCG_PROT_NONE;
> >
> > But this rule is ignored in mem_cgroup_protection(), which will read
> > memory.{emin, elow} as the protection whatever the memcg is.
> >
> > How would this issue happen?
> > Because in mem_cgroup_protected() we forget to clear the
> > memory.{emin, elow} if the memcg is target_mem_cgroup [2].
> >
> > An example to illustrate this issue.
> >    root_mem_cgroup
> >          /
> >         A   memory.max: 1024M
> >             memory.min: 512M
> >             memory.current: 800M ('current' must be greater than 'min')
> > Once kswapd starts to reclaim memcg A, it assigns 512M to memory.emin of A.
> > Then kswapd stops.
> > As a result of it, the memory values of A will be,
> >    root_mem_cgroup
> >          /
> >         A   memory.max: 1024M
> >             memory.min: 512M
> >             memory.current: 512M (approximately)
> >             memory.emin: 512M
> >
> > Then a new workload starts to run in memcg A, and it will trigger memcg
> > relcaim in A soon. As memcg A is the target_mem_cgroup of this
> > reclaimer, so it return directly without touching memory.{emin, elow}.[2]
> > The memory values of A will be,
> >    root_mem_cgroup
> >          /
> >         A   memory.max: 1024M
> >             memory.min: 512M
> >             memory.current: 1024M (approximately)
> >             memory.emin: 512M
> > Then this memory.emin will be used in mem_cgroup_protection() to get the
> > scan count, which is obvoiusly a wrong scan count.
> >
> > [1]. https://lore.kernel.org/linux-mm/20200216145249.6900-1-laoar.shao@gmail.com/
> >
> > Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> > Cc: Chris Down <chris@chrisdown.name>
> > Cc: Roman Gushchin <guro@fb.com>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>
> As others have noted, it's fairly hard to understand the problem from
> the above changelog. How about the following:
>
> A cgroup can have both memory protection and a memory limit to isolate
> it from its siblings in both directions - for example, to prevent it
> from being shrunk below 2G under high pressure from outside, but also
> from growing beyond 4G under low pressure.
>
> 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> implemented proportional scan pressure so that multiple siblings in
> excess of their protection settings don't get reclaimed equally but
> instead in accordance to their unprotected portion.
>
> During limit reclaim, this proportionality shouldn't apply of course:
> there is no competition, all pressure is from within the cgroup and
> should be applied as such. Reclaim should operate at full efficiency.
>
> However, mem_cgroup_protected() never expected anybody to look at the
> effective protection values when it indicated that the cgroup is above
> its protection. As a result, a query during limit reclaim may return
> stale protection values that were calculated by a previous reclaim
> cycle in which the cgroup did have siblings.
>
> When this happens, reclaim is unnecessarily hesitant and potentially
> slow to meet the desired limit. In theory this could lead to premature
> OOM kills, although it's not obvious this has occurred in practice.

Thanks a lot for your improvement on the change log.

--
Thanks
Yafang

Yafang Shao April 24, 2020, 4:08 p.m. UTC | #19

On Fri, Apr 24, 2020 at 9:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 24, 2020 at 09:14:52AM -0400, Johannes Weiner wrote:
> > However, mem_cgroup_protected() never expected anybody to look at the
> > effective protection values when it indicated that the cgroup is above
> > its protection. As a result, a query during limit reclaim may return
> > stale protection values that were calculated by a previous reclaim
> > cycle in which the cgroup did have siblings.
>
> Btw, I think there is opportunity to make this a bit less error prone.
>
> We have a mem_cgroup_protected() that returns yes or no, essentially,
> but protection isn't a binary state anymore.
>
> It's also been a bit iffy that it looks like a simple predicate
> function, but it indeed needs to run procedurally for each cgroup in
> order for the calculations throughout the tree to be correct.
>
> It might be better to have a
>
>         mem_cgroup_calculate_protection()
>
> that runs for every cgroup we visit and sets up the internal state;
> then have more self-explanatory query functions on top of that:
>
>         mem_cgroup_below_min()
>         mem_cgroup_below_low()
>         mem_cgroup_protection()
>
> What do you guys think?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e0f502b5fca6..dbd3f75d39b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2615,14 +2615,15 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>                 unsigned long reclaimed;
>                 unsigned long scanned;
>
> -               switch (mem_cgroup_protected(target_memcg, memcg)) {
> -               case MEMCG_PROT_MIN:
> +               mem_cgroup_calculate_protection(target_memcg, memcg);
> +
> +               if (mem_cgroup_below_min(memcg)) {
>                         /*
>                          * Hard protection.
>                          * If there is no reclaimable memory, OOM.
>                          */
>                         continue;
> -               case MEMCG_PROT_LOW:
> +               } else if (mem_cgroup_below_low(memcg)) {
>                         /*
>                          * Soft protection.
>                          * Respect the protection only as long as
> @@ -2634,16 +2635,6 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>                                 continue;
>                         }
>                         memcg_memory_event(memcg, MEMCG_LOW);
> -                       break;
> -               case MEMCG_PROT_NONE:
> -                       /*
> -                        * All protection thresholds breached. We may
> -                        * still choose to vary the scan pressure
> -                        * applied based on by how much the cgroup in
> -                        * question has exceeded its protection
> -                        * thresholds (see get_scan_count).
> -                        */
> -                       break;
>                 }
>
>                 reclaimed = sc->nr_reclaimed;


After my revist of the memcg protection. I have another idea.

The emin and elow is not decided by the memcg(struct mem_cgroup), but
they are really decided by the reclaim context(struct srhink_control).
So they should not be bound into struct  mem_cgroup, while they are
really should be bound into struct srhink_control.
IOW, we should move emin and elow from struct mem_cgroup into struct
srhink_control.
And they two members in shrink_control will be updated when a new
memcg is to be shrinked.
I haven't thought it deeply, but I think this should be the right thing to do.

Thanks
Yafang

Michal Hocko April 24, 2020, 4:21 p.m. UTC | #20

On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
> On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> > On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > > This patch is an improvement of a previous version[1], as the previous
> > > > version is not easy to understand.
> > > > This issue persists in the newest kernel, I have to resend the fix. As
> > > > the implementation is changed, I drop Roman's ack from the previous
> > > > version.
> > > 
> > > Now that I understand the problem, I much prefer the previous version.
> > > 
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 745697906ce3..2bf91ae1e640 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > >  
> > >  	if (!root)
> > >  		root = root_mem_cgroup;
> > > -	if (memcg == root)
> > > +	if (memcg == root) {
> > > +		/*
> > > +		 * The cgroup is the reclaim root in this reclaim
> > > +		 * cycle, and therefore not protected. But it may have
> > > +		 * stale effective protection values from previous
> > > +		 * cycles in which it was not the reclaim root - for
> > > +		 * example, global reclaim followed by limit reclaim.
> > > +		 * Reset these values for mem_cgroup_protection().
> > > +		 */
> > > +		memcg->memory.emin = 0;
> > > +		memcg->memory.elow = 0;
> > >  		return MEMCG_PROT_NONE;
> > > +	}
> > 
> > Could you be more specific why you prefer this over the
> > mem_cgroup_protection which doesn't change the effective value?
> > Isn't it easier to simply ignore effective value for the reclaim roots?
> 
> Because now both mem_cgroup_protection() and mem_cgroup_protected()
> have to know about the reclaim root semantics, instead of just the one
> central place.

Yes this is true but it is also potentially overwriting the state with
a parallel reclaim which can lead to surprising results beacause
parent's effective protection is used to define protection distribution
for children. Let's have global and A's reclaim in parallel:
 |
 A (low=2G, usage = 3G, max = 3G, children_low_usage = 1.5G)
 |\
 | C (low = 1G, usage = 2.5G)
 B (low = 1G, usage = 0.5G)

for A reclaim we have
B.elow = B.low
C.elow = C.low

For the global reclaim
A.elow = A.low
B.elow = min(B.usage, B.low) because children_low_usage <= A.elow
C.elow = min(C.usage, C.low)

With the effective values reseting we have A reclaim
A.elow = 0
B.elow = B.low
C.elow = C.low
[...]

and global reclaim could see the above and then
B.elow = C.elow = 0 because children_low_usage > A.elow

> And the query function has to know additional rules about when the
> emin/elow values are uptodate or it could silently be looking at stale
> data, which isn't very robust.
> 
> "The effective protection values are uptodate after calling
> mem_cgroup_protected() inside the reclaim cycle - UNLESS the group
> you're looking at happens to be..."
> 
> It's much easier to make the rule: The values are uptodate after you
> called mem_cgroup_protected().
> 
> Or mem_cgroup_calculate_protection(), if we go with that later.
> 
> > > As others have noted, it's fairly hard to understand the problem from
> > > the above changelog. How about the following:
> > > 
> > > A cgroup can have both memory protection and a memory limit to isolate
> > > it from its siblings in both directions - for example, to prevent it
> > > from being shrunk below 2G under high pressure from outside, but also
> > > from growing beyond 4G under low pressure.
> > > 
> > > 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> > > implemented proportional scan pressure so that multiple siblings in
> > > excess of their protection settings don't get reclaimed equally but
> > > instead in accordance to their unprotected portion.
> > > 
> > > During limit reclaim, this proportionality shouldn't apply of course:
> > > there is no competition, all pressure is from within the cgroup and
> > > should be applied as such. Reclaim should operate at full efficiency.
> > > 
> > > However, mem_cgroup_protected() never expected anybody to look at the
> > > effective protection values when it indicated that the cgroup is above
> > > its protection. As a result, a query during limit reclaim may return
> > > stale protection values that were calculated by a previous reclaim
> > > cycle in which the cgroup did have siblings.
> > 
> > This is better. Thanks!
> > 
> > > When this happens, reclaim is unnecessarily hesitant and potentially
> > > slow to meet the desired limit. In theory this could lead to premature
> > > OOM kills, although it's not obvious this has occurred in practice.
> > 
> > I do not see how this would lead all the way to OOM killer but it
> > certainly can lead to unnecessary increase of the reclaim priority. The
> > smaller the difference between the reclaim target and protection the
> > more visible the effect would be. But if there are reclaimable pages
> > then the reclaim should see them sooner or later
> 
> It would be a pretty extreme case, but not impossible AFAICS, because
> OOM is just a sampled state, not deterministic.
> 
> If memory.max is 64G and memory.low is 64G minus one page, this bug
> could cause limit reclaim to look at no more than SWAP_CLUSTER_MAX
> pages at priority 0. It's possible it wouldn't get through the full
> 64G worth of memory before giving up and declaring OOM.

Yes, my bad I didn't really realize that there won't be a full scan even
under priority 0.

Roman Gushchin April 24, 2020, 4:21 p.m. UTC | #21

On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > This patch is an improvement of a previous version[1], as the previous
> > > version is not easy to understand.
> > > This issue persists in the newest kernel, I have to resend the fix. As
> > > the implementation is changed, I drop Roman's ack from the previous
> > > version.
> > 
> > Now that I understand the problem, I much prefer the previous version.
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 745697906ce3..2bf91ae1e640 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> >  
> >  	if (!root)
> >  		root = root_mem_cgroup;
> > -	if (memcg == root)
> > +	if (memcg == root) {
> > +		/*
> > +		 * The cgroup is the reclaim root in this reclaim
> > +		 * cycle, and therefore not protected. But it may have
> > +		 * stale effective protection values from previous
> > +		 * cycles in which it was not the reclaim root - for
> > +		 * example, global reclaim followed by limit reclaim.
> > +		 * Reset these values for mem_cgroup_protection().
> > +		 */
> > +		memcg->memory.emin = 0;
> > +		memcg->memory.elow = 0;
> >  		return MEMCG_PROT_NONE;
> > +	}
> 
> Could you be more specific why you prefer this over the
> mem_cgroup_protection which doesn't change the effective value?
> Isn't it easier to simply ignore effective value for the reclaim roots?

Hm, I think I like the new version better, because it feels "safer" in terms
of preserving sane effective protection values for concurrent reclaimers.

> 
> [...]
> 
> > As others have noted, it's fairly hard to understand the problem from
> > the above changelog. How about the following:
> > 
> > A cgroup can have both memory protection and a memory limit to isolate
> > it from its siblings in both directions - for example, to prevent it
> > from being shrunk below 2G under high pressure from outside, but also
> > from growing beyond 4G under low pressure.
> > 
> > 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> > implemented proportional scan pressure so that multiple siblings in
> > excess of their protection settings don't get reclaimed equally but
> > instead in accordance to their unprotected portion.
> > 
> > During limit reclaim, this proportionality shouldn't apply of course:
> > there is no competition, all pressure is from within the cgroup and
> > should be applied as such. Reclaim should operate at full efficiency.
> > 
> > However, mem_cgroup_protected() never expected anybody to look at the
> > effective protection values when it indicated that the cgroup is above
> > its protection. As a result, a query during limit reclaim may return
> > stale protection values that were calculated by a previous reclaim
> > cycle in which the cgroup did have siblings.
> 
> This is better. Thanks!

+1

and I like the proposed renaming/cleanup. Thanks, Johannes!

> 
> > When this happens, reclaim is unnecessarily hesitant and potentially
> > slow to meet the desired limit. In theory this could lead to premature
> > OOM kills, although it's not obvious this has occurred in practice.
> 
> I do not see how this would lead all the way to OOM killer but it
> certainly can lead to unnecessary increase of the reclaim priority. The
> smaller the difference between the reclaim target and protection the
> more visible the effect would be. But if there are reclaimable pages
> then the reclaim should see them sooner or later

I guess if all memory is protected by emin and the targeted reclaim
will be unable to reclaim anything, OOM can be triggered.

Btw, I wonder if this case can be covered by a new memcg kselftest?
I'm not sure it can be easily reproducible, but if it can, it would be
the best demonstration of a problem and the fix.
Yafang, don't you want to try?

Thanks!

Yafang Shao April 24, 2020, 4:30 p.m. UTC | #22

On Sat, Apr 25, 2020 at 12:22 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> > On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > > This patch is an improvement of a previous version[1], as the previous
> > > > version is not easy to understand.
> > > > This issue persists in the newest kernel, I have to resend the fix. As
> > > > the implementation is changed, I drop Roman's ack from the previous
> > > > version.
> > >
> > > Now that I understand the problem, I much prefer the previous version.
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 745697906ce3..2bf91ae1e640 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > >
> > >     if (!root)
> > >             root = root_mem_cgroup;
> > > -   if (memcg == root)
> > > +   if (memcg == root) {
> > > +           /*
> > > +            * The cgroup is the reclaim root in this reclaim
> > > +            * cycle, and therefore not protected. But it may have
> > > +            * stale effective protection values from previous
> > > +            * cycles in which it was not the reclaim root - for
> > > +            * example, global reclaim followed by limit reclaim.
> > > +            * Reset these values for mem_cgroup_protection().
> > > +            */
> > > +           memcg->memory.emin = 0;
> > > +           memcg->memory.elow = 0;
> > >             return MEMCG_PROT_NONE;
> > > +   }
> >
> > Could you be more specific why you prefer this over the
> > mem_cgroup_protection which doesn't change the effective value?
> > Isn't it easier to simply ignore effective value for the reclaim roots?
>
> Hm, I think I like the new version better, because it feels "safer" in terms
> of preserving sane effective protection values for concurrent reclaimers.
>
> >
> > [...]
> >
> > > As others have noted, it's fairly hard to understand the problem from
> > > the above changelog. How about the following:
> > >
> > > A cgroup can have both memory protection and a memory limit to isolate
> > > it from its siblings in both directions - for example, to prevent it
> > > from being shrunk below 2G under high pressure from outside, but also
> > > from growing beyond 4G under low pressure.
> > >
> > > 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim")
> > > implemented proportional scan pressure so that multiple siblings in
> > > excess of their protection settings don't get reclaimed equally but
> > > instead in accordance to their unprotected portion.
> > >
> > > During limit reclaim, this proportionality shouldn't apply of course:
> > > there is no competition, all pressure is from within the cgroup and
> > > should be applied as such. Reclaim should operate at full efficiency.
> > >
> > > However, mem_cgroup_protected() never expected anybody to look at the
> > > effective protection values when it indicated that the cgroup is above
> > > its protection. As a result, a query during limit reclaim may return
> > > stale protection values that were calculated by a previous reclaim
> > > cycle in which the cgroup did have siblings.
> >
> > This is better. Thanks!
>
> +1
>
> and I like the proposed renaming/cleanup. Thanks, Johannes!
>
> >
> > > When this happens, reclaim is unnecessarily hesitant and potentially
> > > slow to meet the desired limit. In theory this could lead to premature
> > > OOM kills, although it's not obvious this has occurred in practice.
> >
> > I do not see how this would lead all the way to OOM killer but it
> > certainly can lead to unnecessary increase of the reclaim priority. The
> > smaller the difference between the reclaim target and protection the
> > more visible the effect would be. But if there are reclaimable pages
> > then the reclaim should see them sooner or later
>
> I guess if all memory is protected by emin and the targeted reclaim
> will be unable to reclaim anything, OOM can be triggered.
>
> Btw, I wonder if this case can be covered by a new memcg kselftest?
> I'm not sure it can be easily reproducible, but if it can, it would be
> the best demonstration of a problem and the fix.
> Yafang, don't you want to try?

I have tried to produce the premature OOM before I send this fix, but
I find that it is really not easy to produce.
But if a new memcg kselftest is needed, I can try it again.

Johannes Weiner April 24, 2020, 4:51 p.m. UTC | #23

On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
> On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
> > On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> > > On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > > > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > > > This patch is an improvement of a previous version[1], as the previous
> > > > > version is not easy to understand.
> > > > > This issue persists in the newest kernel, I have to resend the fix. As
> > > > > the implementation is changed, I drop Roman's ack from the previous
> > > > > version.
> > > > 
> > > > Now that I understand the problem, I much prefer the previous version.
> > > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 745697906ce3..2bf91ae1e640 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > > >  
> > > >  	if (!root)
> > > >  		root = root_mem_cgroup;
> > > > -	if (memcg == root)
> > > > +	if (memcg == root) {
> > > > +		/*
> > > > +		 * The cgroup is the reclaim root in this reclaim
> > > > +		 * cycle, and therefore not protected. But it may have
> > > > +		 * stale effective protection values from previous
> > > > +		 * cycles in which it was not the reclaim root - for
> > > > +		 * example, global reclaim followed by limit reclaim.
> > > > +		 * Reset these values for mem_cgroup_protection().
> > > > +		 */
> > > > +		memcg->memory.emin = 0;
> > > > +		memcg->memory.elow = 0;
> > > >  		return MEMCG_PROT_NONE;
> > > > +	}
> > > 
> > > Could you be more specific why you prefer this over the
> > > mem_cgroup_protection which doesn't change the effective value?
> > > Isn't it easier to simply ignore effective value for the reclaim roots?
> > 
> > Because now both mem_cgroup_protection() and mem_cgroup_protected()
> > have to know about the reclaim root semantics, instead of just the one
> > central place.
> 
> Yes this is true but it is also potentially overwriting the state with
> a parallel reclaim which can lead to surprising results

Checking in mem_cgroup_protection() doesn't avoid the fundamental race:

  root
     `- A (low=2G, elow=2G, max=3G)
        `- A1 (low=2G, elow=2G)

If A does limit reclaim while global reclaim races, the memcg == root
check in mem_cgroup_protection() will reliably calculate the "right"
scan value for A, which has no pages, and the wrong scan value for A1
where the memory actually is.

I'm okay with fixing the case where a really old left-over value is
used by target reclaim.

I don't see a point in special casing this one instance of a
fundamental race condition at the expense of less robust code.

Michal Hocko April 27, 2020, 8:25 a.m. UTC | #24

On Fri 24-04-20 12:51:03, Johannes Weiner wrote:
> On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
> > On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
> > > On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> > > > On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > > > > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > > > > This patch is an improvement of a previous version[1], as the previous
> > > > > > version is not easy to understand.
> > > > > > This issue persists in the newest kernel, I have to resend the fix. As
> > > > > > the implementation is changed, I drop Roman's ack from the previous
> > > > > > version.
> > > > > 
> > > > > Now that I understand the problem, I much prefer the previous version.
> > > > > 
> > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > index 745697906ce3..2bf91ae1e640 100644
> > > > > --- a/mm/memcontrol.c
> > > > > +++ b/mm/memcontrol.c
> > > > > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > > > >  
> > > > >  	if (!root)
> > > > >  		root = root_mem_cgroup;
> > > > > -	if (memcg == root)
> > > > > +	if (memcg == root) {
> > > > > +		/*
> > > > > +		 * The cgroup is the reclaim root in this reclaim
> > > > > +		 * cycle, and therefore not protected. But it may have
> > > > > +		 * stale effective protection values from previous
> > > > > +		 * cycles in which it was not the reclaim root - for
> > > > > +		 * example, global reclaim followed by limit reclaim.
> > > > > +		 * Reset these values for mem_cgroup_protection().
> > > > > +		 */
> > > > > +		memcg->memory.emin = 0;
> > > > > +		memcg->memory.elow = 0;
> > > > >  		return MEMCG_PROT_NONE;
> > > > > +	}
> > > > 
> > > > Could you be more specific why you prefer this over the
> > > > mem_cgroup_protection which doesn't change the effective value?
> > > > Isn't it easier to simply ignore effective value for the reclaim roots?
> > > 
> > > Because now both mem_cgroup_protection() and mem_cgroup_protected()
> > > have to know about the reclaim root semantics, instead of just the one
> > > central place.
> > 
> > Yes this is true but it is also potentially overwriting the state with
> > a parallel reclaim which can lead to surprising results
> 
> Checking in mem_cgroup_protection() doesn't avoid the fundamental race:
> 
>   root
>      `- A (low=2G, elow=2G, max=3G)
>         `- A1 (low=2G, elow=2G)
> 
> If A does limit reclaim while global reclaim races, the memcg == root
> check in mem_cgroup_protection() will reliably calculate the "right"
> scan value for A, which has no pages, and the wrong scan value for A1
> where the memory actually is.

I am sorry but I do not see how A1 would get wrong scan value.
- Global reclaim
  - A.elow = 2G
  - A1.elow = min(A1.low, A1.usage) ; if (A.children_low_usage < A.elow)

- A reclaim.
  - A.elow = stale/undefined
  - A1.elow = A1.low

if mem_cgroup_protection returns 0 for A's reclaim targeting A (assuming
the check is there) then not a big deal as there are no pages there as
you say.

Let's compare the GR (global reclaim), AR (A reclaim).
GR(A1.elow) <= AR(A1.elow) by definition, right? For A1.low
overcommitted we have
min(A1.low, A1.usage) * A.elow / A.children_low_usage <= min(A1.low, A1.usage)
because A.elow <= A.children_low_usage

so in both cases we have GR(A1.elow) <= AR(A1.elow) which means that
racing reclaims will behave sanely because the protection for the
external pressure pressure is not violated. A is going to reclaim A1
less than the global reclaim but that should be OK.

Or what do I miss?

> I'm okay with fixing the case where a really old left-over value is
> used by target reclaim.
> 
> I don't see a point in special casing this one instance of a
> fundamental race condition at the expense of less robust code.

I am definitely not calling to fragment the code. I do agree that having
a special case in mem_cgroup_protection is quite non-intuitive.
The existing code is quite hard to reason about in its current form
as we can see. If we can fix all that in mem_cgroup_protected then no
objections from me at all.

Yafang Shao April 27, 2020, 8:37 a.m. UTC | #25

On Mon, Apr 27, 2020 at 4:25 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 24-04-20 12:51:03, Johannes Weiner wrote:
> > On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
> > > On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
> > > > On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> > > > > On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > > > > > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > > > > > This patch is an improvement of a previous version[1], as the previous
> > > > > > > version is not easy to understand.
> > > > > > > This issue persists in the newest kernel, I have to resend the fix. As
> > > > > > > the implementation is changed, I drop Roman's ack from the previous
> > > > > > > version.
> > > > > >
> > > > > > Now that I understand the problem, I much prefer the previous version.
> > > > > >
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 745697906ce3..2bf91ae1e640 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > > > > >
> > > > > >       if (!root)
> > > > > >               root = root_mem_cgroup;
> > > > > > -     if (memcg == root)
> > > > > > +     if (memcg == root) {
> > > > > > +             /*
> > > > > > +              * The cgroup is the reclaim root in this reclaim
> > > > > > +              * cycle, and therefore not protected. But it may have
> > > > > > +              * stale effective protection values from previous
> > > > > > +              * cycles in which it was not the reclaim root - for
> > > > > > +              * example, global reclaim followed by limit reclaim.
> > > > > > +              * Reset these values for mem_cgroup_protection().
> > > > > > +              */
> > > > > > +             memcg->memory.emin = 0;
> > > > > > +             memcg->memory.elow = 0;
> > > > > >               return MEMCG_PROT_NONE;
> > > > > > +     }
> > > > >
> > > > > Could you be more specific why you prefer this over the
> > > > > mem_cgroup_protection which doesn't change the effective value?
> > > > > Isn't it easier to simply ignore effective value for the reclaim roots?
> > > >
> > > > Because now both mem_cgroup_protection() and mem_cgroup_protected()
> > > > have to know about the reclaim root semantics, instead of just the one
> > > > central place.
> > >
> > > Yes this is true but it is also potentially overwriting the state with
> > > a parallel reclaim which can lead to surprising results
> >
> > Checking in mem_cgroup_protection() doesn't avoid the fundamental race:
> >
> >   root
> >      `- A (low=2G, elow=2G, max=3G)
> >         `- A1 (low=2G, elow=2G)
> >
> > If A does limit reclaim while global reclaim races, the memcg == root
> > check in mem_cgroup_protection() will reliably calculate the "right"
> > scan value for A, which has no pages, and the wrong scan value for A1
> > where the memory actually is.
>
> I am sorry but I do not see how A1 would get wrong scan value.
> - Global reclaim
>   - A.elow = 2G
>   - A1.elow = min(A1.low, A1.usage) ; if (A.children_low_usage < A.elow)
>
> - A reclaim.
>   - A.elow = stale/undefined
>   - A1.elow = A1.low
>
> if mem_cgroup_protection returns 0 for A's reclaim targeting A (assuming
> the check is there) then not a big deal as there are no pages there as
> you say.
>
> Let's compare the GR (global reclaim), AR (A reclaim).
> GR(A1.elow) <= AR(A1.elow) by definition, right? For A1.low
> overcommitted we have
> min(A1.low, A1.usage) * A.elow / A.children_low_usage <= min(A1.low, A1.usage)
> because A.elow <= A.children_low_usage
>
> so in both cases we have GR(A1.elow) <= AR(A1.elow) which means that
> racing reclaims will behave sanely because the protection for the
> external pressure pressure is not violated. A is going to reclaim A1
> less than the global reclaim but that should be OK.
>
> Or what do I miss?
>
> > I'm okay with fixing the case where a really old left-over value is
> > used by target reclaim.
> >
> > I don't see a point in special casing this one instance of a
> > fundamental race condition at the expense of less robust code.
>
> I am definitely not calling to fragment the code. I do agree that having
> a special case in mem_cgroup_protection is quite non-intuitive.
> The existing code is quite hard to reason about in its current form
> as we can see. If we can fix all that in mem_cgroup_protected then no
> objections from me at all.

Hi Michal,

Pls. help take a look at my refactor on proportional memcg protection[1].
I think my new patchset can fix all that in mem_cgroup_protected and
make the existing code clear.


[1]. https://lore.kernel.org/linux-mm/20200425152418.28388-1-laoar.shao@gmail.com/T/#t

Johannes Weiner April 27, 2020, 4:52 p.m. UTC | #26

On Mon, Apr 27, 2020 at 10:25:24AM +0200, Michal Hocko wrote:
> On Fri 24-04-20 12:51:03, Johannes Weiner wrote:
> > On Fri, Apr 24, 2020 at 06:21:03PM +0200, Michal Hocko wrote:
> > > On Fri 24-04-20 11:10:13, Johannes Weiner wrote:
> > > > On Fri, Apr 24, 2020 at 04:29:58PM +0200, Michal Hocko wrote:
> > > > > On Fri 24-04-20 09:14:50, Johannes Weiner wrote:
> > > > > > On Thu, Apr 23, 2020 at 02:16:29AM -0400, Yafang Shao wrote:
> > > > > > > This patch is an improvement of a previous version[1], as the previous
> > > > > > > version is not easy to understand.
> > > > > > > This issue persists in the newest kernel, I have to resend the fix. As
> > > > > > > the implementation is changed, I drop Roman's ack from the previous
> > > > > > > version.
> > > > > > 
> > > > > > Now that I understand the problem, I much prefer the previous version.
> > > > > > 
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 745697906ce3..2bf91ae1e640 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -6332,8 +6332,19 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > > > > >  
> > > > > >  	if (!root)
> > > > > >  		root = root_mem_cgroup;
> > > > > > -	if (memcg == root)
> > > > > > +	if (memcg == root) {
> > > > > > +		/*
> > > > > > +		 * The cgroup is the reclaim root in this reclaim
> > > > > > +		 * cycle, and therefore not protected. But it may have
> > > > > > +		 * stale effective protection values from previous
> > > > > > +		 * cycles in which it was not the reclaim root - for
> > > > > > +		 * example, global reclaim followed by limit reclaim.
> > > > > > +		 * Reset these values for mem_cgroup_protection().
> > > > > > +		 */
> > > > > > +		memcg->memory.emin = 0;
> > > > > > +		memcg->memory.elow = 0;
> > > > > >  		return MEMCG_PROT_NONE;
> > > > > > +	}
> > > > > 
> > > > > Could you be more specific why you prefer this over the
> > > > > mem_cgroup_protection which doesn't change the effective value?
> > > > > Isn't it easier to simply ignore effective value for the reclaim roots?
> > > > 
> > > > Because now both mem_cgroup_protection() and mem_cgroup_protected()
> > > > have to know about the reclaim root semantics, instead of just the one
> > > > central place.
> > > 
> > > Yes this is true but it is also potentially overwriting the state with
> > > a parallel reclaim which can lead to surprising results
> > 
> > Checking in mem_cgroup_protection() doesn't avoid the fundamental race:
> > 
> >   root
> >      `- A (low=2G, elow=2G, max=3G)
> >         `- A1 (low=2G, elow=2G)
> > 
> > If A does limit reclaim while global reclaim races, the memcg == root
> > check in mem_cgroup_protection() will reliably calculate the "right"
> > scan value for A, which has no pages, and the wrong scan value for A1
> > where the memory actually is.
> 
> I am sorry but I do not see how A1 would get wrong scan value.

I mistyped the example. If we're in limit reclaim in A, elow should
look like this:

  root
     `- A (low=2G, max=3G -> elow=0)
        `- A1 (low=0G -> elow=0)

But if global reclaim were to kick in, it could overwrite elow to
this:

  root
     `- A (low=2G, max=3G -> elow=2G)
        `- A1 (low=0G -> elow=2G)

(This is with the recursive memory.low semantics, of course.)

> > I'm okay with fixing the case where a really old left-over value is
> > used by target reclaim.
> > 
> > I don't see a point in special casing this one instance of a
> > fundamental race condition at the expense of less robust code.
> 
> I am definitely not calling to fragment the code. I do agree that having
> a special case in mem_cgroup_protection is quite non-intuitive.
> The existing code is quite hard to reason about in its current form
> as we can see. If we can fix all that in mem_cgroup_protected then no
> objections from me at all.

Agreed, sounds reasonable.

mm, memcg: fix wrong mem cgroup protection

Commit Message

Comments

Patch