mm: skip current when memcg reclaim

Message ID	1634278529-16983-1-git-send-email-huangzhaoyang@gmail.com (mailing list archive)
State	New
Headers	show Return-Path: <SRS0=eTfr=PD=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 0559D61090 From: Huangzhaoyang <huangzhaoyang@gmail.com> To: Andrew Morton <akpm@linux-foundation.org>, Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Vladimir Davydov <vdavydov.dev@gmail.com>, Zhaoyang Huang <zhaoyang.huang@unisoc.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH] mm: skip current when memcg reclaim Date: Fri, 15 Oct 2021 14:15:29 +0800 Message-Id: <1634278529-16983-1-git-send-email-huangzhaoyang@gmail.com> Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: skip current when memcg reclaim \| expand mm: skip current when memcg reclaim

Zhaoyang Huang Oct. 15, 2021, 6:15 a.m. UTC

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Sibling thread of the same process could refault the reclaimed pages
in the same time, which would be typical in None global reclaim and
introduce thrashing.
---
 mm/vmscan.c | 5 +++++
 1 file changed, 5 insertions(+)

Andrew Morton Oct. 15, 2021, 8 p.m. UTC | #1

On Fri, 15 Oct 2021 14:15:29 +0800 Huangzhaoyang <huangzhaoyang@gmail.com> wrote:

> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> Sibling thread of the same process could refault the reclaimed pages
> in the same time, which would be typical in None global reclaim and
> introduce thrashing.

"None" -> "node", I assume?

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2841,6 +2841,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  				sc->memcg_low_skipped = 1;
>  				continue;
>  			}
> +			/*
> +			 * Don't bother current when its memcg is below low
> +			 */

The comment explains what the code is doing, but the code itself
already does this.  Please can we have a comment that explains *why*
the code is doing this?


> +			if (get_mem_cgroup_from_mm(current->mm) == memcg)
> +				continue;
>  			memcg_memory_event(memcg, MEMCG_LOW);
>  		}

Zhaoyang Huang Oct. 16, 2021, 2:28 a.m. UTC | #2

On Sat, Oct 16, 2021 at 4:00 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 15 Oct 2021 14:15:29 +0800 Huangzhaoyang <huangzhaoyang@gmail.com> wrote:
>
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > Sibling thread of the same process could refault the reclaimed pages
> > in the same time, which would be typical in None global reclaim and
> > introduce thrashing.
>
> "None" -> "node", I assume?
>
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2841,6 +2841,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >                               sc->memcg_low_skipped = 1;
> >                               continue;
> >                       }
> > +                     /*
> > +                      * Don't bother current when its memcg is below low
> > +                      */
>
> The comment explains what the code is doing, but the code itself
> already does this.  Please can we have a comment that explains *why*
> the code is doing this?
We find that the patch help direct reclaiming bail out early and
eliminate page thrashing for some scenarios(etc APP start on android).
The case could be worse if each APP possess a unique memcg(pages on
current's lru are reclaimed more than global reclaim)
>
>
> > +                     if (get_mem_cgroup_from_mm(current->mm) == memcg)
> > +                             continue;
> >                       memcg_memory_event(memcg, MEMCG_LOW);
> >               }
>

Andrew Morton Oct. 16, 2021, 2:58 a.m. UTC | #3

On Sat, 16 Oct 2021 10:28:54 +0800 Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:

> On Sat, Oct 16, 2021 at 4:00 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 15 Oct 2021 14:15:29 +0800 Huangzhaoyang <huangzhaoyang@gmail.com> wrote:
> >
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > Sibling thread of the same process could refault the reclaimed pages
> > > in the same time, which would be typical in None global reclaim and
> > > introduce thrashing.
> >
> > "None" -> "node", I assume?
> >
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2841,6 +2841,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > >                               sc->memcg_low_skipped = 1;
> > >                               continue;
> > >                       }
> > > +                     /*
> > > +                      * Don't bother current when its memcg is below low
> > > +                      */
> >
> > The comment explains what the code is doing, but the code itself
> > already does this.  Please can we have a comment that explains *why*
> > the code is doing this?
> We find that the patch help direct reclaiming bail out early and
> eliminate page thrashing for some scenarios(etc APP start on android).
> The case could be worse if each APP possess a unique memcg(pages on
> current's lru are reclaimed more than global reclaim)

What I meant was: please redo the patch with a comment which explains
"why the code does this", rather than "what the code does".

Also, as this is a performance enhancement, it is preferred to have
some performance testing results in the changelog.

Matthew Wilcox Oct. 16, 2021, 3:05 a.m. UTC | #4

On Fri, Oct 15, 2021 at 01:00:35PM -0700, Andrew Morton wrote:
> On Fri, 15 Oct 2021 14:15:29 +0800 Huangzhaoyang <huangzhaoyang@gmail.com> wrote:
> 
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > 
> > Sibling thread of the same process could refault the reclaimed pages
> > in the same time, which would be typical in None global reclaim and
> > introduce thrashing.
> 
> "None" -> "node", I assume?

I think that's "non-global", ie memcg based.

Zhaoyang Huang Oct. 16, 2021, 8:17 a.m. UTC | #5

On Sat, Oct 16, 2021 at 11:06 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Oct 15, 2021 at 01:00:35PM -0700, Andrew Morton wrote:
> > On Fri, 15 Oct 2021 14:15:29 +0800 Huangzhaoyang <huangzhaoyang@gmail.com> wrote:
> >
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > Sibling thread of the same process could refault the reclaimed pages
> > > in the same time, which would be typical in None global reclaim and
> > > introduce thrashing.
> >
> > "None" -> "node", I assume?
>
> I think that's "non-global", ie memcg based.
It is exactly what I mean. memcg based reclaim could be worse for page
thrashing.

Michal Hocko Oct. 18, 2021, 8:23 a.m. UTC | #6

On Fri 15-10-21 14:15:29, Huangzhaoyang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> Sibling thread of the same process could refault the reclaimed pages
> in the same time, which would be typical in None global reclaim and
> introduce thrashing.

It is hard to understand what kind of problem you see (ideally along
with some numbers) and how the proposed patch addresses that problem

Also you are missing Signed-off-by tag (please have a look at
Documentation/process/submitting-patches.rst which is much more
comprehensive about the process).

> ---
>  mm/vmscan.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5199b96..ebbdc37 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2841,6 +2841,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  				sc->memcg_low_skipped = 1;
>  				continue;
>  			}
> +			/*
> +			 * Don't bother current when its memcg is below low
> +			 */
> +			if (get_mem_cgroup_from_mm(current->mm) == memcg)
> +				continue;

This code is executed when none of memcg in the reclaimed hierarchy
could be reclaimed. Low limit is then ignored and this change is
tweaking that behavior without any description of the effect. A very
vague note about trashing would indicate that you have something like
the following

	A (hiting hard limit)
       / \
      B   C

Both B and C low limit protected and current task associated with B. As
none of the two could be reclaimed due to soft protection yuu prefer to
reclaim from C as you do not want to reclaim from the current process as
that could reclaim current's working set. Correct?

I would be really curious about more specifics of the used hierarchy.

Thanks!

>  			memcg_memory_event(memcg, MEMCG_LOW);
>  		}
>  
> -- 
> 1.9.1

Zhaoyang Huang Oct. 18, 2021, 9:25 a.m. UTC | #7

On Mon, Oct 18, 2021 at 4:23 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 15-10-21 14:15:29, Huangzhaoyang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > Sibling thread of the same process could refault the reclaimed pages
> > in the same time, which would be typical in None global reclaim and
> > introduce thrashing.
>
> It is hard to understand what kind of problem you see (ideally along
> with some numbers) and how the proposed patch addresses that problem
>
> Also you are missing Signed-off-by tag (please have a look at
> Documentation/process/submitting-patches.rst which is much more
> comprehensive about the process).
sorry for that, I will fix it.
>
> > ---
> >  mm/vmscan.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 5199b96..ebbdc37 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2841,6 +2841,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >                               sc->memcg_low_skipped = 1;
> >                               continue;
> >                       }
> > +                     /*
> > +                      * Don't bother current when its memcg is below low
> > +                      */
> > +                     if (get_mem_cgroup_from_mm(current->mm) == memcg)
> > +                             continue;
>
> This code is executed when none of memcg in the reclaimed hierarchy
> could be reclaimed. Low limit is then ignored and this change is
> tweaking that behavior without any description of the effect. A very
> vague note about trashing would indicate that you have something like
> the following
>
>         A (hiting hard limit)
>        / \
>       B   C
>
> Both B and C low limit protected and current task associated with B. As
> none of the two could be reclaimed due to soft protection yuu prefer to
> reclaim from C as you do not want to reclaim from the current process as
> that could reclaim current's working set. Correct?
>
> I would be really curious about more specifics of the used hierarchy.
What I am facing is a typical scenario on Android, that is a big
memory consuming APP(camera etc) launched while background filled by
other processes. The hierarchy is like what you describe above where B
represents the APP and memory.low is set to help warm restart. Both of
kswapd and direct reclaim work together to reclaim pages under this
scenario, which can cause 20MB file page delete from LRU in several
second. This change could help to have current process's page escape
from being reclaimed and cause page thrashing. We observed the result
via systrace which shows that the Uninterruptible sleep(block on page
bit) and iowait get smaller than usual.
>
> Thanks!
>
> >                       memcg_memory_event(memcg, MEMCG_LOW);
> >               }
> >
> > --
> > 1.9.1
>
> --
> Michal Hocko
> SUSE Labs

Michal Hocko Oct. 18, 2021, 12:41 p.m. UTC | #8

On Mon 18-10-21 17:25:23, Zhaoyang Huang wrote:
> On Mon, Oct 18, 2021 at 4:23 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 15-10-21 14:15:29, Huangzhaoyang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > Sibling thread of the same process could refault the reclaimed pages
> > > in the same time, which would be typical in None global reclaim and
> > > introduce thrashing.
> >
> > It is hard to understand what kind of problem you see (ideally along
> > with some numbers) and how the proposed patch addresses that problem
> >
> > Also you are missing Signed-off-by tag (please have a look at
> > Documentation/process/submitting-patches.rst which is much more
> > comprehensive about the process).
> sorry for that, I will fix it.
> >
> > > ---
> > >  mm/vmscan.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 5199b96..ebbdc37 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2841,6 +2841,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > >                               sc->memcg_low_skipped = 1;
> > >                               continue;
> > >                       }
> > > +                     /*
> > > +                      * Don't bother current when its memcg is below low
> > > +                      */
> > > +                     if (get_mem_cgroup_from_mm(current->mm) == memcg)
> > > +                             continue;
> >
> > This code is executed when none of memcg in the reclaimed hierarchy
> > could be reclaimed. Low limit is then ignored and this change is
> > tweaking that behavior without any description of the effect. A very
> > vague note about trashing would indicate that you have something like
> > the following
> >
> >         A (hiting hard limit)
> >        / \
> >       B   C
> >
> > Both B and C low limit protected and current task associated with B. As
> > none of the two could be reclaimed due to soft protection yuu prefer to
> > reclaim from C as you do not want to reclaim from the current process as
> > that could reclaim current's working set. Correct?
> >
> > I would be really curious about more specifics of the used hierarchy.
> What I am facing is a typical scenario on Android, that is a big
> memory consuming APP(camera etc) launched while background filled by
> other processes. The hierarchy is like what you describe above where B
> represents the APP and memory.low is set to help warm restart. Both of
> kswapd and direct reclaim work together to reclaim pages under this
> scenario, which can cause 20MB file page delete from LRU in several
> second. This change could help to have current process's page escape
> from being reclaimed and cause page thrashing. We observed the result
> via systrace which shows that the Uninterruptible sleep(block on page
> bit) and iowait get smaller than usual.

I still have hard time to understand the exact setup and why the patch
helps you. If you want to protect B more than the low limit would allow
for by stealiong from C then the same thing can happen from anybody
reclaiming from C so in the end there is no protection. The same would
apply for any global direct memory reclaim done by a 3rd party. So I
suspect that your patch just happens to work by a luck.

Why both B and C have low limit setup and they both cannot be reclaimed?
Isn't that a weird setup where A hard limit is too close to sum of low
limits of B and C?

In other words could you share a more detailed configuration you are
using and some more details why both B and C have been skipped during
the first pass of the reclaim?

Zhaoyang Huang Oct. 19, 2021, 7:11 a.m. UTC | #9

On Mon, Oct 18, 2021 at 8:41 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 18-10-21 17:25:23, Zhaoyang Huang wrote:
> > On Mon, Oct 18, 2021 at 4:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 15-10-21 14:15:29, Huangzhaoyang wrote:
> > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > Sibling thread of the same process could refault the reclaimed pages
> > > > in the same time, which would be typical in None global reclaim and
> > > > introduce thrashing.
> > >
> > > It is hard to understand what kind of problem you see (ideally along
> > > with some numbers) and how the proposed patch addresses that problem
> > >
> > > Also you are missing Signed-off-by tag (please have a look at
> > > Documentation/process/submitting-patches.rst which is much more
> > > comprehensive about the process).
> > sorry for that, I will fix it.
> > >
> > > > ---
> > > >  mm/vmscan.c | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 5199b96..ebbdc37 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2841,6 +2841,11 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> > > >                               sc->memcg_low_skipped = 1;
> > > >                               continue;
> > > >                       }
> > > > +                     /*
> > > > +                      * Don't bother current when its memcg is below low
> > > > +                      */
> > > > +                     if (get_mem_cgroup_from_mm(current->mm) == memcg)
> > > > +                             continue;
> > >
> > > This code is executed when none of memcg in the reclaimed hierarchy
> > > could be reclaimed. Low limit is then ignored and this change is
> > > tweaking that behavior without any description of the effect. A very
> > > vague note about trashing would indicate that you have something like
> > > the following
> > >
> > >         A (hiting hard limit)
> > >        / \
> > >       B   C
> > >
> > > Both B and C low limit protected and current task associated with B. As
> > > none of the two could be reclaimed due to soft protection yuu prefer to
> > > reclaim from C as you do not want to reclaim from the current process as
> > > that could reclaim current's working set. Correct?
> > >
> > > I would be really curious about more specifics of the used hierarchy.
> > What I am facing is a typical scenario on Android, that is a big
> > memory consuming APP(camera etc) launched while background filled by
> > other processes. The hierarchy is like what you describe above where B
> > represents the APP and memory.low is set to help warm restart. Both of
> > kswapd and direct reclaim work together to reclaim pages under this
> > scenario, which can cause 20MB file page delete from LRU in several
> > second. This change could help to have current process's page escape
> > from being reclaimed and cause page thrashing. We observed the result
> > via systrace which shows that the Uninterruptible sleep(block on page
> > bit) and iowait get smaller than usual.
>
> I still have hard time to understand the exact setup and why the patch
> helps you. If you want to protect B more than the low limit would allow
> for by stealiong from C then the same thing can happen from anybody
> reclaiming from C so in the end there is no protection. The same would
> apply for any global direct memory reclaim done by a 3rd party. So I
> suspect that your patch just happens to work by a luck.
B and C compete fairly and superior than others. The idea based on
assuming NOT all groups will trap into direct reclaim concurrently, so
we want to have the groups steal pages from the processes under
root(Non-memory sensitive) or other groups with lower thresholds(high
memory tolerance) or the one totally sleeping(not busy for the time
being, borrow some pages).
>
> Why both B and C have low limit setup and they both cannot be reclaimed?
> Isn't that a weird setup where A hard limit is too close to sum of low
> limits of B and C?
>
> In other words could you share a more detailed configuration you are
> using and some more details why both B and C have been skipped during
> the first pass of the reclaim?
My practical scenario is that important processes(vip APP etc) are
placed into protected memcg and keep other processes just under root.
Current introduces direct reclaim because of alloc_pages(DMA_ALLOC
etc), in which the number of allocation would be much larger than low
but would NOT be charged to LRU. Whereas, current also wants to keep
the pages(.so files to exec) on LRU.
>
> --
> Michal Hocko
> SUSE Labs

Michal Hocko Oct. 19, 2021, 9:09 a.m. UTC | #10

On Tue 19-10-21 15:11:30, Zhaoyang Huang wrote:
> On Mon, Oct 18, 2021 at 8:41 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 18-10-21 17:25:23, Zhaoyang Huang wrote:
> > > On Mon, Oct 18, 2021 at 4:23 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > > I would be really curious about more specifics of the used hierarchy.
> > > What I am facing is a typical scenario on Android, that is a big
> > > memory consuming APP(camera etc) launched while background filled by
> > > other processes. The hierarchy is like what you describe above where B
> > > represents the APP and memory.low is set to help warm restart. Both of
> > > kswapd and direct reclaim work together to reclaim pages under this
> > > scenario, which can cause 20MB file page delete from LRU in several
> > > second. This change could help to have current process's page escape
> > > from being reclaimed and cause page thrashing. We observed the result
> > > via systrace which shows that the Uninterruptible sleep(block on page
> > > bit) and iowait get smaller than usual.
> >
> > I still have hard time to understand the exact setup and why the patch
> > helps you. If you want to protect B more than the low limit would allow
> > for by stealiong from C then the same thing can happen from anybody
> > reclaiming from C so in the end there is no protection. The same would
> > apply for any global direct memory reclaim done by a 3rd party. So I
> > suspect that your patch just happens to work by a luck.
> B and C compete fairly and superior than others. The idea based on
> assuming NOT all groups will trap into direct reclaim concurrently, so
> we want to have the groups steal pages from the processes under
> root(Non-memory sensitive) or other groups with lower thresholds(high
> memory tolerance) or the one totally sleeping(not busy for the time
> being, borrow some pages).

I am really confused now. The memcg reclaim cannot really reclaim
anything from outside of the reclaimed hierarchy. Protected memcgs are
only considered if the reclaim was not able to reclaim anything during
the first hierarchy walk. That would imply that the reclaimed hierarchy
has either all memcgs with memory protected or non-protected memcgs do
not have any memory to reclaim.

I think it would really help to provide much details about what is going
on here before we can move forward.

> > Why both B and C have low limit setup and they both cannot be reclaimed?
> > Isn't that a weird setup where A hard limit is too close to sum of low
> > limits of B and C?
> >
> > In other words could you share a more detailed configuration you are
> > using and some more details why both B and C have been skipped during
> > the first pass of the reclaim?
> My practical scenario is that important processes(vip APP etc) are
> placed into protected memcg and keep other processes just under root.
> Current introduces direct reclaim because of alloc_pages(DMA_ALLOC
> etc), in which the number of allocation would be much larger than low
> but would NOT be charged to LRU. Whereas, current also wants to keep
> the pages(.so files to exec) on LRU.

I am sorry but this description makes even less sense to me. If your
important process runs under a protected memcg and everything else is
running under root memcg then your memcg will get protected as long as
there is a reclaimable memory. There should ever be only global memory
reclaim happening, unless you specify a hard/high limit on your
important memcg. If you do so then there is no way to reclaim from
outside of that specific memcg.

I really fail how your patch can help with either of those situations.

Zhaoyang Huang Oct. 19, 2021, 12:17 p.m. UTC | #11

On Tue, Oct 19, 2021 at 5:09 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 19-10-21 15:11:30, Zhaoyang Huang wrote:
> > On Mon, Oct 18, 2021 at 8:41 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 18-10-21 17:25:23, Zhaoyang Huang wrote:
> > > > On Mon, Oct 18, 2021 at 4:23 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > > I would be really curious about more specifics of the used hierarchy.
> > > > What I am facing is a typical scenario on Android, that is a big
> > > > memory consuming APP(camera etc) launched while background filled by
> > > > other processes. The hierarchy is like what you describe above where B
> > > > represents the APP and memory.low is set to help warm restart. Both of
> > > > kswapd and direct reclaim work together to reclaim pages under this
> > > > scenario, which can cause 20MB file page delete from LRU in several
> > > > second. This change could help to have current process's page escape
> > > > from being reclaimed and cause page thrashing. We observed the result
> > > > via systrace which shows that the Uninterruptible sleep(block on page
> > > > bit) and iowait get smaller than usual.
> > >
> > > I still have hard time to understand the exact setup and why the patch
> > > helps you. If you want to protect B more than the low limit would allow
> > > for by stealiong from C then the same thing can happen from anybody
> > > reclaiming from C so in the end there is no protection. The same would
> > > apply for any global direct memory reclaim done by a 3rd party. So I
> > > suspect that your patch just happens to work by a luck.
> > B and C compete fairly and superior than others. The idea based on
> > assuming NOT all groups will trap into direct reclaim concurrently, so
> > we want to have the groups steal pages from the processes under
> > root(Non-memory sensitive) or other groups with lower thresholds(high
> > memory tolerance) or the one totally sleeping(not busy for the time
> > being, borrow some pages).
>
> I am really confused now. The memcg reclaim cannot really reclaim
> anything from outside of the reclaimed hierarchy. Protected memcgs are
> only considered if the reclaim was not able to reclaim anything during
> the first hierarchy walk. That would imply that the reclaimed hierarchy
> has either all memcgs with memory protected or non-protected memcgs do
> not have any memory to reclaim.
>
> I think it would really help to provide much details about what is going
> on here before we can move forward.
>
> > > Why both B and C have low limit setup and they both cannot be reclaimed?
> > > Isn't that a weird setup where A hard limit is too close to sum of low
> > > limits of B and C?
> > >
> > > In other words could you share a more detailed configuration you are
> > > using and some more details why both B and C have been skipped during
> > > the first pass of the reclaim?
> > My practical scenario is that important processes(vip APP etc) are
> > placed into protected memcg and keep other processes just under root.
> > Current introduces direct reclaim because of alloc_pages(DMA_ALLOC
> > etc), in which the number of allocation would be much larger than low
> > but would NOT be charged to LRU. Whereas, current also wants to keep
> > the pages(.so files to exec) on LRU.
>
> I am sorry but this description makes even less sense to me. If your
> important process runs under a protected memcg and everything else is
> running under root memcg then your memcg will get protected as long as
> there is a reclaimable memory. There should ever be only global memory
> reclaim happening, unless you specify a hard/high limit on your
> important memcg. If you do so then there is no way to reclaim from
> outside of that specific memcg.
>
> I really fail how your patch can help with either of those situations.

please find cgv2 hierarchy on my sys[1], where uid_2000 is a cgroup
under root and trace_printk info[3] from trace_printk embedded in
shrink_node[2]. I don't why you say there should be no reclaim from
groups under root which opposite to[3]

[1]
/sys/fs/cgroup # ls uid_2000
cgroup.controllers  cgroup.max.depth        cgroup.stat
cgroup.type   io.pressure     memory.events.local  memory.max
memory.pressure      memory.swap.events
cgroup.events       cgroup.max.descendants  cgroup.subtree_control
cpu.pressure  memory.current  memory.high          memory.min
memory.stat          memory.swap.max
cgroup.freeze       cgroup.procs            cgroup.threads
cpu.stat      memory.events   memory.low           memory.oom.group
memory.swap.current  pid_275

[2]
@@ -2962,6 +2962,7 @@ static bool shrink_node(pg_data_t *pgdat, struct
scan_control *sc)

                        reclaimed = sc->nr_reclaimed;
                        scanned = sc->nr_scanned;
+                     trace_printk("root %x memcg %x reclaimed
%ld\n",root_mem_cgroup,memcg,sc->nr_reclaimed);
                        shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
                        node_lru_pages += lru_pages;

[3]
 allocator@4.0-s-1034    [005] ....   442.077013: shrink_node: root
ef022800 memcg ef027800 reclaimed 41
   kworker/u16:3-931     [002] ....   442.077019: shrink_node: root
ef022800 memcg c7e54000 reclaimed 17
 allocator@4.0-s-1034    [005] ....   442.077019: shrink_node: root
ef022800 memcg ef025000 reclaimed 41
 allocator@4.0-s-1034    [005] ....   442.077024: shrink_node: root
ef022800 memcg ef023000 reclaimed 41
   kworker/u16:3-931     [002] ....   442.077026: shrink_node: root
ef022800 memcg c7e57800 reclaimed 17
 allocator@4.0-s-1034    [005] ....   442.077028: shrink_node: root
ef022800 memcg ef026800 reclaimed 41


> --
> Michal Hocko
> SUSE Labs

Michal Hocko Oct. 19, 2021, 1:23 p.m. UTC | #12

On Tue 19-10-21 20:17:16, Zhaoyang Huang wrote:
> On Tue, Oct 19, 2021 at 5:09 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 19-10-21 15:11:30, Zhaoyang Huang wrote:
> > > On Mon, Oct 18, 2021 at 8:41 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 18-10-21 17:25:23, Zhaoyang Huang wrote:
> > > > > On Mon, Oct 18, 2021 at 4:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > > I would be really curious about more specifics of the used hierarchy.
> > > > > What I am facing is a typical scenario on Android, that is a big
> > > > > memory consuming APP(camera etc) launched while background filled by
> > > > > other processes. The hierarchy is like what you describe above where B
> > > > > represents the APP and memory.low is set to help warm restart. Both of
> > > > > kswapd and direct reclaim work together to reclaim pages under this
> > > > > scenario, which can cause 20MB file page delete from LRU in several
> > > > > second. This change could help to have current process's page escape
> > > > > from being reclaimed and cause page thrashing. We observed the result
> > > > > via systrace which shows that the Uninterruptible sleep(block on page
> > > > > bit) and iowait get smaller than usual.
> > > >
> > > > I still have hard time to understand the exact setup and why the patch
> > > > helps you. If you want to protect B more than the low limit would allow
> > > > for by stealiong from C then the same thing can happen from anybody
> > > > reclaiming from C so in the end there is no protection. The same would
> > > > apply for any global direct memory reclaim done by a 3rd party. So I
> > > > suspect that your patch just happens to work by a luck.
> > > B and C compete fairly and superior than others. The idea based on
> > > assuming NOT all groups will trap into direct reclaim concurrently, so
> > > we want to have the groups steal pages from the processes under
> > > root(Non-memory sensitive) or other groups with lower thresholds(high
> > > memory tolerance) or the one totally sleeping(not busy for the time
> > > being, borrow some pages).
> >
> > I am really confused now. The memcg reclaim cannot really reclaim
> > anything from outside of the reclaimed hierarchy. Protected memcgs are
> > only considered if the reclaim was not able to reclaim anything during
> > the first hierarchy walk. That would imply that the reclaimed hierarchy
> > has either all memcgs with memory protected or non-protected memcgs do
> > not have any memory to reclaim.
> >
> > I think it would really help to provide much details about what is going
> > on here before we can move forward.
> >
> > > > Why both B and C have low limit setup and they both cannot be reclaimed?
> > > > Isn't that a weird setup where A hard limit is too close to sum of low
> > > > limits of B and C?
> > > >
> > > > In other words could you share a more detailed configuration you are
> > > > using and some more details why both B and C have been skipped during
> > > > the first pass of the reclaim?
> > > My practical scenario is that important processes(vip APP etc) are
> > > placed into protected memcg and keep other processes just under root.
> > > Current introduces direct reclaim because of alloc_pages(DMA_ALLOC
> > > etc), in which the number of allocation would be much larger than low
> > > but would NOT be charged to LRU. Whereas, current also wants to keep
> > > the pages(.so files to exec) on LRU.
> >
> > I am sorry but this description makes even less sense to me. If your
> > important process runs under a protected memcg and everything else is
> > running under root memcg then your memcg will get protected as long as
> > there is a reclaimable memory. There should ever be only global memory
> > reclaim happening, unless you specify a hard/high limit on your
> > important memcg. If you do so then there is no way to reclaim from
> > outside of that specific memcg.
> >
> > I really fail how your patch can help with either of those situations.
> 
> please find cgv2 hierarchy on my sys[1], where uid_2000 is a cgroup
> under root and trace_printk info[3] from trace_printk embedded in
> shrink_node[2]. I don't why you say there should be no reclaim from
> groups under root which opposite to[3]

That is not what I am saying. I am saying the protected (by low limit)
memcgs only get reclaimed if there is no reclaim progress from the
reclaimed hierarchy. In your case that would mean that there is no
reclaim from the root cgroup.

> [1]
> /sys/fs/cgroup # ls uid_2000
> cgroup.controllers  cgroup.max.depth        cgroup.stat
> cgroup.type   io.pressure     memory.events.local  memory.max
> memory.pressure      memory.swap.events
> cgroup.events       cgroup.max.descendants  cgroup.subtree_control
> cpu.pressure  memory.current  memory.high          memory.min
> memory.stat          memory.swap.max
> cgroup.freeze       cgroup.procs            cgroup.threads
> cpu.stat      memory.events   memory.low           memory.oom.group
> memory.swap.current  pid_275

This doesn't really help to make a better picture because it doesn't
tell the configuration. It would help to print all cgroups with memory
controller enabled and print memory.* values.

> [2]
> @@ -2962,6 +2962,7 @@ static bool shrink_node(pg_data_t *pgdat, struct
> scan_control *sc)
> 
>                         reclaimed = sc->nr_reclaimed;
>                         scanned = sc->nr_scanned;
> +                     trace_printk("root %x memcg %x reclaimed
> %ld\n",root_mem_cgroup,memcg,sc->nr_reclaimed);
>                         shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
>                         node_lru_pages += lru_pages;
> 
> [3]
>  allocator@4.0-s-1034    [005] ....   442.077013: shrink_node: root
> ef022800 memcg ef027800 reclaimed 41
>    kworker/u16:3-931     [002] ....   442.077019: shrink_node: root
> ef022800 memcg c7e54000 reclaimed 17
>  allocator@4.0-s-1034    [005] ....   442.077019: shrink_node: root
> ef022800 memcg ef025000 reclaimed 41
>  allocator@4.0-s-1034    [005] ....   442.077024: shrink_node: root
> ef022800 memcg ef023000 reclaimed 41
>    kworker/u16:3-931     [002] ....   442.077026: shrink_node: root
> ef022800 memcg c7e57800 reclaimed 17
>  allocator@4.0-s-1034    [005] ....   442.077028: shrink_node: root
> ef022800 memcg ef026800 reclaimed 41

It is impossible to tell which memcgs those are. It also doesn't tell
anything whether low limit has been broken due to lack of progress.

Zhaoyang Huang Oct. 20, 2021, 7:33 a.m. UTC | #13

On Tue, Oct 19, 2021 at 9:24 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 19-10-21 20:17:16, Zhaoyang Huang wrote:
> > On Tue, Oct 19, 2021 at 5:09 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 19-10-21 15:11:30, Zhaoyang Huang wrote:
> > > > On Mon, Oct 18, 2021 at 8:41 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Mon 18-10-21 17:25:23, Zhaoyang Huang wrote:
> > > > > > On Mon, Oct 18, 2021 at 4:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > > > I would be really curious about more specifics of the used hierarchy.
> > > > > > What I am facing is a typical scenario on Android, that is a big
> > > > > > memory consuming APP(camera etc) launched while background filled by
> > > > > > other processes. The hierarchy is like what you describe above where B
> > > > > > represents the APP and memory.low is set to help warm restart. Both of
> > > > > > kswapd and direct reclaim work together to reclaim pages under this
> > > > > > scenario, which can cause 20MB file page delete from LRU in several
> > > > > > second. This change could help to have current process's page escape
> > > > > > from being reclaimed and cause page thrashing. We observed the result
> > > > > > via systrace which shows that the Uninterruptible sleep(block on page
> > > > > > bit) and iowait get smaller than usual.
> > > > >
> > > > > I still have hard time to understand the exact setup and why the patch
> > > > > helps you. If you want to protect B more than the low limit would allow
> > > > > for by stealiong from C then the same thing can happen from anybody
> > > > > reclaiming from C so in the end there is no protection. The same would
> > > > > apply for any global direct memory reclaim done by a 3rd party. So I
> > > > > suspect that your patch just happens to work by a luck.
> > > > B and C compete fairly and superior than others. The idea based on
> > > > assuming NOT all groups will trap into direct reclaim concurrently, so
> > > > we want to have the groups steal pages from the processes under
> > > > root(Non-memory sensitive) or other groups with lower thresholds(high
> > > > memory tolerance) or the one totally sleeping(not busy for the time
> > > > being, borrow some pages).
> > >
> > > I am really confused now. The memcg reclaim cannot really reclaim
> > > anything from outside of the reclaimed hierarchy. Protected memcgs are
> > > only considered if the reclaim was not able to reclaim anything during
> > > the first hierarchy walk. That would imply that the reclaimed hierarchy
> > > has either all memcgs with memory protected or non-protected memcgs do
> > > not have any memory to reclaim.
> > >
> > > I think it would really help to provide much details about what is going
> > > on here before we can move forward.
> > >
> > > > > Why both B and C have low limit setup and they both cannot be reclaimed?
> > > > > Isn't that a weird setup where A hard limit is too close to sum of low
> > > > > limits of B and C?
> > > > >
> > > > > In other words could you share a more detailed configuration you are
> > > > > using and some more details why both B and C have been skipped during
> > > > > the first pass of the reclaim?
> > > > My practical scenario is that important processes(vip APP etc) are
> > > > placed into protected memcg and keep other processes just under root.
> > > > Current introduces direct reclaim because of alloc_pages(DMA_ALLOC
> > > > etc), in which the number of allocation would be much larger than low
> > > > but would NOT be charged to LRU. Whereas, current also wants to keep
> > > > the pages(.so files to exec) on LRU.
> > >
> > > I am sorry but this description makes even less sense to me. If your
> > > important process runs under a protected memcg and everything else is
> > > running under root memcg then your memcg will get protected as long as
> > > there is a reclaimable memory. There should ever be only global memory
> > > reclaim happening, unless you specify a hard/high limit on your
> > > important memcg. If you do so then there is no way to reclaim from
> > > outside of that specific memcg.
> > >
> > > I really fail how your patch can help with either of those situations.
> >
> > please find cgv2 hierarchy on my sys[1], where uid_2000 is a cgroup
> > under root and trace_printk info[3] from trace_printk embedded in
> > shrink_node[2]. I don't why you say there should be no reclaim from
> > groups under root which opposite to[3]
>
> That is not what I am saying. I am saying the protected (by low limit)
> memcgs only get reclaimed if there is no reclaim progress from the
> reclaimed hierarchy. In your case that would mean that there is no
> reclaim from the root cgroup.
Do you mean that direct reclaim should succeed for the first round
reclaim within which memcg get protected by memory.low and would NOT
retry by setting memcg_low_reclaim to true? It is not true in android
like system, where reclaim always failed and introduce lmk and even
OOM.
>
> > [1]
> > /sys/fs/cgroup # ls uid_2000
> > cgroup.controllers  cgroup.max.depth        cgroup.stat
> > cgroup.type   io.pressure     memory.events.local  memory.max
> > memory.pressure      memory.swap.events
> > cgroup.events       cgroup.max.descendants  cgroup.subtree_control
> > cpu.pressure  memory.current  memory.high          memory.min
> > memory.stat          memory.swap.max
> > cgroup.freeze       cgroup.procs            cgroup.threads
> > cpu.stat      memory.events   memory.low           memory.oom.group
> > memory.swap.current  pid_275
>
> This doesn't really help to make a better picture because it doesn't
> tell the configuration. It would help to print all cgroups with memory
> controller enabled and print memory.* values.
>
> > [2]
> > @@ -2962,6 +2962,7 @@ static bool shrink_node(pg_data_t *pgdat, struct
> > scan_control *sc)
> >
> >                         reclaimed = sc->nr_reclaimed;
> >                         scanned = sc->nr_scanned;
> > +                     trace_printk("root %x memcg %x reclaimed
> > %ld\n",root_mem_cgroup,memcg,sc->nr_reclaimed);
> >                         shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
> >                         node_lru_pages += lru_pages;
> >
> > [3]
> >  allocator@4.0-s-1034    [005] ....   442.077013: shrink_node: root
> > ef022800 memcg ef027800 reclaimed 41
> >    kworker/u16:3-931     [002] ....   442.077019: shrink_node: root
> > ef022800 memcg c7e54000 reclaimed 17
> >  allocator@4.0-s-1034    [005] ....   442.077019: shrink_node: root
> > ef022800 memcg ef025000 reclaimed 41
> >  allocator@4.0-s-1034    [005] ....   442.077024: shrink_node: root
> > ef022800 memcg ef023000 reclaimed 41
> >    kworker/u16:3-931     [002] ....   442.077026: shrink_node: root
> > ef022800 memcg c7e57800 reclaimed 17
> >  allocator@4.0-s-1034    [005] ....   442.077028: shrink_node: root
> > ef022800 memcg ef026800 reclaimed 41
>
> It is impossible to tell which memcgs those are. It also doesn't tell
> anything whether low limit has been broken due to lack of progress.
> --
> Michal Hocko
> SUSE Labs

Michal Hocko Oct. 20, 2021, 8:55 a.m. UTC | #14

On Wed 20-10-21 15:33:39, Zhaoyang Huang wrote:
[...]
> Do you mean that direct reclaim should succeed for the first round
> reclaim within which memcg get protected by memory.low and would NOT
> retry by setting memcg_low_reclaim to true?

Yes, this is the semantic of low limit protection in the upstream
kernel. Have a look at do_try_to_free_pages and how it sets
memcg_low_reclaim only if there were no pages reclaimed.

> It is not true in android
> like system, where reclaim always failed and introduce lmk and even
> OOM.

I am not familiar with android specific changes to the upstream reclaim
logic. You should be investigating why the reclaim couldn't make a
forward progress (aka reclaim pages) from non-protected memcgs. There
are tracepoints you can use (generally vmscan prefix).

Zhaoyang Huang Oct. 20, 2021, 11:45 a.m. UTC | #15

On Wed, Oct 20, 2021 at 4:55 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 20-10-21 15:33:39, Zhaoyang Huang wrote:
> [...]
> > Do you mean that direct reclaim should succeed for the first round
> > reclaim within which memcg get protected by memory.low and would NOT
> > retry by setting memcg_low_reclaim to true?
>
> Yes, this is the semantic of low limit protection in the upstream
> kernel. Have a look at do_try_to_free_pages and how it sets
> memcg_low_reclaim only if there were no pages reclaimed.
>
> > It is not true in android
> > like system, where reclaim always failed and introduce lmk and even
> > OOM.
>
> I am not familiar with android specific changes to the upstream reclaim
> logic. You should be investigating why the reclaim couldn't make a
> forward progress (aka reclaim pages) from non-protected memcgs. There
> are tracepoints you can use (generally vmscan prefix).
Ok, I am aware of why you get confused now. I think you are analysing
cgroup's behaviour according to a pre-defined workload and memory
pattern, which should work according to the design, such as processes
within root should provide memory before protected memcg get
reclaimed. You can refer [1] as the hierarchy, where effective
userspace workloads locate in protect groups and have rest of
processes be non-grouped. In fact, non-grouped ones can not provide
enough memory as they are kernel threads and the processes with few
pages on LRU(control logic inside). The practical scenario is groupA
launched a high-order kmalloc and introduce reclaiming(kswapd and
direct reclaim). As I said, non-grouped ones can not provide enough
contiguous memory blocks which let direct reclaim quickly fail for the
first round reclaiming. What I am trying to do is that let kswapd try
more for the target. It is also fair if groupA,B,C are trapping in
slow path concurrently.

[1]
root
|                                                       |
|              |
non-grouped processes             groupA    groupB  groupC
>
> --
> Michal Hocko
> SUSE Labs

Michal Hocko Oct. 20, 2021, 3:11 p.m. UTC | #16

On Wed 20-10-21 19:45:33, Zhaoyang Huang wrote:
> On Wed, Oct 20, 2021 at 4:55 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 20-10-21 15:33:39, Zhaoyang Huang wrote:
> > [...]
> > > Do you mean that direct reclaim should succeed for the first round
> > > reclaim within which memcg get protected by memory.low and would NOT
> > > retry by setting memcg_low_reclaim to true?
> >
> > Yes, this is the semantic of low limit protection in the upstream
> > kernel. Have a look at do_try_to_free_pages and how it sets
> > memcg_low_reclaim only if there were no pages reclaimed.
> >
> > > It is not true in android
> > > like system, where reclaim always failed and introduce lmk and even
> > > OOM.
> >
> > I am not familiar with android specific changes to the upstream reclaim
> > logic. You should be investigating why the reclaim couldn't make a
> > forward progress (aka reclaim pages) from non-protected memcgs. There
> > are tracepoints you can use (generally vmscan prefix).
> Ok, I am aware of why you get confused now. I think you are analysing
> cgroup's behaviour according to a pre-defined workload and memory
> pattern, which should work according to the design, such as processes
> within root should provide memory before protected memcg get
> reclaimed. You can refer [1] as the hierarchy, where effective
> userspace workloads locate in protect groups and have rest of
> processes be non-grouped. In fact, non-grouped ones can not provide
> enough memory as they are kernel threads and the processes with few
> pages on LRU(control logic inside). The practical scenario is groupA
> launched a high-order kmalloc and introduce reclaiming(kswapd and
> direct reclaim). As I said, non-grouped ones can not provide enough
> contiguous memory blocks which let direct reclaim quickly fail for the
> first round reclaiming. What I am trying to do is that let kswapd try
> more for the target. It is also fair if groupA,B,C are trapping in
> slow path concurrently.
> 
> [1]
> root
> |                                                       |
> |              |
> non-grouped processes             groupA    groupB  groupC

I am sorry but I still do not understand your setup. I have asked
several times for more specifics. Without that I cannot really wrap my
head around your (ever changing) statements. This is not really a
productive use of time. I am sorry but I cannot really help you much
without understanding the actual problem.

mm: skip current when memcg reclaim

Commit Message

Comments

Patch