diff mbox series

[RFC] memcg: use root_mem_cgroup when css is inherited

Message ID 1660908562-17409-1-git-send-email-zhaoyang.huang@unisoc.com (mailing list archive)
State New
Headers show
Series [RFC] memcg: use root_mem_cgroup when css is inherited | expand

Commit Message

zhaoyang.huang Aug. 19, 2022, 11:29 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

It is observed in android system where per-app cgroup is demanded by freezer
subsys and part of groups require memory control. The hierarchy could be simplized
as bellowing where memory charged on group B abserved while we only want have
group E's memory be controlled and B's descendants compete freely for memory.
This should be the consequences of unified hierarchy.
Under this scenario, less efficient memory reclaim is observed when comparing
with no memory control. It is believed that multi LRU scanning introduces some
of the overhead. Furthermore, page thrashing is also heavier than global LRU
which could be the consequences of partial failure of WORKINGSET mechanism as
LRU is too short to protect the active pages.

A(subtree_control = memory) - B(subtree_control = NULL) - C()
							\ D()
			    - E(subtree_control = memory) - F()
							  \ G()

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 include/linux/cgroup.h |  1 +
 kernel/cgroup/cgroup.c | 11 +++++++++++
 mm/memcontrol.c        |  5 ++++-
 3 files changed, 16 insertions(+), 1 deletion(-)

Comments

Tejun Heo Aug. 19, 2022, 4:29 p.m. UTC | #1
On Fri, Aug 19, 2022 at 07:29:22PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> It is observed in android system where per-app cgroup is demanded by freezer
> subsys and part of groups require memory control. The hierarchy could be simplized
> as bellowing where memory charged on group B abserved while we only want have
> group E's memory be controlled and B's descendants compete freely for memory.
> This should be the consequences of unified hierarchy.
> Under this scenario, less efficient memory reclaim is observed when comparing
> with no memory control. It is believed that multi LRU scanning introduces some
> of the overhead. Furthermore, page thrashing is also heavier than global LRU
> which could be the consequences of partial failure of WORKINGSET mechanism as
> LRU is too short to protect the active pages.
> 
> A(subtree_control = memory) - B(subtree_control = NULL) - C()
> 							\ D()
> 			    - E(subtree_control = memory) - F()
> 							  \ G()
> 
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Just in case it wasn't clear.

Nacked-by: Tejun Heo <tj@kernel.org>

Thanks.
Shakeel Butt Aug. 19, 2022, 5:08 p.m. UTC | #2
On Fri, Aug 19, 2022 at 9:29 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Fri, Aug 19, 2022 at 07:29:22PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > It is observed in android system where per-app cgroup is demanded by freezer
> > subsys and part of groups require memory control. The hierarchy could be simplized
> > as bellowing where memory charged on group B abserved while we only want have
> > group E's memory be controlled and B's descendants compete freely for memory.
> > This should be the consequences of unified hierarchy.
> > Under this scenario, less efficient memory reclaim is observed when comparing
> > with no memory control. It is believed that multi LRU scanning introduces some
> > of the overhead. Furthermore, page thrashing is also heavier than global LRU
> > which could be the consequences of partial failure of WORKINGSET mechanism as
> > LRU is too short to protect the active pages.
> >
> > A(subtree_control = memory) - B(subtree_control = NULL) - C()
> >                                                       \ D()
> >                           - E(subtree_control = memory) - F()
> >                                                         \ G()
> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> Just in case it wasn't clear.
>
> Nacked-by: Tejun Heo <tj@kernel.org>
>
> Thanks.
>

Was there a previous discussion on this? The commit message is unreadable.
Tejun Heo Aug. 19, 2022, 5:10 p.m. UTC | #3
On Fri, Aug 19, 2022 at 10:08:59AM -0700, Shakeel Butt wrote:
> On Fri, Aug 19, 2022 at 9:29 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Fri, Aug 19, 2022 at 07:29:22PM +0800, zhaoyang.huang wrote:
> > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > >
> > > It is observed in android system where per-app cgroup is demanded by freezer
> > > subsys and part of groups require memory control. The hierarchy could be simplized
> > > as bellowing where memory charged on group B abserved while we only want have
> > > group E's memory be controlled and B's descendants compete freely for memory.
> > > This should be the consequences of unified hierarchy.
> > > Under this scenario, less efficient memory reclaim is observed when comparing
> > > with no memory control. It is believed that multi LRU scanning introduces some
> > > of the overhead. Furthermore, page thrashing is also heavier than global LRU
> > > which could be the consequences of partial failure of WORKINGSET mechanism as
> > > LRU is too short to protect the active pages.
> > >
> > > A(subtree_control = memory) - B(subtree_control = NULL) - C()
> > >                                                       \ D()
> > >                           - E(subtree_control = memory) - F()
> > >                                                         \ G()
> > >
> > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > Just in case it wasn't clear.
> >
> > Nacked-by: Tejun Heo <tj@kernel.org>
> >
> > Thanks.
> >
> 
> Was there a previous discussion on this? The commit message is unreadable.

http://lkml.kernel.org/r/1660298966-11493-1-git-send-email-zhaoyang.huang@unisoc.com

Thanks.
Zhaoyang Huang Aug. 23, 2022, 2:31 a.m. UTC | #4
On Mon, Aug 22, 2022 at 7:31 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-08-22 07:10:26, Tejun Heo wrote:
> > On Fri, Aug 19, 2022 at 10:08:59AM -0700, Shakeel Butt wrote:
> > > On Fri, Aug 19, 2022 at 9:29 AM Tejun Heo <tj@kernel.org> wrote:
> > > >
> > > > On Fri, Aug 19, 2022 at 07:29:22PM +0800, zhaoyang.huang wrote:
> > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > >
> > > > > It is observed in android system where per-app cgroup is demanded by freezer
> > > > > subsys and part of groups require memory control. The hierarchy could be simplized
> > > > > as bellowing where memory charged on group B abserved while we only want have
> > > > > group E's memory be controlled and B's descendants compete freely for memory.
> > > > > This should be the consequences of unified hierarchy.
> > > > > Under this scenario, less efficient memory reclaim is observed when comparing
> > > > > with no memory control. It is believed that multi LRU scanning introduces some
> > > > > of the overhead. Furthermore, page thrashing is also heavier than global LRU
> > > > > which could be the consequences of partial failure of WORKINGSET mechanism as
> > > > > LRU is too short to protect the active pages.
> > > > >
> > > > > A(subtree_control = memory) - B(subtree_control = NULL) - C()
> > > > >                                                       \ D()
> > > > >                           - E(subtree_control = memory) - F()
> > > > >                                                         \ G()
> > > > >
> > > > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > >
> > > > Just in case it wasn't clear.
> > > >
> > > > Nacked-by: Tejun Heo <tj@kernel.org>
> > > >
> > > > Thanks.
> > > >
> > >
> > > Was there a previous discussion on this? The commit message is unreadable.
> >
> > http://lkml.kernel.org/r/1660298966-11493-1-git-send-email-zhaoyang.huang@unisoc.com
>
> Even that discussion doesn't really explain the real underlying problem.
> There are statements about inefficiency and trashing without any further
> details or clarifications.
I would like to quote the comments from google side for more details
which can also be observed from different vendors.
"Also be advised that when you enable memcg v2 you will be using
per-app memcg configuration which implies noticeable overhead because
every app will have its own group. For example pagefault path will
regress by about 15%. And obviously there will be some memory overhead
as well. That's the reason we don't enable them in Android by
default."
>
> My very vague understanding is that the Android system would like to
> freeze specific applications and for that it requires each application
> to live in its own cgroup. This clashes with a requirement to age and
> reclaim memory on a different granularity (aka no per process reclaim).
> So in fact something that cgroup v1 would achieve by having 2
> hierarchies, one for the freezer which would have a dedicated cgroup for
> each application and the other for the memory controller where tasks are
> grouped by a different criteria. This would rule out that a global (or
> any external memory pressure) reclaim would age LRUs that contain a mix
> bag of application pages rather than iterate over per-application LRUs.
> Is that understanding correct?
Correct, this is just our confusion. Besides, we believe that charge
the pages to implicit memory enabled parent control group doesn't make
sense as the memory cannot be managed at all.
> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang Aug. 23, 2022, 6:03 a.m. UTC | #5
On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
> > On Mon, Aug 22, 2022 at 7:31 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 19-08-22 07:10:26, Tejun Heo wrote:
> > > > On Fri, Aug 19, 2022 at 10:08:59AM -0700, Shakeel Butt wrote:
> > > > > On Fri, Aug 19, 2022 at 9:29 AM Tejun Heo <tj@kernel.org> wrote:
> > > > > >
> > > > > > On Fri, Aug 19, 2022 at 07:29:22PM +0800, zhaoyang.huang wrote:
> > > > > > > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > > >
> > > > > > > It is observed in android system where per-app cgroup is demanded by freezer
> > > > > > > subsys and part of groups require memory control. The hierarchy could be simplized
> > > > > > > as bellowing where memory charged on group B abserved while we only want have
> > > > > > > group E's memory be controlled and B's descendants compete freely for memory.
> > > > > > > This should be the consequences of unified hierarchy.
> > > > > > > Under this scenario, less efficient memory reclaim is observed when comparing
> > > > > > > with no memory control. It is believed that multi LRU scanning introduces some
> > > > > > > of the overhead. Furthermore, page thrashing is also heavier than global LRU
> > > > > > > which could be the consequences of partial failure of WORKINGSET mechanism as
> > > > > > > LRU is too short to protect the active pages.
> > > > > > >
> > > > > > > A(subtree_control = memory) - B(subtree_control = NULL) - C()
> > > > > > >                                                       \ D()
> > > > > > >                           - E(subtree_control = memory) - F()
> > > > > > >                                                         \ G()
> > > > > > >
> > > > > > > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > > > > >
> > > > > > Just in case it wasn't clear.
> > > > > >
> > > > > > Nacked-by: Tejun Heo <tj@kernel.org>
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > >
> > > > > Was there a previous discussion on this? The commit message is unreadable.
> > > >
> > > > http://lkml.kernel.org/r/1660298966-11493-1-git-send-email-zhaoyang.huang@unisoc.com
> > >
> > > Even that discussion doesn't really explain the real underlying problem.
> > > There are statements about inefficiency and trashing without any further
> > > details or clarifications.
> > I would like to quote the comments from google side for more details
> > which can also be observed from different vendors.
> > "Also be advised that when you enable memcg v2 you will be using
> > per-app memcg configuration which implies noticeable overhead because
> > every app will have its own group. For example pagefault path will
> > regress by about 15%. And obviously there will be some memory overhead
> > as well. That's the reason we don't enable them in Android by
> > default."
>
> This should be reported and investigated. Because per-application memcg
> vs. memcg in general shouldn't make much of a difference from the
> performance side. I can see a potential performance impact for no-memcg
> vs. memcg case but even then 15% is quite a lot.
Less efficiency on memory reclaim caused by multi-LRU should be one of
the reason, which has been proved by comparing per-app memcg on/off.
Besides, theoretically workingset could also broken as LRU is too
short to compose workingset.
>
> > > My very vague understanding is that the Android system would like to
> > > freeze specific applications and for that it requires each application
> > > to live in its own cgroup. This clashes with a requirement to age and
> > > reclaim memory on a different granularity (aka no per process reclaim).
> > > So in fact something that cgroup v1 would achieve by having 2
> > > hierarchies, one for the freezer which would have a dedicated cgroup for
> > > each application and the other for the memory controller where tasks are
> > > grouped by a different criteria. This would rule out that a global (or
> > > any external memory pressure) reclaim would age LRUs that contain a mix
> > > bag of application pages rather than iterate over per-application LRUs.
> > > Is that understanding correct?
> > Correct, this is just our confusion. Besides, we believe that charge
> > the pages to implicit memory enabled parent control group doesn't make
> > sense as the memory cannot be managed at all.
>
> I do not get that part. The parent can manange and control the memory
> usage so how come it cannot be managed at all?
What I mean is the kind of parent which is enabled implicitly by
enabling on its sibling group like belowing hierarchy. Imagine that C
has no intention of memory control but has to be enabled as B would
have it. IMO, it doesn't make sense to charge C1's memory.current to C
until an explicitly echo "+memory" >  C/subtree_control.
A----B---B1
     \ C---C1
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 23, 2022, 8:33 a.m. UTC | #6
On Tue 23-08-22 14:03:04, Zhaoyang Huang wrote:
> On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
[...]
> > > I would like to quote the comments from google side for more details
> > > which can also be observed from different vendors.
> > > "Also be advised that when you enable memcg v2 you will be using
> > > per-app memcg configuration which implies noticeable overhead because
> > > every app will have its own group. For example pagefault path will
> > > regress by about 15%. And obviously there will be some memory overhead
> > > as well. That's the reason we don't enable them in Android by
> > > default."
> >
> > This should be reported and investigated. Because per-application memcg
> > vs. memcg in general shouldn't make much of a difference from the
> > performance side. I can see a potential performance impact for no-memcg
> > vs. memcg case but even then 15% is quite a lot.
> Less efficiency on memory reclaim caused by multi-LRU should be one of
> the reason, which has been proved by comparing per-app memcg on/off.
> Besides, theoretically workingset could also broken as LRU is too
> short to compose workingset.

Do you have any data to back these claims? Is this something that could
be handled on the configuration level? E.g. by applying low limit
protection to keep the workingset in the memory?

> > > > My very vague understanding is that the Android system would like to
> > > > freeze specific applications and for that it requires each application
> > > > to live in its own cgroup. This clashes with a requirement to age and
> > > > reclaim memory on a different granularity (aka no per process reclaim).
> > > > So in fact something that cgroup v1 would achieve by having 2
> > > > hierarchies, one for the freezer which would have a dedicated cgroup for
> > > > each application and the other for the memory controller where tasks are
> > > > grouped by a different criteria. This would rule out that a global (or
> > > > any external memory pressure) reclaim would age LRUs that contain a mix
> > > > bag of application pages rather than iterate over per-application LRUs.
> > > > Is that understanding correct?
> > > Correct, this is just our confusion. Besides, we believe that charge
> > > the pages to implicit memory enabled parent control group doesn't make
> > > sense as the memory cannot be managed at all.
> >
> > I do not get that part. The parent can manange and control the memory
> > usage so how come it cannot be managed at all?
> What I mean is the kind of parent which is enabled implicitly by
> enabling on its sibling group like belowing hierarchy. Imagine that C
> has no intention of memory control but has to be enabled as B would
> have it. IMO, it doesn't make sense to charge C1's memory.current to C
> until an explicitly echo "+memory" >  C/subtree_control.
> A----B---B1
>      \ C---C1

So let me just expand your example for clarity

A.cgroup.controllers = memory
A.cgroup.subtree_control = memory

A/B.cgroup.controllers = memory
A/B.cgroup.subtree_control = memory
A/B/B1.cgroup.controllers = memory

A/C.cgroup.controllers = memory
A/C.cgroup.subtree_control = ""
A/C/C1.cgroup.controllers = ""

Is your concern that C1 is charged to A/C or that you cannot actually make
A/C.cgroup.controllers = "" because you want to maintain memory in A?
Because that would be breaking the internal node constrain rule AFAICS.

Or maybe you just really want a different hierarchy where
A == root_cgroup and want the memory acocunted in B
(root/B.cgroup.controllers = memory) but not in C (root/C.cgroup.controllers = "")?

That would mean that C memory would be maintained on the global (root
memcg) LRUs which is the only internal node which is allowed to have
resources because it is special.
Zhaoyang Huang Aug. 23, 2022, 9:20 a.m. UTC | #7
On Tue, Aug 23, 2022 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-08-22 14:03:04, Zhaoyang Huang wrote:
> > On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
> [...]
> > > > I would like to quote the comments from google side for more details
> > > > which can also be observed from different vendors.
> > > > "Also be advised that when you enable memcg v2 you will be using
> > > > per-app memcg configuration which implies noticeable overhead because
> > > > every app will have its own group. For example pagefault path will
> > > > regress by about 15%. And obviously there will be some memory overhead
> > > > as well. That's the reason we don't enable them in Android by
> > > > default."
> > >
> > > This should be reported and investigated. Because per-application memcg
> > > vs. memcg in general shouldn't make much of a difference from the
> > > performance side. I can see a potential performance impact for no-memcg
> > > vs. memcg case but even then 15% is quite a lot.
> > Less efficiency on memory reclaim caused by multi-LRU should be one of
> > the reason, which has been proved by comparing per-app memcg on/off.
> > Besides, theoretically workingset could also broken as LRU is too
> > short to compose workingset.
>
> Do you have any data to back these claims? Is this something that could
> be handled on the configuration level? E.g. by applying low limit
> protection to keep the workingset in the memory?
I don't think so. IMO, workingset works when there are pages evicted
from LRU and then refault which provide refault distance for pages.
Applying memcg's protection will have all LRU out of evicted which
make the mechanism fail.
>
> > > > > My very vague understanding is that the Android system would like to
> > > > > freeze specific applications and for that it requires each application
> > > > > to live in its own cgroup. This clashes with a requirement to age and
> > > > > reclaim memory on a different granularity (aka no per process reclaim).
> > > > > So in fact something that cgroup v1 would achieve by having 2
> > > > > hierarchies, one for the freezer which would have a dedicated cgroup for
> > > > > each application and the other for the memory controller where tasks are
> > > > > grouped by a different criteria. This would rule out that a global (or
> > > > > any external memory pressure) reclaim would age LRUs that contain a mix
> > > > > bag of application pages rather than iterate over per-application LRUs.
> > > > > Is that understanding correct?
> > > > Correct, this is just our confusion. Besides, we believe that charge
> > > > the pages to implicit memory enabled parent control group doesn't make
> > > > sense as the memory cannot be managed at all.
> > >
> > > I do not get that part. The parent can manange and control the memory
> > > usage so how come it cannot be managed at all?
> > What I mean is the kind of parent which is enabled implicitly by
> > enabling on its sibling group like belowing hierarchy. Imagine that C
> > has no intention of memory control but has to be enabled as B would
> > have it. IMO, it doesn't make sense to charge C1's memory.current to C
> > until an explicitly echo "+memory" >  C/subtree_control.
> > A----B---B1
> >      \ C---C1
>
> So let me just expand your example for clarity
>
> A.cgroup.controllers = memory
> A.cgroup.subtree_control = memory
>
> A/B.cgroup.controllers = memory
> A/B.cgroup.subtree_control = memory
> A/B/B1.cgroup.controllers = memory
>
> A/C.cgroup.controllers = memory
> A/C.cgroup.subtree_control = ""
> A/C/C1.cgroup.controllers = ""
Yes for above hierarchy and configuration.
>
> Is your concern that C1 is charged to A/C or that you cannot actually make
> A/C.cgroup.controllers = "" because you want to maintain memory in A?
> Because that would be breaking the internal node constrain rule AFAICS.
No. I just want to keep memory on B.
>
> Or maybe you just really want a different hierarchy where
> A == root_cgroup and want the memory acocunted in B
> (root/B.cgroup.controllers = memory) but not in C (root/C.cgroup.controllers = "")?
Yes.
>
> That would mean that C memory would be maintained on the global (root
> memcg) LRUs which is the only internal node which is allowed to have
> resources because it is special.
Exactly. I would like to have all groups like C which have no parent's
subtree_control = memory charge memory to root. Under this
implementation, memory under enabled group will be protected by
min/low while other groups' memory share the same LRU to have
workingset things take effect.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 23, 2022, 11:51 a.m. UTC | #8
On Tue 23-08-22 17:20:59, Zhaoyang Huang wrote:
> On Tue, Aug 23, 2022 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 23-08-22 14:03:04, Zhaoyang Huang wrote:
> > > On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
> > [...]
> > > > > I would like to quote the comments from google side for more details
> > > > > which can also be observed from different vendors.
> > > > > "Also be advised that when you enable memcg v2 you will be using
> > > > > per-app memcg configuration which implies noticeable overhead because
> > > > > every app will have its own group. For example pagefault path will
> > > > > regress by about 15%. And obviously there will be some memory overhead
> > > > > as well. That's the reason we don't enable them in Android by
> > > > > default."
> > > >
> > > > This should be reported and investigated. Because per-application memcg
> > > > vs. memcg in general shouldn't make much of a difference from the
> > > > performance side. I can see a potential performance impact for no-memcg
> > > > vs. memcg case but even then 15% is quite a lot.
> > > Less efficiency on memory reclaim caused by multi-LRU should be one of
> > > the reason, which has been proved by comparing per-app memcg on/off.
> > > Besides, theoretically workingset could also broken as LRU is too
> > > short to compose workingset.
> >
> > Do you have any data to back these claims? Is this something that could
> > be handled on the configuration level? E.g. by applying low limit
> > protection to keep the workingset in the memory?
> I don't think so. IMO, workingset works when there are pages evicted
> from LRU and then refault which provide refault distance for pages.
> Applying memcg's protection will have all LRU out of evicted which
> make the mechanism fail.

It is really hard to help you out without any actual data. The idea was
though to use the low limit protection to adaptively configure
respective memcgs to reduce refaults. You already have data about
refaults ready so increasing the limit for often refaulting memcgs would
reduce the trashing.

[...]
> > A.cgroup.controllers = memory
> > A.cgroup.subtree_control = memory
> >
> > A/B.cgroup.controllers = memory
> > A/B.cgroup.subtree_control = memory
> > A/B/B1.cgroup.controllers = memory
> >
> > A/C.cgroup.controllers = memory
> > A/C.cgroup.subtree_control = ""
> > A/C/C1.cgroup.controllers = ""
> Yes for above hierarchy and configuration.
> >
> > Is your concern that C1 is charged to A/C or that you cannot actually make
> > A/C.cgroup.controllers = "" because you want to maintain memory in A?
> > Because that would be breaking the internal node constrain rule AFAICS.
> No. I just want to keep memory on B.

That would require A to be without controllers which is not possible due
to hierarchical constrain.

> > Or maybe you just really want a different hierarchy where
> > A == root_cgroup and want the memory acocunted in B
> > (root/B.cgroup.controllers = memory) but not in C (root/C.cgroup.controllers = "")?
> Yes.
> >
> > That would mean that C memory would be maintained on the global (root
> > memcg) LRUs which is the only internal node which is allowed to have
> > resources because it is special.
> Exactly. I would like to have all groups like C which have no parent's
> subtree_control = memory charge memory to root. Under this
> implementation, memory under enabled group will be protected by
> min/low while other groups' memory share the same LRU to have
> workingset things take effect.

One way to achieve that would be shaping the hierarchy the following way
	    root
	/         \
no_memcg[1]      memcg[2]
||||||||         |||||
app_cgroups     app_cgroups

with 
no_memcg.subtree_control = ""
memcg.subtree_control = memory

no?

You haven't really described why you need per application freezer cgroup
but I suspect you want to selectively freeze applications. Is there
any obstacle to have a dedicated frozen cgroup and migrate tasks to be
frozen there?
Suren Baghdasaryan Aug. 23, 2022, 4:21 p.m. UTC | #9
On Tue, Aug 23, 2022 at 4:51 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-08-22 17:20:59, Zhaoyang Huang wrote:
> > On Tue, Aug 23, 2022 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 23-08-22 14:03:04, Zhaoyang Huang wrote:
> > > > On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
> > > [...]
> > > > > > I would like to quote the comments from google side for more details
> > > > > > which can also be observed from different vendors.
> > > > > > "Also be advised that when you enable memcg v2 you will be using
> > > > > > per-app memcg configuration which implies noticeable overhead because
> > > > > > every app will have its own group. For example pagefault path will
> > > > > > regress by about 15%. And obviously there will be some memory overhead
> > > > > > as well. That's the reason we don't enable them in Android by
> > > > > > default."
> > > > >
> > > > > This should be reported and investigated. Because per-application memcg
> > > > > vs. memcg in general shouldn't make much of a difference from the
> > > > > performance side. I can see a potential performance impact for no-memcg
> > > > > vs. memcg case but even then 15% is quite a lot.
> > > > Less efficiency on memory reclaim caused by multi-LRU should be one of
> > > > the reason, which has been proved by comparing per-app memcg on/off.
> > > > Besides, theoretically workingset could also broken as LRU is too
> > > > short to compose workingset.
> > >
> > > Do you have any data to back these claims? Is this something that could
> > > be handled on the configuration level? E.g. by applying low limit
> > > protection to keep the workingset in the memory?
> > I don't think so. IMO, workingset works when there are pages evicted
> > from LRU and then refault which provide refault distance for pages.
> > Applying memcg's protection will have all LRU out of evicted which
> > make the mechanism fail.
>
> It is really hard to help you out without any actual data. The idea was
> though to use the low limit protection to adaptively configure
> respective memcgs to reduce refaults. You already have data about
> refaults ready so increasing the limit for often refaulting memcgs would
> reduce the trashing.

Sorry for joining late.
A couple years ago I tested root-memcg vs per-app memcg configurations
on an Android phone. Here is a snapshot from my findings:

Problem
=======
We see tangible increase in major faults and workingset refaults when
transitioning from root-only memory cgroup to per-application cgroups
on Android.

Test results
============
Results while running memory-demanding workload:
root memcg     per-app memcg     delta
workingset_refault 1771228 3874281 +118.73%
workingset_nodereclaim 4543 13928 +206.58%
pgpgin 13319208 20618944 +54.81%
pgpgout 1739552 3080664 +77.1%
pgpgoutclean 2616571 4805755 +83.67%
pswpin 359211 3918716 +990.92%
pswpout 1082238 5697463 +426.45%
pgfree 28978393 32531010 +12.26%
pgactivate 2586562 8731113 +237.56%
pgdeactivate 3811074 11670051 +206.21%
pgfault 38692510 46096963 +19.14%
pgmajfault 441288 4100020 +829.1%
pgrefill 4590451 12768165 +178.15%

Results while running application cycle test (20 apps, 20 cycles):
root memcg     per-app memcg     delta
workingset_refault 10634691 11429223 +7.47%
workingset_nodereclaim 37477 59033 +57.52%
pgpgin 70662840 69569516 -1.55%
pgpgout 2605968 2695596 +3.44%
pgpgoutclean 13514955 14980610 +10.84%
pswpin 1489851 3780868 +153.77%
pswpout 4125547 8050819 +95.15%
pgfree 99823083 105104637 +5.29%
pgactivate 7685275 11647913 +51.56%
pgdeactivate 14193660 21459784 +51.19%
pgfault 89173166 100598528 +12.81%
pgmajfault 1856172 4227190 +127.74%
pgrefill 16643554 23203927 +39.42%

Tests were conducted on an Android phone with 4GB RAM.
Similar regression was reported a couple years ago here:
https://www.spinics.net/lists/linux-mm/msg121665.html

I plan on checking the difference again on newer kernels (likely 5.15)
after LPC this September.

>
> [...]
> > > A.cgroup.controllers = memory
> > > A.cgroup.subtree_control = memory
> > >
> > > A/B.cgroup.controllers = memory
> > > A/B.cgroup.subtree_control = memory
> > > A/B/B1.cgroup.controllers = memory
> > >
> > > A/C.cgroup.controllers = memory
> > > A/C.cgroup.subtree_control = ""
> > > A/C/C1.cgroup.controllers = ""
> > Yes for above hierarchy and configuration.
> > >
> > > Is your concern that C1 is charged to A/C or that you cannot actually make
> > > A/C.cgroup.controllers = "" because you want to maintain memory in A?
> > > Because that would be breaking the internal node constrain rule AFAICS.
> > No. I just want to keep memory on B.
>
> That would require A to be without controllers which is not possible due
> to hierarchical constrain.
>
> > > Or maybe you just really want a different hierarchy where
> > > A == root_cgroup and want the memory acocunted in B
> > > (root/B.cgroup.controllers = memory) but not in C (root/C.cgroup.controllers = "")?
> > Yes.
> > >
> > > That would mean that C memory would be maintained on the global (root
> > > memcg) LRUs which is the only internal node which is allowed to have
> > > resources because it is special.
> > Exactly. I would like to have all groups like C which have no parent's
> > subtree_control = memory charge memory to root. Under this
> > implementation, memory under enabled group will be protected by
> > min/low while other groups' memory share the same LRU to have
> > workingset things take effect.
>
> One way to achieve that would be shaping the hierarchy the following way
>             root
>         /         \
> no_memcg[1]      memcg[2]
> ||||||||         |||||
> app_cgroups     app_cgroups
>
> with
> no_memcg.subtree_control = ""
> memcg.subtree_control = memory
>
> no?
>
> You haven't really described why you need per application freezer cgroup
> but I suspect you want to selectively freeze applications. Is there
> any obstacle to have a dedicated frozen cgroup and migrate tasks to be
> frozen there?

We intend for Android to gradually migrate to v2 cgroups for all
controllers and given that it has to use a unified hierarchy,
per-application hierarchy provides highest flexibility. That way we
can control every aspect of every app without affecting others. Of
course that comes with its overhead.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang Aug. 24, 2022, 2:23 a.m. UTC | #10
On Tue, Aug 23, 2022 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-08-22 17:20:59, Zhaoyang Huang wrote:
> > On Tue, Aug 23, 2022 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 23-08-22 14:03:04, Zhaoyang Huang wrote:
> > > > On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
> > > [...]
> > > > > > I would like to quote the comments from google side for more details
> > > > > > which can also be observed from different vendors.
> > > > > > "Also be advised that when you enable memcg v2 you will be using
> > > > > > per-app memcg configuration which implies noticeable overhead because
> > > > > > every app will have its own group. For example pagefault path will
> > > > > > regress by about 15%. And obviously there will be some memory overhead
> > > > > > as well. That's the reason we don't enable them in Android by
> > > > > > default."
> > > > >
> > > > > This should be reported and investigated. Because per-application memcg
> > > > > vs. memcg in general shouldn't make much of a difference from the
> > > > > performance side. I can see a potential performance impact for no-memcg
> > > > > vs. memcg case but even then 15% is quite a lot.
> > > > Less efficiency on memory reclaim caused by multi-LRU should be one of
> > > > the reason, which has been proved by comparing per-app memcg on/off.
> > > > Besides, theoretically workingset could also broken as LRU is too
> > > > short to compose workingset.
> > >
> > > Do you have any data to back these claims? Is this something that could
> > > be handled on the configuration level? E.g. by applying low limit
> > > protection to keep the workingset in the memory?
> > I don't think so. IMO, workingset works when there are pages evicted
> > from LRU and then refault which provide refault distance for pages.
> > Applying memcg's protection will have all LRU out of evicted which
> > make the mechanism fail.
>
> It is really hard to help you out without any actual data. The idea was
> though to use the low limit protection to adaptively configure
> respective memcgs to reduce refaults. You already have data about
> refaults ready so increasing the limit for often refaulting memcgs would
> reduce the trashing.
>
> [...]
> > > A.cgroup.controllers = memory
> > > A.cgroup.subtree_control = memory
> > >
> > > A/B.cgroup.controllers = memory
> > > A/B.cgroup.subtree_control = memory
> > > A/B/B1.cgroup.controllers = memory
> > >
> > > A/C.cgroup.controllers = memory
> > > A/C.cgroup.subtree_control = ""
> > > A/C/C1.cgroup.controllers = ""
> > Yes for above hierarchy and configuration.
> > >
> > > Is your concern that C1 is charged to A/C or that you cannot actually make
> > > A/C.cgroup.controllers = "" because you want to maintain memory in A?
> > > Because that would be breaking the internal node constrain rule AFAICS.
> > No. I just want to keep memory on B.
>
> That would require A to be without controllers which is not possible due
> to hierarchical constrain.
>
> > > Or maybe you just really want a different hierarchy where
> > > A == root_cgroup and want the memory acocunted in B
> > > (root/B.cgroup.controllers = memory) but not in C (root/C.cgroup.controllers = "")?
> > Yes.
> > >
> > > That would mean that C memory would be maintained on the global (root
> > > memcg) LRUs which is the only internal node which is allowed to have
> > > resources because it is special.
> > Exactly. I would like to have all groups like C which have no parent's
> > subtree_control = memory charge memory to root. Under this
> > implementation, memory under enabled group will be protected by
> > min/low while other groups' memory share the same LRU to have
> > workingset things take effect.
>
> One way to achieve that would be shaping the hierarchy the following way
>             root
>         /         \
> no_memcg[1]      memcg[2]
> ||||||||         |||||
> app_cgroups     app_cgroups
>
> with
> no_memcg.subtree_control = ""
> memcg.subtree_control = memory
>
> no?
According to my understanding, No as there will be no no_memcg. All
children groups under root would have its cgroup.controllers = memory
as long as root has memory enabled. Under this circumstance, all
descendants group under 'no_memcg' will charge memory to its parent
group. This is caused by e_css  policy when apply subsys control which
have child group use its first level ancestors css.
>
> You haven't really described why you need per application freezer cgroup
> but I suspect you want to selectively freeze applications. Is there
> any obstacle to have a dedicated frozen cgroup and migrate tasks to be
> frozen there?
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 24, 2022, 7:50 a.m. UTC | #11
On Wed 24-08-22 10:23:14, Zhaoyang Huang wrote:
> On Tue, Aug 23, 2022 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > One way to achieve that would be shaping the hierarchy the following way
> >             root
> >         /         \
> > no_memcg[1]      memcg[2]
> > ||||||||         |||||
> > app_cgroups     app_cgroups
> >
> > with
> > no_memcg.subtree_control = ""
> > memcg.subtree_control = memory
> >
> > no?
> According to my understanding, No as there will be no no_memcg. All
> children groups under root would have its cgroup.controllers = memory
> as long as root has memory enabled.

Correct

> Under this circumstance, all
> descendants group under 'no_memcg' will charge memory to its parent
> group.

Correct. And why is that a problem? I thought you main concern was a per
application LRUs. With the above configuration all app_cgroups which do
not require an explicit memory control will share the same (no_memcg)
LRU and they will be aged together.
Michal Hocko Aug. 24, 2022, 7:59 a.m. UTC | #12
On Tue 23-08-22 09:21:16, Suren Baghdasaryan wrote:
> On Tue, Aug 23, 2022 at 4:51 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 23-08-22 17:20:59, Zhaoyang Huang wrote:
> > > On Tue, Aug 23, 2022 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Tue 23-08-22 14:03:04, Zhaoyang Huang wrote:
> > > > > On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
> > > > [...]
> > > > > > > I would like to quote the comments from google side for more details
> > > > > > > which can also be observed from different vendors.
> > > > > > > "Also be advised that when you enable memcg v2 you will be using
> > > > > > > per-app memcg configuration which implies noticeable overhead because
> > > > > > > every app will have its own group. For example pagefault path will
> > > > > > > regress by about 15%. And obviously there will be some memory overhead
> > > > > > > as well. That's the reason we don't enable them in Android by
> > > > > > > default."
> > > > > >
> > > > > > This should be reported and investigated. Because per-application memcg
> > > > > > vs. memcg in general shouldn't make much of a difference from the
> > > > > > performance side. I can see a potential performance impact for no-memcg
> > > > > > vs. memcg case but even then 15% is quite a lot.
> > > > > Less efficiency on memory reclaim caused by multi-LRU should be one of
> > > > > the reason, which has been proved by comparing per-app memcg on/off.
> > > > > Besides, theoretically workingset could also broken as LRU is too
> > > > > short to compose workingset.
> > > >
> > > > Do you have any data to back these claims? Is this something that could
> > > > be handled on the configuration level? E.g. by applying low limit
> > > > protection to keep the workingset in the memory?
> > > I don't think so. IMO, workingset works when there are pages evicted
> > > from LRU and then refault which provide refault distance for pages.
> > > Applying memcg's protection will have all LRU out of evicted which
> > > make the mechanism fail.
> >
> > It is really hard to help you out without any actual data. The idea was
> > though to use the low limit protection to adaptively configure
> > respective memcgs to reduce refaults. You already have data about
> > refaults ready so increasing the limit for often refaulting memcgs would
> > reduce the trashing.
> 
> Sorry for joining late.
> A couple years ago I tested root-memcg vs per-app memcg configurations
> on an Android phone. Here is a snapshot from my findings:
> 
> Problem
> =======
> We see tangible increase in major faults and workingset refaults when
> transitioning from root-only memory cgroup to per-application cgroups
> on Android.
> 
> Test results
> ============
> Results while running memory-demanding workload:
> root memcg     per-app memcg     delta
> workingset_refault 1771228 3874281 +118.73%
> workingset_nodereclaim 4543 13928 +206.58%
> pgpgin 13319208 20618944 +54.81%
> pgpgout 1739552 3080664 +77.1%
> pgpgoutclean 2616571 4805755 +83.67%
> pswpin 359211 3918716 +990.92%
> pswpout 1082238 5697463 +426.45%
> pgfree 28978393 32531010 +12.26%
> pgactivate 2586562 8731113 +237.56%
> pgdeactivate 3811074 11670051 +206.21%
> pgfault 38692510 46096963 +19.14%
> pgmajfault 441288 4100020 +829.1%
> pgrefill 4590451 12768165 +178.15%
> 
> Results while running application cycle test (20 apps, 20 cycles):
> root memcg     per-app memcg     delta
> workingset_refault 10634691 11429223 +7.47%
> workingset_nodereclaim 37477 59033 +57.52%
> pgpgin 70662840 69569516 -1.55%
> pgpgout 2605968 2695596 +3.44%
> pgpgoutclean 13514955 14980610 +10.84%
> pswpin 1489851 3780868 +153.77%
> pswpout 4125547 8050819 +95.15%
> pgfree 99823083 105104637 +5.29%
> pgactivate 7685275 11647913 +51.56%
> pgdeactivate 14193660 21459784 +51.19%
> pgfault 89173166 100598528 +12.81%
> pgmajfault 1856172 4227190 +127.74%
> pgrefill 16643554 23203927 +39.42%

Thanks! It would be interesting to see per memcg stats as well. Are
there any outliers? Are there any signs of over-reclaim (more pages
scanned & reclaimed by both kswapd and direct reclaim?

> Tests were conducted on an Android phone with 4GB RAM.
> Similar regression was reported a couple years ago here:
> https://www.spinics.net/lists/linux-mm/msg121665.html
> 
> I plan on checking the difference again on newer kernels (likely 5.15)
> after LPC this September.

Thanks, that would be useful!
Zhaoyang Huang Aug. 24, 2022, 9:34 a.m. UTC | #13
On Wed, Aug 24, 2022 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 24-08-22 10:23:14, Zhaoyang Huang wrote:
> > On Tue, Aug 23, 2022 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > One way to achieve that would be shaping the hierarchy the following way
> > >             root
> > >         /         \
> > > no_memcg[1]      memcg[2]
> > > ||||||||         |||||
> > > app_cgroups     app_cgroups
> > >
> > > with
> > > no_memcg.subtree_control = ""
> > > memcg.subtree_control = memory
> > >
> > > no?
> > According to my understanding, No as there will be no no_memcg. All
> > children groups under root would have its cgroup.controllers = memory
> > as long as root has memory enabled.
>
> Correct
>
> > Under this circumstance, all
> > descendants group under 'no_memcg' will charge memory to its parent
> > group.
>
> Correct. And why is that a problem? I thought you main concern was a per
> application LRUs. With the above configuration all app_cgroups which do
> not require an explicit memory control will share the same (no_memcg)
> LRU and they will be aged together.
I can't agree since this indicates the processes want memory free
depending on a specific hierarchy which could have been determined by
other subsys. IMHO, charging the pages which out of explicitly memory
enabled group to root could solve all of the above constraints with no
harm.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 24, 2022, 10:27 a.m. UTC | #14
On Wed 24-08-22 17:34:42, Zhaoyang Huang wrote:
> On Wed, Aug 24, 2022 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 24-08-22 10:23:14, Zhaoyang Huang wrote:
> > > On Tue, Aug 23, 2022 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > One way to achieve that would be shaping the hierarchy the following way
> > > >             root
> > > >         /         \
> > > > no_memcg[1]      memcg[2]
> > > > ||||||||         |||||
> > > > app_cgroups     app_cgroups
> > > >
> > > > with
> > > > no_memcg.subtree_control = ""
> > > > memcg.subtree_control = memory
> > > >
> > > > no?
> > > According to my understanding, No as there will be no no_memcg. All
> > > children groups under root would have its cgroup.controllers = memory
> > > as long as root has memory enabled.
> >
> > Correct
> >
> > > Under this circumstance, all
> > > descendants group under 'no_memcg' will charge memory to its parent
> > > group.
> >
> > Correct. And why is that a problem? I thought you main concern was a per
> > application LRUs. With the above configuration all app_cgroups which do
> > not require an explicit memory control will share the same (no_memcg)
> > LRU and they will be aged together.
> I can't agree since this indicates the processes want memory free
> depending on a specific hierarchy which could have been determined by
> other subsys.

I really fail to understand your requirements.

> IMHO, charging the pages which out of explicitly memory
> enabled group to root could solve all of the above constraints with no
> harm.

This would break the hierarchical property of the controller. So a
strong no no. Consider the following example

       root
	|
	A
controllers="memory"
memory.max = 1G
subtree_control=""
|      |      |
A1     A2     A3

althought A1,2,3 do not have their memory controller enabled explicitly
they are still constrained by the A memcg limit. If you just charge to
the root because it doesn't have memory controller enabled explicitly
then you just evade that constrain. I hope you understand why that is a
problem.
Suren Baghdasaryan Aug. 24, 2022, 5:13 p.m. UTC | #15
On Wed, Aug 24, 2022 at 12:59 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 23-08-22 09:21:16, Suren Baghdasaryan wrote:
> > On Tue, Aug 23, 2022 at 4:51 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 23-08-22 17:20:59, Zhaoyang Huang wrote:
> > > > On Tue, Aug 23, 2022 at 4:33 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Tue 23-08-22 14:03:04, Zhaoyang Huang wrote:
> > > > > > On Tue, Aug 23, 2022 at 1:21 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Tue 23-08-22 10:31:57, Zhaoyang Huang wrote:
> > > > > [...]
> > > > > > > > I would like to quote the comments from google side for more details
> > > > > > > > which can also be observed from different vendors.
> > > > > > > > "Also be advised that when you enable memcg v2 you will be using
> > > > > > > > per-app memcg configuration which implies noticeable overhead because
> > > > > > > > every app will have its own group. For example pagefault path will
> > > > > > > > regress by about 15%. And obviously there will be some memory overhead
> > > > > > > > as well. That's the reason we don't enable them in Android by
> > > > > > > > default."
> > > > > > >
> > > > > > > This should be reported and investigated. Because per-application memcg
> > > > > > > vs. memcg in general shouldn't make much of a difference from the
> > > > > > > performance side. I can see a potential performance impact for no-memcg
> > > > > > > vs. memcg case but even then 15% is quite a lot.
> > > > > > Less efficiency on memory reclaim caused by multi-LRU should be one of
> > > > > > the reason, which has been proved by comparing per-app memcg on/off.
> > > > > > Besides, theoretically workingset could also broken as LRU is too
> > > > > > short to compose workingset.
> > > > >
> > > > > Do you have any data to back these claims? Is this something that could
> > > > > be handled on the configuration level? E.g. by applying low limit
> > > > > protection to keep the workingset in the memory?
> > > > I don't think so. IMO, workingset works when there are pages evicted
> > > > from LRU and then refault which provide refault distance for pages.
> > > > Applying memcg's protection will have all LRU out of evicted which
> > > > make the mechanism fail.
> > >
> > > It is really hard to help you out without any actual data. The idea was
> > > though to use the low limit protection to adaptively configure
> > > respective memcgs to reduce refaults. You already have data about
> > > refaults ready so increasing the limit for often refaulting memcgs would
> > > reduce the trashing.
> >
> > Sorry for joining late.
> > A couple years ago I tested root-memcg vs per-app memcg configurations
> > on an Android phone. Here is a snapshot from my findings:
> >
> > Problem
> > =======
> > We see tangible increase in major faults and workingset refaults when
> > transitioning from root-only memory cgroup to per-application cgroups
> > on Android.
> >
> > Test results
> > ============
> > Results while running memory-demanding workload:
> > root memcg     per-app memcg     delta
> > workingset_refault 1771228 3874281 +118.73%
> > workingset_nodereclaim 4543 13928 +206.58%
> > pgpgin 13319208 20618944 +54.81%
> > pgpgout 1739552 3080664 +77.1%
> > pgpgoutclean 2616571 4805755 +83.67%
> > pswpin 359211 3918716 +990.92%
> > pswpout 1082238 5697463 +426.45%
> > pgfree 28978393 32531010 +12.26%
> > pgactivate 2586562 8731113 +237.56%
> > pgdeactivate 3811074 11670051 +206.21%
> > pgfault 38692510 46096963 +19.14%
> > pgmajfault 441288 4100020 +829.1%
> > pgrefill 4590451 12768165 +178.15%
> >
> > Results while running application cycle test (20 apps, 20 cycles):
> > root memcg     per-app memcg     delta
> > workingset_refault 10634691 11429223 +7.47%
> > workingset_nodereclaim 37477 59033 +57.52%
> > pgpgin 70662840 69569516 -1.55%
> > pgpgout 2605968 2695596 +3.44%
> > pgpgoutclean 13514955 14980610 +10.84%
> > pswpin 1489851 3780868 +153.77%
> > pswpout 4125547 8050819 +95.15%
> > pgfree 99823083 105104637 +5.29%
> > pgactivate 7685275 11647913 +51.56%
> > pgdeactivate 14193660 21459784 +51.19%
> > pgfault 89173166 100598528 +12.81%
> > pgmajfault 1856172 4227190 +127.74%
> > pgrefill 16643554 23203927 +39.42%
>
> Thanks! It would be interesting to see per memcg stats as well. Are
> there any outliers? Are there any signs of over-reclaim (more pages
> scanned & reclaimed by both kswapd and direct reclaim?

I don't have all the details from that study but will capture them
when I rerun the tests on newer kernels.

>
> > Tests were conducted on an Android phone with 4GB RAM.
> > Similar regression was reported a couple years ago here:
> > https://www.spinics.net/lists/linux-mm/msg121665.html
> >
> > I plan on checking the difference again on newer kernels (likely 5.15)
> > after LPC this September.
>
> Thanks, that would be useful!
>
> --
> Michal Hocko
> SUSE Labs
Zhaoyang Huang Aug. 25, 2022, 12:43 a.m. UTC | #16
On Wed, Aug 24, 2022 at 6:27 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 24-08-22 17:34:42, Zhaoyang Huang wrote:
> > On Wed, Aug 24, 2022 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 24-08-22 10:23:14, Zhaoyang Huang wrote:
> > > > On Tue, Aug 23, 2022 at 7:51 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > One way to achieve that would be shaping the hierarchy the following way
> > > > >             root
> > > > >         /         \
> > > > > no_memcg[1]      memcg[2]
> > > > > ||||||||         |||||
> > > > > app_cgroups     app_cgroups
> > > > >
> > > > > with
> > > > > no_memcg.subtree_control = ""
> > > > > memcg.subtree_control = memory
> > > > >
> > > > > no?
> > > > According to my understanding, No as there will be no no_memcg. All
> > > > children groups under root would have its cgroup.controllers = memory
> > > > as long as root has memory enabled.
> > >
> > > Correct
> > >
> > > > Under this circumstance, all
> > > > descendants group under 'no_memcg' will charge memory to its parent
> > > > group.
> > >
> > > Correct. And why is that a problem? I thought you main concern was a per
> > > application LRUs. With the above configuration all app_cgroups which do
> > > not require an explicit memory control will share the same (no_memcg)
> > > LRU and they will be aged together.
> > I can't agree since this indicates the processes want memory free
> > depending on a specific hierarchy which could have been determined by
> > other subsys.
>
> I really fail to understand your requirements.
>
> > IMHO, charging the pages which out of explicitly memory
> > enabled group to root could solve all of the above constraints with no
> > harm.
>
> This would break the hierarchical property of the controller. So a
> strong no no. Consider the following example
>
>        root
>         |
>         A
> controllers="memory"
> memory.max = 1G
> subtree_control=""
> |      |      |
> A1     A2     A3
>
> althought A1,2,3 do not have their memory controller enabled explicitly
> they are still constrained by the A memcg limit. If you just charge to
> the root because it doesn't have memory controller enabled explicitly
> then you just evade that constrain. I hope you understand why that is a
> problem.
IMO, A1-A3 should be explicitly enabled via echo "+memory" >
A/subtree_control since memory.max has been set. How should AA3
achieve the goal of compete with AA4,A1,A2 for cpu but keep memory out
of control under current policy?
        root
         |
         A
 controllers="memory,cpu"
 memory.max = 1G
 subtree_control="memory,cpu"
 |      |      |
 A1     A2     A3 subtree_control="cpu"
                      |      |
                    AA3   AA4 controllers="cpu"

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 25, 2022, 6:40 a.m. UTC | #17
On Thu 25-08-22 08:43:52, Zhaoyang Huang wrote:
> On Wed, Aug 24, 2022 at 6:27 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 24-08-22 17:34:42, Zhaoyang Huang wrote:
[...]
> > > IMHO, charging the pages which out of explicitly memory
> > > enabled group to root could solve all of the above constraints with no
> > > harm.
> >
> > This would break the hierarchical property of the controller. So a
> > strong no no. Consider the following example
> >
> >        root
> >         |
> >         A
> > controllers="memory"
> > memory.max = 1G
> > subtree_control=""
> > |      |      |
> > A1     A2     A3
> >
> > althought A1,2,3 do not have their memory controller enabled explicitly
> > they are still constrained by the A memcg limit. If you just charge to
> > the root because it doesn't have memory controller enabled explicitly
> > then you just evade that constrain. I hope you understand why that is a
> > problem.
> IMO, A1-A3 should be explicitly enabled via echo "+memory" >
> A/subtree_control since memory.max has been set.

You seem to be missing the point I've triedy to make here. It is not
about how the respective subtree should or shouldn't be configured. It
is about the hierarchical behavior. Configuration at a higher level should be
enforced under subtree no matter how that subtree decides to
enabled/disable controllers. Such subtree might have beeb delegated
and configured differently yet the constrain should be still applied.
See the point?

What you seem to be proposing is similar to cgroup v1 use_hierarchy
configuration. It has been decided that this is undesirable very early
in the cgroup v2 development because it make delegation impossible
(among other reasons).
Zhaoyang Huang Aug. 25, 2022, 8:34 a.m. UTC | #18
On Thu, Aug 25, 2022 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 25-08-22 08:43:52, Zhaoyang Huang wrote:
> > On Wed, Aug 24, 2022 at 6:27 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Wed 24-08-22 17:34:42, Zhaoyang Huang wrote:
> [...]
> > > > IMHO, charging the pages which out of explicitly memory
> > > > enabled group to root could solve all of the above constraints with no
> > > > harm.
> > >
> > > This would break the hierarchical property of the controller. So a
> > > strong no no. Consider the following example
> > >
> > >        root
> > >         |
> > >         A
> > > controllers="memory"
> > > memory.max = 1G
> > > subtree_control=""
> > > |      |      |
> > > A1     A2     A3
> > >
> > > althought A1,2,3 do not have their memory controller enabled explicitly
> > > they are still constrained by the A memcg limit. If you just charge to
> > > the root because it doesn't have memory controller enabled explicitly
> > > then you just evade that constrain. I hope you understand why that is a
> > > problem.
> > IMO, A1-A3 should be explicitly enabled via echo "+memory" >
> > A/subtree_control since memory.max has been set.
>
> You seem to be missing the point I've triedy to make here. It is not
> about how the respective subtree should or shouldn't be configured. It
> is about the hierarchical behavior. Configuration at a higher level should be
> enforced under subtree no matter how that subtree decides to
> enabled/disable controllers. Such subtree might have beeb delegated
> and configured differently yet the constrain should be still applied.
> See the point?
>
> What you seem to be proposing is similar to cgroup v1 use_hierarchy
> configuration. It has been decided that this is undesirable very early
> in the cgroup v2 development because it make delegation impossible
> (among other reasons).
Ok, I would like to know how AA3 achieve the goal of competing with A1
and A2 for cpu but keep memory out of control under current policy?
        root
         |
         A
 controllers="memory,cpu"
 memory.max = 1G
 subtree_control="memory,cpu"
 |      |            |
 A1     A2     A3 subtree_control="cpu"
                      |      |
                    AA3   AA4 controllers="cpu"
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 25, 2022, 8:50 a.m. UTC | #19
On Thu 25-08-22 16:34:04, Zhaoyang Huang wrote:
> On Thu, Aug 25, 2022 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 25-08-22 08:43:52, Zhaoyang Huang wrote:
> > > On Wed, Aug 24, 2022 at 6:27 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Wed 24-08-22 17:34:42, Zhaoyang Huang wrote:
> > [...]
> > > > > IMHO, charging the pages which out of explicitly memory
> > > > > enabled group to root could solve all of the above constraints with no
> > > > > harm.
> > > >
> > > > This would break the hierarchical property of the controller. So a
> > > > strong no no. Consider the following example
> > > >
> > > >        root
> > > >         |
> > > >         A
> > > > controllers="memory"
> > > > memory.max = 1G
> > > > subtree_control=""
> > > > |      |      |
> > > > A1     A2     A3
> > > >
> > > > althought A1,2,3 do not have their memory controller enabled explicitly
> > > > they are still constrained by the A memcg limit. If you just charge to
> > > > the root because it doesn't have memory controller enabled explicitly
> > > > then you just evade that constrain. I hope you understand why that is a
> > > > problem.
> > > IMO, A1-A3 should be explicitly enabled via echo "+memory" >
> > > A/subtree_control since memory.max has been set.
> >
> > You seem to be missing the point I've triedy to make here. It is not
> > about how the respective subtree should or shouldn't be configured. It
> > is about the hierarchical behavior. Configuration at a higher level should be
> > enforced under subtree no matter how that subtree decides to
> > enabled/disable controllers. Such subtree might have beeb delegated
> > and configured differently yet the constrain should be still applied.
> > See the point?
> >
> > What you seem to be proposing is similar to cgroup v1 use_hierarchy
> > configuration. It has been decided that this is undesirable very early
> > in the cgroup v2 development because it make delegation impossible
> > (among other reasons).
> Ok, I would like to know how AA3 achieve the goal of competing with A1
> and A2 for cpu but keep memory out of control under current policy?
>         root
>          |
>          A
>  controllers="memory,cpu"
>  memory.max = 1G
>  subtree_control="memory,cpu"
>  |      |            |
>  A1     A2     A3 subtree_control="cpu"
>                       |      |
>                     AA3   AA4 controllers="cpu"

I cannot really give you configuration you want without understanding
what you are trying to achieve and why do you need it that way. Really,
you can construct arbitrary hierarchies and only a very small subset of
them actually makes sense. So far you have been very terse at your goals
and intentions but rather demanding on the underlying mechanisms. This
doesn't really makes the discussion productive.

I hope you have at least understood that hierarchical property of the
cgroup v2 is a must and it won't change. If you need a help to construct
hierarchy for your specific workload I would recommend to clearly state
your final goal and reasoning behind. Maybe you will get a more specific
help that way. Good luck!
Zhaoyang Huang Aug. 25, 2022, 10:11 a.m. UTC | #20
On Thu, Aug 25, 2022 at 4:50 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 25-08-22 16:34:04, Zhaoyang Huang wrote:
> > On Thu, Aug 25, 2022 at 2:40 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 25-08-22 08:43:52, Zhaoyang Huang wrote:
> > > > On Wed, Aug 24, 2022 at 6:27 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Wed 24-08-22 17:34:42, Zhaoyang Huang wrote:
> > > [...]
> > > > > > IMHO, charging the pages which out of explicitly memory
> > > > > > enabled group to root could solve all of the above constraints with no
> > > > > > harm.
> > > > >
> > > > > This would break the hierarchical property of the controller. So a
> > > > > strong no no. Consider the following example
> > > > >
> > > > >        root
> > > > >         |
> > > > >         A
> > > > > controllers="memory"
> > > > > memory.max = 1G
> > > > > subtree_control=""
> > > > > |      |      |
> > > > > A1     A2     A3
> > > > >
> > > > > althought A1,2,3 do not have their memory controller enabled explicitly
> > > > > they are still constrained by the A memcg limit. If you just charge to
> > > > > the root because it doesn't have memory controller enabled explicitly
> > > > > then you just evade that constrain. I hope you understand why that is a
> > > > > problem.
> > > > IMO, A1-A3 should be explicitly enabled via echo "+memory" >
> > > > A/subtree_control since memory.max has been set.
> > >
> > > You seem to be missing the point I've triedy to make here. It is not
> > > about how the respective subtree should or shouldn't be configured. It
> > > is about the hierarchical behavior. Configuration at a higher level should be
> > > enforced under subtree no matter how that subtree decides to
> > > enabled/disable controllers. Such subtree might have beeb delegated
> > > and configured differently yet the constrain should be still applied.
> > > See the point?
> > >
> > > What you seem to be proposing is similar to cgroup v1 use_hierarchy
> > > configuration. It has been decided that this is undesirable very early
> > > in the cgroup v2 development because it make delegation impossible
> > > (among other reasons).
> > Ok, I would like to know how AA3 achieve the goal of competing with A1
> > and A2 for cpu but keep memory out of control under current policy?
> >         root
> >          |
> >          A
> >  controllers="memory,cpu"
> >  memory.max = 1G
> >  subtree_control="memory,cpu"
> >  |      |            |
> >  A1     A2     A3 subtree_control="cpu"
> >                       |      |
> >                     AA3   AA4 controllers="cpu"
>
> I cannot really give you configuration you want without understanding
> what you are trying to achieve and why do you need it that way. Really,
> you can construct arbitrary hierarchies and only a very small subset of
> them actually makes sense. So far you have been very terse at your goals
> and intentions but rather demanding on the underlying mechanisms. This
> doesn't really makes the discussion productive.
>
> I hope you have at least understood that hierarchical property of the
> cgroup v2 is a must and it won't change. If you need a help to construct
> hierarchy for your specific workload I would recommend to clearly state
> your final goal and reasoning behind. Maybe you will get a more specific
> help that way. Good luck!
Sorry for any misunderstanding among the discussion. My purpose is
real and simple as I have stated from the very beginning that I would
like to have per-app cgroup hierarchy to charge memory to root if it
is not enabled explicitly for memory. The reason has also been stated
like reclaim and workingset regression in suren's report. I don't
think my proposal will do any harm to current v2's mechanism besides
asking for the admin echo "+memory" to their desire group.
> --
> Michal Hocko
> SUSE Labs
Johannes Weiner Aug. 25, 2022, 1:35 p.m. UTC | #21
On Thu, Aug 25, 2022 at 06:11:09PM +0800, Zhaoyang Huang wrote:
> Sorry for any misunderstanding among the discussion. My purpose is
> real and simple as I have stated from the very beginning that I would
> like to have per-app cgroup hierarchy to charge memory to root if it
> is not enabled explicitly for memory. The reason has also been stated
> like reclaim and workingset regression in suren's report. I don't
> think my proposal will do any harm to current v2's mechanism besides
> asking for the admin echo "+memory" to their desire group.

It would permit children to escape parental control, completely
breaking the hierarchy and delegation guarantees currently in
place. That just isn't going to happen.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>

I would suggest focusing on finding the root cause for the reclaim
problem you are describing. It's possible all we need is a fix to
reclaim or the workingset code.
diff mbox series

Patch

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0d1ada8..747f0f4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -136,6 +136,7 @@  extern void cgroup_post_fork(struct task_struct *p,
 
 int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v);
 
+struct cgroup *get_task_cgroup(struct task_struct *task);
 /*
  * Iteration helpers and macros.
  */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccd..3f34c58 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1457,6 +1457,17 @@  struct cgroup *task_cgroup_from_root(struct task_struct *task,
 	return cset_cgroup_from_root(task_css_set(task), root);
 }
 
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+	struct cgroup *src_cgrp;
+	/* find the source cgroup */
+	spin_lock_irq(&css_set_lock);
+	src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+	spin_unlock_irq(&css_set_lock);
+
+	return src_cgrp;
+}
+
 /*
  * A task must hold cgroup_mutex to modify cgroups.
  *
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f..c81012b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -930,6 +930,7 @@  static __always_inline struct mem_cgroup *active_memcg(void)
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
 	struct mem_cgroup *memcg;
+	struct cgroup *cgrp;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -956,9 +957,11 @@  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 	}
 
 	rcu_read_lock();
+	cgrp = get_task_cgroup(rcu_dereference(mm->owner));
 	do {
 		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
-		if (unlikely(!memcg))
+		if (unlikely(!memcg)
+			|| !(cgroup_ss_mask(cgrp) & (1 << memory_cgrp_id)))
 			memcg = root_mem_cgroup;
 	} while (!css_tryget(&memcg->css));
 	rcu_read_unlock();