
mm: vmpressure: don't count userspace-induced reclaim as memory pressure

Message ID 20220623000530.1194226-1-yosryahmed@google.com (mailing list archive)
State New
Series: mm: vmpressure: don't count userspace-induced reclaim as memory pressure

Commit Message

Yosry Ahmed June 23, 2022, 12:05 a.m. UTC
Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
as memory pressure") made sure that memory reclaim that is induced by
userspace (limit-setting, proactive reclaim, ..) is not counted as
memory pressure for the purposes of psi.

Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
from try_charge() and reclaim_high() wrap the call to
try_to_free_mem_cgroup_pages() with psi handlers.

However, vmpressure is still counted in these cases where reclaim is
directly induced by userspace. This patch makes sure vmpressure is not
counted in those operations, in the same way as psi. Since vmpressure
calls need to happen deeper within the reclaim path, the same approach
could not be followed. Hence, a new "controlled" flag is added to struct
scan_control to flag a reclaim operation that is controlled by
userspace. This flag is set by limit-setting and proactive reclaim
operations, and is used to count vmpressure correctly.

To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
is effectively reverted and the same flag is used to control psi as
well.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/swap.h |  5 ++++-
 mm/memcontrol.c      | 40 ++++++++++++++++++++++------------------
 mm/vmscan.c          | 40 ++++++++++++++++++++++++++++++----------
 3 files changed, 56 insertions(+), 29 deletions(-)
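
A minimal standalone sketch of the approach described in the changelog, for readers who do not open the full diff. This is not the actual patch: the structures and helpers below are simplified stand-ins for the kernel ones, and it compiles as a plain userspace program. The point is only the control flow, namely that a "controlled" bit carried in scan_control lets the shared reclaim path skip pressure accounting for userspace-initiated reclaim.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's struct scan_control. */
struct scan_control {
	bool may_swap;
	bool controlled;	/* reclaim requested directly by userspace */
};

/* Stand-in for vmpressure(): only counted for non-controlled reclaim. */
static void vmpressure(const struct scan_control *sc,
		       unsigned long scanned, unsigned long reclaimed)
{
	if (sc->controlled)
		return;
	printf("vmpressure: scanned=%lu reclaimed=%lu\n", scanned, reclaimed);
}

/* Stand-in for the reclaim core shared by allocation-driven and
 * userspace-driven reclaim. */
static unsigned long do_try_to_free_pages(struct scan_control *sc)
{
	unsigned long scanned = 32, reclaimed = 16;	/* pretend work */

	vmpressure(sc, scanned, reclaimed);
	return reclaimed;
}

int main(void)
{
	struct scan_control background = { .may_swap = true };
	struct scan_control proactive  = { .may_swap = true, .controlled = true };

	do_try_to_free_pages(&background);	/* allocation pressure: counted */
	do_try_to_free_pages(&proactive);	/* limit-setting/memory.reclaim: not counted */
	return 0;
}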

Comments

Andrew Morton June 23, 2022, 12:16 a.m. UTC | #1
On Thu, 23 Jun 2022 00:05:30 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:

> Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
> as memory pressure") made sure that memory reclaim that is induced by
> userspace (limit-setting, proactive reclaim, ..) is not counted as
> memory pressure for the purposes of psi.
> 
> Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
> from try_charge() and reclaim_high() wrap the call to
> try_to_free_mem_cgroup_pages() with psi handlers.
> 
> However, vmpressure is still counted in these cases where reclaim is
> directly induced by userspace. This patch makes sure vmpressure is not
> counted in those operations, in the same way as psi. Since vmpressure
> calls need to happen deeper within the reclaim path, the same approach
> could not be followed. Hence, a new "controlled" flag is added to struct
> scan_control to flag a reclaim operation that is controlled by
> userspace. This flag is set by limit-setting and proactive reclaim
> operations, and is used to count vmpressure correctly.
> 
> To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
> ("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
> is effectively reverted and the same flag is used to control psi as
> well.

I'll await reviewer input on this, but I can always do trivia!

> @@ -3502,6 +3497,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>  {
>  	int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED |
> +		MEMCG_RECLAIM_MAY_SWAP;

If it doesn't fit, it's nicer to do

	unsigned int reclaim_options;
	...

	reclaim_options = MEMCG_RECLAIM_CONTROLLED | MEMCG_RECLAIM_MAY_SWAP;

(several places)

> @@ -3751,6 +3757,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.may_writepage = !laptop_mode,
>  		.may_unmap = 1,
>  		.may_swap = 1,
> +		.controlled = 0,
>  	};

Let's just skip all these initializations to zero and let the compiler
take care of it.
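
A tiny standalone example of the C rule behind this suggestion: with designated initializers, members that are not named are implicitly zero-initialized, so the explicit ".controlled = 0" lines add nothing. The struct here is a made-up stand-in, not the kernel's real scan_control.

#include <assert.h>

struct scan_control_model {
	int may_swap;
	int controlled;
};

int main(void)
{
	struct scan_control_model sc = {
		.may_swap = 1,
		/* .controlled not named: implicitly zero-initialized */
	};

	assert(sc.controlled == 0);
	return 0;
}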

> @@ -4095,6 +4112,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>  		.gfp_mask = GFP_KERNEL,
>  		.order = order,
>  		.may_unmap = 1,
> +		.controlled = 0,
>  	};
>  
>  	set_task_reclaim_state(current, &sc.reclaim_state);
> @@ -4555,6 +4573,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
>  		.may_unmap = 1,
>  		.may_swap = 1,
>  		.hibernation_mode = 1,
> +		.controlled = 0,
>  	};
>  	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
>  	unsigned long nr_reclaimed;
> @@ -4707,6 +4726,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
>  		.may_swap = 1,
>  		.reclaim_idx = gfp_zone(gfp_mask),
> +		.controlled = 0,
>  	};
>  	unsigned long pflags;
Yosry Ahmed June 23, 2022, 12:24 a.m. UTC | #2
On Wed, Jun 22, 2022 at 5:16 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 23 Jun 2022 00:05:30 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
> > as memory pressure") made sure that memory reclaim that is induced by
> > userspace (limit-setting, proactive reclaim, ..) is not counted as
> > memory pressure for the purposes of psi.
> >
> > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
> > from try_charge() and reclaim_high() wrap the call to
> > try_to_free_mem_cgroup_pages() with psi handlers.
> >
> > However, vmpressure is still counted in these cases where reclaim is
> > directly induced by userspace. This patch makes sure vmpressure is not
> > counted in those operations, in the same way as psi. Since vmpressure
> > calls need to happen deeper within the reclaim path, the same approach
> > could not be followed. Hence, a new "controlled" flag is added to struct
> > scan_control to flag a reclaim operation that is controlled by
> > userspace. This flag is set by limit-setting and proactive reclaim
> > operations, and is used to count vmpressure correctly.
> >
> > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
> > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
> > is effectively reverted and the same flag is used to control psi as
> > well.
>
> I'll await reviewer input on this, but I can always do trivia!

Thanks for taking a look so quickly, will address and send v2 soon!

>
> > @@ -3502,6 +3497,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> >  {
> >       int nr_retries = MAX_RECLAIM_RETRIES;
> > +     unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED |
> > +             MEMCG_RECLAIM_MAY_SWAP;
>
> If it doesn't fit, it's nicer to do
>
>         unsigned int reclaim_options;
>         ...
>
>         reclaim_options = MEMCG_RECLAIM_CONTROLLED | MEMCG_RECLAIM_MAY_SWAP;
>
> (several places)
>
> > @@ -3751,6 +3757,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >               .may_writepage = !laptop_mode,
> >               .may_unmap = 1,
> >               .may_swap = 1,
> > +             .controlled = 0,
> >       };
>
> Let's just skip all these initializations to zero, let the compiler take
> care of it.
>
> > @@ -4095,6 +4112,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >               .gfp_mask = GFP_KERNEL,
> >               .order = order,
> >               .may_unmap = 1,
> > +             .controlled = 0,
> >       };
> >
> >       set_task_reclaim_state(current, &sc.reclaim_state);
> > @@ -4555,6 +4573,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
> >               .may_unmap = 1,
> >               .may_swap = 1,
> >               .hibernation_mode = 1,
> > +             .controlled = 0,
> >       };
> >       struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
> >       unsigned long nr_reclaimed;
> > @@ -4707,6 +4726,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
> >               .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
> >               .may_swap = 1,
> >               .reclaim_idx = gfp_zone(gfp_mask),
> > +             .controlled = 0,
> >       };
> >       unsigned long pflags;
>
>
Michal Hocko June 23, 2022, 8:05 a.m. UTC | #3
On Thu 23-06-22 00:05:30, Yosry Ahmed wrote:
> Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
> as memory pressure") made sure that memory reclaim that is induced by
> userspace (limit-setting, proactive reclaim, ..) is not counted as
> memory pressure for the purposes of psi.
> 
> Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
> from try_charge() and reclaim_high() wrap the call to
> try_to_free_mem_cgroup_pages() with psi handlers.
> 
> However, vmpressure is still counted in these cases where reclaim is
> directly induced by userspace. This patch makes sure vmpressure is not
> counted in those operations, in the same way as psi. Since vmpressure
> calls need to happen deeper within the reclaim path, the same approach
> could not be followed. Hence, a new "controlled" flag is added to struct
> scan_control to flag a reclaim operation that is controlled by
> userspace. This flag is set by limit-setting and proactive reclaim
> operations, and is used to count vmpressure correctly.
> 
> To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
> ("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
> is effectively reverted and the same flag is used to control psi as
> well.

Why do we need to add this to a legacy interface now? Are there any
pre-existing users who have realized this is bugging them? Please be more
specific about the use case.
Yosry Ahmed June 23, 2022, 8:35 a.m. UTC | #4
On Thu, Jun 23, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 00:05:30, Yosry Ahmed wrote:
> > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
> > as memory pressure") made sure that memory reclaim that is induced by
> > userspace (limit-setting, proactive reclaim, ..) is not counted as
> > memory pressure for the purposes of psi.
> >
> > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
> > from try_charge() and reclaim_high() wrap the call to
> > try_to_free_mem_cgroup_pages() with psi handlers.
> >
> > However, vmpressure is still counted in these cases where reclaim is
> > directly induced by userspace. This patch makes sure vmpressure is not
> > counted in those operations, in the same way as psi. Since vmpressure
> > calls need to happen deeper within the reclaim path, the same approach
> > could not be followed. Hence, a new "controlled" flag is added to struct
> > scan_control to flag a reclaim operation that is controlled by
> > userspace. This flag is set by limit-setting and proactive reclaim
> > operations, and is used to count vmpressure correctly.
> >
> > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
> > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
> > is effectively reverted and the same flag is used to control psi as
> > well.
>
> Why do we need to add this is a legacy interface now? Are there any
> pre-existing users who realized this is bugging them? Please be more
> specific about the usecase.

Sorry if I wasn't clear enough. Unfortunately, we still have userspace
workloads at Google that use vmpressure notifications.

In our internal version of memory.reclaim that we recently upstreamed,
we do not account vmpressure during proactive reclaim (similar to how
psi is handled upstream). We want to make sure this behavior also
exists in the upstream version so that consolidating them does not
break our users who rely on vmpressure, who would otherwise start
seeing increased pressure due to proactive reclaim.

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 23, 2022, 9:42 a.m. UTC | #5
On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> On Thu, Jun 23, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 00:05:30, Yosry Ahmed wrote:
> > > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
> > > as memory pressure") made sure that memory reclaim that is induced by
> > > userspace (limit-setting, proactive reclaim, ..) is not counted as
> > > memory pressure for the purposes of psi.
> > >
> > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
> > > from try_charge() and reclaim_high() wrap the call to
> > > try_to_free_mem_cgroup_pages() with psi handlers.
> > >
> > > However, vmpressure is still counted in these cases where reclaim is
> > > directly induced by userspace. This patch makes sure vmpressure is not
> > > counted in those operations, in the same way as psi. Since vmpressure
> > > calls need to happen deeper within the reclaim path, the same approach
> > > could not be followed. Hence, a new "controlled" flag is added to struct
> > > scan_control to flag a reclaim operation that is controlled by
> > > userspace. This flag is set by limit-setting and proactive reclaim
> > > operations, and is used to count vmpressure correctly.
> > >
> > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
> > > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
> > > is effectively reverted and the same flag is used to control psi as
> > > well.
> >
> > Why do we need to add this is a legacy interface now? Are there any
> > pre-existing users who realized this is bugging them? Please be more
> > specific about the usecase.
> 
> Sorry if I wasn't clear enough. Unfortunately we still have userspace
> workloads at Google that use vmpressure notifications.
> 
> In our internal version of memory.reclaim that we recently upstreamed,
> we do not account vmpressure during proactive reclaim (similar to how
> psi is handled upstream). We want to make sure this behavior also
> exists in the upstream version so that consolidating them does not
> break our users who rely on vmpressure and will start seeing increased
> pressure due to proactive reclaim.

These are good reasons to have this patch in your tree. But why is this
patch beneficial for the upstream kernel? It clearly adds some code and
some special casing which will add a maintenance overhead.
Yosry Ahmed June 23, 2022, 4:22 p.m. UTC | #6
On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > On Thu, Jun 23, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 00:05:30, Yosry Ahmed wrote:
> > > > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
> > > > as memory pressure") made sure that memory reclaim that is induced by
> > > > userspace (limit-setting, proactive reclaim, ..) is not counted as
> > > > memory pressure for the purposes of psi.
> > > >
> > > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers
> > > > from try_charge() and reclaim_high() wrap the call to
> > > > try_to_free_mem_cgroup_pages() with psi handlers.
> > > >
> > > > However, vmpressure is still counted in these cases where reclaim is
> > > > directly induced by userspace. This patch makes sure vmpressure is not
> > > > counted in those operations, in the same way as psi. Since vmpressure
> > > > calls need to happen deeper within the reclaim path, the same approach
> > > > could not be followed. Hence, a new "controlled" flag is added to struct
> > > > scan_control to flag a reclaim operation that is controlled by
> > > > userspace. This flag is set by limit-setting and proactive reclaim
> > > > operations, and is used to count vmpressure correctly.
> > > >
> > > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9
> > > > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure")
> > > > is effectively reverted and the same flag is used to control psi as
> > > > well.
> > >
> > > Why do we need to add this is a legacy interface now? Are there any
> > > pre-existing users who realized this is bugging them? Please be more
> > > specific about the usecase.
> >
> > Sorry if I wasn't clear enough. Unfortunately we still have userspace
> > workloads at Google that use vmpressure notifications.
> >
> > In our internal version of memory.reclaim that we recently upstreamed,
> > we do not account vmpressure during proactive reclaim (similar to how
> > psi is handled upstream). We want to make sure this behavior also
> > exists in the upstream version so that consolidating them does not
> > break our users who rely on vmpressure and will start seeing increased
> > pressure due to proactive reclaim.
>
> These are good reasons to have this patch in your tree. But why is this
> patch benefitial for the upstream kernel? It clearly adds some code and
> some special casing which will add a maintenance overhead.

It is not just Google; any existing vmpressure users will start seeing
false pressure notifications with memory.reclaim. The main goal of the
patch is to make sure memory.reclaim does not break pre-existing users
of vmpressure, and doing it in a way that is consistent with psi makes
sense.

> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 23, 2022, 4:37 p.m. UTC | #7
On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
[...]
> > > In our internal version of memory.reclaim that we recently upstreamed,
> > > we do not account vmpressure during proactive reclaim (similar to how
> > > psi is handled upstream). We want to make sure this behavior also
> > > exists in the upstream version so that consolidating them does not
> > > break our users who rely on vmpressure and will start seeing increased
> > > pressure due to proactive reclaim.
> >
> > These are good reasons to have this patch in your tree. But why is this
> > patch benefitial for the upstream kernel? It clearly adds some code and
> > some special casing which will add a maintenance overhead.
> 
> It is not just Google, any existing vmpressure users will start seeing
> false pressure notifications with memory.reclaim. The main goal of the
> patch is to make sure memory.reclaim does not break pre-existing users
> of vmpressure, and doing it in a way that is consistent with psi makes
> sense.

memory.reclaim is a v2-only feature, and cgroup v2 doesn't have a
vmpressure interface. So I do not see how pre-existing users of the
upstream kernel can see any breakage.
Shakeel Butt June 23, 2022, 4:42 p.m. UTC | #8
On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> [...]
> > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > psi is handled upstream). We want to make sure this behavior also
> > > > exists in the upstream version so that consolidating them does not
> > > > break our users who rely on vmpressure and will start seeing increased
> > > > pressure due to proactive reclaim.
> > >
> > > These are good reasons to have this patch in your tree. But why is this
> > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > some special casing which will add a maintenance overhead.
> >
> > It is not just Google, any existing vmpressure users will start seeing
> > false pressure notifications with memory.reclaim. The main goal of the
> > patch is to make sure memory.reclaim does not break pre-existing users
> > of vmpressure, and doing it in a way that is consistent with psi makes
> > sense.
>
> memory.reclaim is v2 only feature which doesn't have vmpressure
> interface. So I do not see how pre-existing users of the upstream kernel
> can see any breakage.
>

Please note that vmpressure is still being used in v2 by the
networking layer (see mem_cgroup_under_socket_pressure()) for
detecting memory pressure.

Though IMO we should deprecate vmpressure altogether.
Yosry Ahmed June 23, 2022, 4:49 p.m. UTC | #9
On Thu, Jun 23, 2022 at 9:42 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > [...]
> > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > exists in the upstream version so that consolidating them does not
> > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > pressure due to proactive reclaim.
> > > >
> > > > These are good reasons to have this patch in your tree. But why is this
> > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > some special casing which will add a maintenance overhead.
> > >
> > > It is not just Google, any existing vmpressure users will start seeing
> > > false pressure notifications with memory.reclaim. The main goal of the
> > > patch is to make sure memory.reclaim does not break pre-existing users
> > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > sense.
> >
> > memory.reclaim is v2 only feature which doesn't have vmpressure
> > interface. So I do not see how pre-existing users of the upstream kernel
> > can see any breakage.
> >
>
> Please note that vmpressure is still being used in v2 by the
> networking layer (see mem_cgroup_under_socket_pressure()) for
> detecting memory pressure.
>
> Though IMO we should deprecate vmpressure altogether.

Thanks, Shakeel, for mentioning that; I was just about to. Although I
agree vmpressure should be deprecated at some point, the current state
is that memory.reclaim will give incorrect vmpressure signals.

IMO psi and vmpressure (though legacy) both signify memory pressure
and should be consistent as to what is being accounted for.
Michal Hocko June 23, 2022, 5:04 p.m. UTC | #10
On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > [...]
> > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > exists in the upstream version so that consolidating them does not
> > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > pressure due to proactive reclaim.
> > > >
> > > > These are good reasons to have this patch in your tree. But why is this
> > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > some special casing which will add a maintenance overhead.
> > >
> > > It is not just Google, any existing vmpressure users will start seeing
> > > false pressure notifications with memory.reclaim. The main goal of the
> > > patch is to make sure memory.reclaim does not break pre-existing users
> > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > sense.
> >
> > memory.reclaim is v2 only feature which doesn't have vmpressure
> > interface. So I do not see how pre-existing users of the upstream kernel
> > can see any breakage.
> >
> 
> Please note that vmpressure is still being used in v2 by the
> networking layer (see mem_cgroup_under_socket_pressure()) for
> detecting memory pressure.

I have missed this. It is hidden quite well. I thought that v2 was
completely vmpressure-free. I have to admit that the effect of
mem_cgroup_under_socket_pressure is not really clear to me. Not to
mention whether it should or shouldn't be triggered for user-triggered
memory reclaim. So this would really need some explanation.
 
> Though IMO we should deprecate vmpressure altogether.

Yes, it should really be limited to v1. But as I've said, the effect on
mem_cgroup_under_socket_pressure is not really clear to me. It really
seems the v2 support was introduced deliberately.
Yosry Ahmed June 23, 2022, 5:26 p.m. UTC | #11
On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > [...]
> > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > exists in the upstream version so that consolidating them does not
> > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > pressure due to proactive reclaim.
> > > > >
> > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > some special casing which will add a maintenance overhead.
> > > >
> > > > It is not just Google, any existing vmpressure users will start seeing
> > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > sense.
> > >
> > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > interface. So I do not see how pre-existing users of the upstream kernel
> > > can see any breakage.
> > >
> >
> > Please note that vmpressure is still being used in v2 by the
> > networking layer (see mem_cgroup_under_socket_pressure()) for
> > detecting memory pressure.
>
> I have missed this. It is hidden quite good. I thought that v2 is
> completely vmpressure free. I have to admit that the effect of
> mem_cgroup_under_socket_pressure is not really clear to me. Not to
> mention whether it should or shouldn't be triggered for the user
> triggered memory reclaim. So this would really need some explanation.

vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
memcontrol: hook up vmpressure to socket pressure"). A quick look at
the commit log and the code suggests that this is used all over the
socket and TCP code to throttle the memory consumption of the
networking layer if we are under pressure.

However, for proactive reclaim like memory.reclaim, the target is to
probe the memcg for cold memory. Reclaiming such memory should not
have a visible effect on the workload's performance. I don't think that
any network throttling side effects are correct here.
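
As a rough standalone illustration of that concern (not kernel code; the names below are simplified stand-ins for the real mem_cgroup_under_socket_pressure() machinery): if userspace-controlled reclaim were allowed to raise the memcg's socket-pressure signal, networking paths would back off even though the workload is not actually short on memory.

#include <stdbool.h>
#include <stdio.h>

struct memcg_model {
	bool socket_pressure;	/* fed from vmpressure in the real kernel */
};

/* Models mem_cgroup_under_socket_pressure(): networking callers reduce
 * memory consumption and back off when this returns true. */
static bool under_socket_pressure(const struct memcg_model *memcg)
{
	return memcg->socket_pressure;
}

/* Models the reclaim path with the proposed flag: reclaim requested by
 * userspace no longer feeds the signal consumed by networking. */
static void reclaim(struct memcg_model *memcg, bool controlled)
{
	if (!controlled)
		memcg->socket_pressure = true;
}

int main(void)
{
	struct memcg_model memcg = { .socket_pressure = false };

	reclaim(&memcg, true);	/* limit-setting or memory.reclaim */
	printf("throttle networking? %s\n",
	       under_socket_pressure(&memcg) ? "yes" : "no");	/* prints "no" */
	return 0;
}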

>
> > Though IMO we should deprecate vmpressure altogether.
>
> Yes it should be really limited to v1. But as I've said the effect on
> mem_cgroup_under_socket_pressure is not really clear to me. It really
> seems the v2 support has been introduced deliberately.
>
> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan June 24, 2022, 10:10 p.m. UTC | #12
On Thu, Jun 23, 2022 at 10:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > > [...]
> > > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > > exists in the upstream version so that consolidating them does not
> > > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > > pressure due to proactive reclaim.
> > > > > >
> > > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > > some special casing which will add a maintenance overhead.
> > > > >
> > > > > It is not just Google, any existing vmpressure users will start seeing
> > > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > > sense.
> > > >
> > > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > > interface. So I do not see how pre-existing users of the upstream kernel
> > > > can see any breakage.
> > > >
> > >
> > > Please note that vmpressure is still being used in v2 by the
> > > networking layer (see mem_cgroup_under_socket_pressure()) for
> > > detecting memory pressure.
> >
> > I have missed this. It is hidden quite good. I thought that v2 is
> > completely vmpressure free. I have to admit that the effect of
> > mem_cgroup_under_socket_pressure is not really clear to me. Not to
> > mention whether it should or shouldn't be triggered for the user
> > triggered memory reclaim. So this would really need some explanation.
>
> vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
> memcontrol: hook up vmpressure to socket pressure"). A quick look at
> the commit log and the code suggests that this is used all over the
> socket and tcp code to throttles the memory consumption of the
> networking layer if we are under pressure.
>
> However, for proactive reclaim like memory.reclaim, the target is to
> probe the memcg for cold memory. Reclaiming such memory should not
> have a visible effect on the workload performance. I don't think that
> any network throttling side effects are correct here.

IIUC, this change is fixing two mechanisms during userspace-induced
memory pressure:
1. psi accounting, which I think is not controversial and makes sense to me;
2. vmpressure signal, which is a "kinda" obsolete interface and might
be viewed as controversial.
I would suggest splitting the patch into two, first to fix psi
accounting and second to fix vmpressure signal. This way the first one
(probably the bigger of the two) can be reviewed and accepted easily
while debates continue on the second one.

>
> >
> > > Though IMO we should deprecate vmpressure altogether.
> >
> > Yes it should be really limited to v1. But as I've said the effect on
> > mem_cgroup_under_socket_pressure is not really clear to me. It really
> > seems the v2 support has been introduced deliberately.
> >
> > --
> > Michal Hocko
> > SUSE Labs
Yosry Ahmed June 24, 2022, 10:13 p.m. UTC | #13
On Fri, Jun 24, 2022 at 3:10 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Jun 23, 2022 at 10:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > > > [...]
> > > > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > > > exists in the upstream version so that consolidating them does not
> > > > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > > > pressure due to proactive reclaim.
> > > > > > >
> > > > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > > > some special casing which will add a maintenance overhead.
> > > > > >
> > > > > > It is not just Google, any existing vmpressure users will start seeing
> > > > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > > > sense.
> > > > >
> > > > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > > > interface. So I do not see how pre-existing users of the upstream kernel
> > > > > can see any breakage.
> > > > >
> > > >
> > > > Please note that vmpressure is still being used in v2 by the
> > > > networking layer (see mem_cgroup_under_socket_pressure()) for
> > > > detecting memory pressure.
> > >
> > > I have missed this. It is hidden quite good. I thought that v2 is
> > > completely vmpressure free. I have to admit that the effect of
> > > mem_cgroup_under_socket_pressure is not really clear to me. Not to
> > > mention whether it should or shouldn't be triggered for the user
> > > triggered memory reclaim. So this would really need some explanation.
> >
> > vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
> > memcontrol: hook up vmpressure to socket pressure"). A quick look at
> > the commit log and the code suggests that this is used all over the
> > socket and tcp code to throttles the memory consumption of the
> > networking layer if we are under pressure.
> >
> > However, for proactive reclaim like memory.reclaim, the target is to
> > probe the memcg for cold memory. Reclaiming such memory should not
> > have a visible effect on the workload performance. I don't think that
> > any network throttling side effects are correct here.
>
> IIUC, this change is fixing two mechanisms during userspace-induced
> memory pressure:
> 1. psi accounting, which I think is not controversial and makes sense to me;
> 2. vmpressure signal, which is a "kinda" obsolete interface and might
> be viewed as controversial.
> I would suggest splitting the patch into two, first to fix psi
> accounting and second to fix vmpressure signal. This way the first one
> (probably the bigger of the two) can be reviewed and accepted easily
> while debates continue on the second one.

This change should be a NOP for psi. psi was already fixed by
e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
as memory pressure") by Johannes a while ago. This patch does the same
for vmpressure, but in a different way, as the approach taken by
e22c6ed90aa9 cannot be reused there.

The changes you are seeing in this patch for psi are basically
reverting e22c6ed90aa9 and using the newly introduced flag that
handles vmpressure to handle psi as well, to avoid having two separate
ways of suppressing memory pressure accounting during userspace-induced
reclaim.

>
> >
> > >
> > > > Though IMO we should deprecate vmpressure altogether.
> > >
> > > Yes it should be really limited to v1. But as I've said the effect on
> > > mem_cgroup_under_socket_pressure is not really clear to me. It really
> > > seems the v2 support has been introduced deliberately.
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
Suren Baghdasaryan June 24, 2022, 10:41 p.m. UTC | #14
On Fri, Jun 24, 2022 at 3:14 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Jun 24, 2022 at 3:10 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Jun 23, 2022 at 10:26 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > > > > exists in the upstream version so that consolidating them does not
> > > > > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > > > > pressure due to proactive reclaim.
> > > > > > > >
> > > > > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > > > > some special casing which will add a maintenance overhead.
> > > > > > >
> > > > > > > It is not just Google, any existing vmpressure users will start seeing
> > > > > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > > > > sense.
> > > > > >
> > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > > > > interface. So I do not see how pre-existing users of the upstream kernel
> > > > > > can see any breakage.
> > > > > >
> > > > >
> > > > > Please note that vmpressure is still being used in v2 by the
> > > > > networking layer (see mem_cgroup_under_socket_pressure()) for
> > > > > detecting memory pressure.
> > > >
> > > > I have missed this. It is hidden quite good. I thought that v2 is
> > > > completely vmpressure free. I have to admit that the effect of
> > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to
> > > > mention whether it should or shouldn't be triggered for the user
> > > > triggered memory reclaim. So this would really need some explanation.
> > >
> > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
> > > memcontrol: hook up vmpressure to socket pressure"). A quick look at
> > > the commit log and the code suggests that this is used all over the
> > > socket and tcp code to throttles the memory consumption of the
> > > networking layer if we are under pressure.
> > >
> > > However, for proactive reclaim like memory.reclaim, the target is to
> > > probe the memcg for cold memory. Reclaiming such memory should not
> > > have a visible effect on the workload performance. I don't think that
> > > any network throttling side effects are correct here.
> >
> > IIUC, this change is fixing two mechanisms during userspace-induced
> > memory pressure:
> > 1. psi accounting, which I think is not controversial and makes sense to me;
> > 2. vmpressure signal, which is a "kinda" obsolete interface and might
> > be viewed as controversial.
> > I would suggest splitting the patch into two, first to fix psi
> > accounting and second to fix vmpressure signal. This way the first one
> > (probably the bigger of the two) can be reviewed and accepted easily
> > while debates continue on the second one.
>
> This change should be NOP for psi. psi was already fixed by
> e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim
> as memory pressure") by Johannes a while ago. This patch does the same
> for vmpressure, but in a different way, as the same approach of
> e22c6ed90aa9 cannot be used.
>
> The changes you are seeing in this patch for psi are basically
> reverting e22c6ed90aa9 and using the newly introduced flag that
> handles vmpressure to handle psi as well, to avoid having two separate
> ways to address accounting memory pressure during userspace-induced
> reclaim.

Ah, I see. Thanks for clarifying that.

>
> >
> > >
> > > >
> > > > > Though IMO we should deprecate vmpressure altogether.
> > > >
> > > > Yes it should be really limited to v1. But as I've said the effect on
> > > > mem_cgroup_under_socket_pressure is not really clear to me. It really
> > > > seems the v2 support has been introduced deliberately.
> > > >
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
Michal Hocko June 27, 2022, 8:25 a.m. UTC | #15
On Thu 23-06-22 10:26:11, Yosry Ahmed wrote:
> On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > > [...]
> > > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > > exists in the upstream version so that consolidating them does not
> > > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > > pressure due to proactive reclaim.
> > > > > >
> > > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > > some special casing which will add a maintenance overhead.
> > > > >
> > > > > It is not just Google, any existing vmpressure users will start seeing
> > > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > > sense.
> > > >
> > > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > > interface. So I do not see how pre-existing users of the upstream kernel
> > > > can see any breakage.
> > > >
> > >
> > > Please note that vmpressure is still being used in v2 by the
> > > networking layer (see mem_cgroup_under_socket_pressure()) for
> > > detecting memory pressure.
> >
> > I have missed this. It is hidden quite good. I thought that v2 is
> > completely vmpressure free. I have to admit that the effect of
> > mem_cgroup_under_socket_pressure is not really clear to me. Not to
> > mention whether it should or shouldn't be triggered for the user
> > triggered memory reclaim. So this would really need some explanation.
> 
> vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
> memcontrol: hook up vmpressure to socket pressure"). A quick look at
> the commit log and the code suggests that this is used all over the
> socket and tcp code to throttles the memory consumption of the
> networking layer if we are under pressure.
> 
> However, for proactive reclaim like memory.reclaim, the target is to
> probe the memcg for cold memory. Reclaiming such memory should not
> have a visible effect on the workload performance. I don't think that
> any network throttling side effects are correct here.

Please describe the user-visible effects of this change. IIUC this is
changing the vmpressure semantics for pre-existing users (v1 when setting
the hard limit, for example), and it really should be explained why
this is good for them after all those years. I do not see any actual bug
being described explicitly, so please make sure this is all properly
documented.
Yosry Ahmed June 27, 2022, 8:39 a.m. UTC | #16
On Mon, Jun 27, 2022 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 23-06-22 10:26:11, Yosry Ahmed wrote:
> > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > > > [...]
> > > > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > > > exists in the upstream version so that consolidating them does not
> > > > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > > > pressure due to proactive reclaim.
> > > > > > >
> > > > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > > > some special casing which will add a maintenance overhead.
> > > > > >
> > > > > > It is not just Google, any existing vmpressure users will start seeing
> > > > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > > > sense.
> > > > >
> > > > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > > > interface. So I do not see how pre-existing users of the upstream kernel
> > > > > can see any breakage.
> > > > >
> > > >
> > > > Please note that vmpressure is still being used in v2 by the
> > > > networking layer (see mem_cgroup_under_socket_pressure()) for
> > > > detecting memory pressure.
> > >
> > > I have missed this. It is hidden quite good. I thought that v2 is
> > > completely vmpressure free. I have to admit that the effect of
> > > mem_cgroup_under_socket_pressure is not really clear to me. Not to
> > > mention whether it should or shouldn't be triggered for the user
> > > triggered memory reclaim. So this would really need some explanation.
> >
> > vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
> > memcontrol: hook up vmpressure to socket pressure"). A quick look at
> > the commit log and the code suggests that this is used all over the
> > socket and tcp code to throttles the memory consumption of the
> > networking layer if we are under pressure.
> >
> > However, for proactive reclaim like memory.reclaim, the target is to
> > probe the memcg for cold memory. Reclaiming such memory should not
> > have a visible effect on the workload performance. I don't think that
> > any network throttling side effects are correct here.
>
> Please describe the user visible effects of this change. IIUC this is
> changing the vmpressure semantic for pre-existing users (v1 when setting
> the hard limit for example) and it really should be explained why
> this is good for them after those years. I do not see any actual bug
> being described explicitly so please make sure this is all properly
> documented.

In cgroup v1, user-induced reclaim that is caused by limit-setting (or
memory.reclaim for systems that choose to expose it in cgroup v1) will
no longer cause vmpressure notifications, which makes the vmpressure
behavior consistent with the current psi behavior.

In cgroup v2, user-induced reclaim (limit-setting, memory.reclaim, ..)
would currently cause the networking layer to perceive the memcg as
being under memory pressure, reducing memory consumption and possibly
causing throttling. This patch makes the networking layer only
perceive the memcg as being under pressure when the "pressure" is
caused by increased memory usage, not limit-setting or proactive
reclaim, which also makes the definition of memcg memory pressure
consistent with psi today.

In short, the purpose of this patch is to unify the definition of
memcg memory pressure across psi and vmpressure (which indirectly also
determines what the networking layer perceives as memcg memory
pressure). If this sounds good to you, I can add this explanation to
the commit log, and anywhere else you see appropriate in the
code/docs.

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 27, 2022, 9:20 a.m. UTC | #17
On Mon 27-06-22 01:39:46, Yosry Ahmed wrote:
> On Mon, Jun 27, 2022 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 23-06-22 10:26:11, Yosry Ahmed wrote:
> > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > >
> > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > > > > exists in the upstream version so that consolidating them does not
> > > > > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > > > > pressure due to proactive reclaim.
> > > > > > > >
> > > > > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > > > > some special casing which will add a maintenance overhead.
> > > > > > >
> > > > > > > It is not just Google, any existing vmpressure users will start seeing
> > > > > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > > > > sense.
> > > > > >
> > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > > > > interface. So I do not see how pre-existing users of the upstream kernel
> > > > > > can see any breakage.
> > > > > >
> > > > >
> > > > > Please note that vmpressure is still being used in v2 by the
> > > > > networking layer (see mem_cgroup_under_socket_pressure()) for
> > > > > detecting memory pressure.
> > > >
> > > > I have missed this. It is hidden quite good. I thought that v2 is
> > > > completely vmpressure free. I have to admit that the effect of
> > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to
> > > > mention whether it should or shouldn't be triggered for the user
> > > > triggered memory reclaim. So this would really need some explanation.
> > >
> > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
> > > memcontrol: hook up vmpressure to socket pressure"). A quick look at
> > > the commit log and the code suggests that this is used all over the
> > > socket and tcp code to throttles the memory consumption of the
> > > networking layer if we are under pressure.
> > >
> > > However, for proactive reclaim like memory.reclaim, the target is to
> > > probe the memcg for cold memory. Reclaiming such memory should not
> > > have a visible effect on the workload performance. I don't think that
> > > any network throttling side effects are correct here.
> >
> > Please describe the user visible effects of this change. IIUC this is
> > changing the vmpressure semantic for pre-existing users (v1 when setting
> > the hard limit for example) and it really should be explained why
> > this is good for them after those years. I do not see any actual bug
> > being described explicitly so please make sure this is all properly
> > documented.
> 
> In cgroup v1, user-induced reclaim that is caused by limit-setting (or
> memory.reclaim for systems that choose to expose it in cgroup v1) will
> no longer cause vmpressure notifications, which makes the vmpressure
> behavior consistent with the current psi behavior.

Yes, it makes the behavior consistent with PSI. But is this what existing
users really want or need? This is a user-visible, long-term behavior
change for a legacy interface, and there should be a very good reason to
change that.

> In cgroup v2, user-induced reclaim (limit-setting, memory.reclaim, ..)
> would currently cause the networking layer to perceive the memcg as
> being under memory pressure, reducing memory consumption and possibly
> causing throttling. This patch makes the networking layer only
> perceive the memcg as being under pressure when the "pressure" is
> caused by increased memory usage, not limit-setting or proactive
> reclaim, which also makes the definition of memcg memory pressure
> consistent with psi today.

I do understand the argument about proactive reclaim.
memory.reclaim is a new interface, and it a) makes sense to exclude it
from the different memory pressure notification interfaces, and b) there
are unlikely to be many user applications depending on its exact behavior,
so changes are still rather low on the risk scale.

> In short, the purpose of this patch is to unify the definition of
> memcg memory pressure across psi and vmpressure (which indirectly also
> defines the definition of memcg memory pressure for the networking
> layer). If this sounds good to you, I can add this explanation to the
> commit log, and possibly anywhere you see appropriate in the
> code/docs.

Consistency on its own sounds like a very weak argument for changing
long-term behavior. I do not really see any serious arguments or
evaluation of what kind of fallout this change can have on old applications
that are still sticking with v1.

After it has been made clear that vmpressure is still in play for
proactive reclaim in v2, I do agree that this is likely something we
want to have addressed. But I wouldn't touch the v1 semantics, as that
doesn't really buy much and it can potentially break existing users.
Yosry Ahmed June 27, 2022, 9:39 a.m. UTC | #18
On Mon, Jun 27, 2022 at 2:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 27-06-22 01:39:46, Yosry Ahmed wrote:
> > On Mon, Jun 27, 2022 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 23-06-22 10:26:11, Yosry Ahmed wrote:
> > > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote:
> > > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote:
> > > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote:
> > > > > > > [...]
> > > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed,
> > > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how
> > > > > > > > > > psi is handled upstream). We want to make sure this behavior also
> > > > > > > > > > exists in the upstream version so that consolidating them does not
> > > > > > > > > > break our users who rely on vmpressure and will start seeing increased
> > > > > > > > > > pressure due to proactive reclaim.
> > > > > > > > >
> > > > > > > > > These are good reasons to have this patch in your tree. But why is this
> > > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and
> > > > > > > > > some special casing which will add a maintenance overhead.
> > > > > > > >
> > > > > > > > It is not just Google, any existing vmpressure users will start seeing
> > > > > > > > false pressure notifications with memory.reclaim. The main goal of the
> > > > > > > > patch is to make sure memory.reclaim does not break pre-existing users
> > > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes
> > > > > > > > sense.
> > > > > > >
> > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure
> > > > > > > interface. So I do not see how pre-existing users of the upstream kernel
> > > > > > > can see any breakage.
> > > > > > >
> > > > > >
> > > > > > Please note that vmpressure is still being used in v2 by the
> > > > > > networking layer (see mem_cgroup_under_socket_pressure()) for
> > > > > > detecting memory pressure.
> > > > >
> > > > > I have missed this. It is hidden quite well. I thought that v2 is
> > > > > completely vmpressure free. I have to admit that the effect of
> > > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to
> > > > > mention whether it should or shouldn't be triggered for the user
> > > > > triggered memory reclaim. So this would really need some explanation.
> > > >
> > > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm:
> > > > memcontrol: hook up vmpressure to socket pressure"). A quick look at
> > > > the commit log and the code suggests that this is used all over the
> > > > socket and tcp code to throttle the memory consumption of the
> > > > networking layer if we are under pressure.
> > > >
> > > > However, for proactive reclaim like memory.reclaim, the target is to
> > > > probe the memcg for cold memory. Reclaiming such memory should not
> > > > have a visible effect on the workload performance. I don't think that
> > > > any network throttling side effects are correct here.
> > >
> > > Please describe the user visible effects of this change. IIUC this is
> > > changing the vmpressure semantic for pre-existing users (v1 when setting
> > > the hard limit for example) and it really should be explained why
> > > this is good for them after all these years. I do not see any actual bug
> > > being described explicitly so please make sure this is all properly
> > > documented.
> >
> > In cgroup v1, user-induced reclaim that is caused by limit-setting (or
> > memory.reclaim for systems that choose to expose it in cgroup v1) will
> > no longer cause vmpressure notifications, which makes the vmpressure
> > behavior consistent with the current psi behavior.
>
> Yes it makes the behavior consistent with PSI. But is this what existing
> users really want or need? This is a user visible long term behavior
> change for a legacy interface and there should be a very good reason to
> change that.
>
> > In cgroup v2, user-induced reclaim (limit-setting, memory.reclaim, ..)
> > would currently cause the networking layer to perceive the memcg as
> > being under memory pressure, reducing memory consumption and possibly
> > causing throttling. This patch makes the networking layer only
> > perceive the memcg as being under pressure when the "pressure" is
> > caused by increased memory usage, not limit-setting or proactive
> > reclaim, which also makes the definition of memcg memory pressure
> > consistent with psi today.
>
> I do understand the argument about the pro-active reclaim.
> memory.reclaim is a new interface and it a) makes sense to exclude it
> from different memory pressure notification interfaces and b) there are
> unlikely to be many user applications depending on the exact behavior, so
> changes are still rather low on the risk scale.
>
> > In short, the purpose of this patch is to unify the definition of
> > memcg memory pressure across psi and vmpressure (which indirectly also
> > defines the definition of memcg memory pressure for the networking
> > layer). If this sounds good to you, I can add this explanation to the
> > commit log, and possibly anywhere you see appropriate in the
> > code/docs.
>
> The consistency on its own sounds like a very weak argument to change
> long-term behavior. I do not really see any serious arguments or any
> evaluation of what kind of fallout this change can have on old applications
> that are still sticking with v1.
>
> After it has been made clear that vmpressure is still used for
> pro-active reclaim in v2, I do agree that this is likely something we
> want to have addressed. But I wouldn't touch v1 semantics as this
> doesn't really buy much and it can potentially break existing users.
>

Understood, and fair enough. There are two behavioral changes in this patch:

(a) Do not count vmpressure for mem_cgroup_resize_max() and
mem_cgroup_force_empty() in v1.
(b) Do not count vmpressure (consequently,
mem_cgroup_under_socket_pressure()) in v2 where psi is not counted
(writing to memory.max, memory.high, and memory.reclaim).

Do you want us to drop (a) and keep (b)? Or do you want to further
narrow (b) so that the change only covers proactive reclaim through
memory.reclaim (IOW, keep socket pressure on limit-setting although it
is not considered pressure in terms of psi)?
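
For reference, the user-visible knobs behind (a) and (b), as I read the write
handlers: mem_cgroup_resize_max() is reached by writing memory.limit_in_bytes
or memory.memsw.limit_in_bytes and mem_cgroup_force_empty() by writing
memory.force_empty in v1, while (b) covers the v2 memory.max, memory.high and
memory.reclaim writes. A rough sketch of the operations in question (the
cgroup paths and values are made up):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Write "val" into control file "file" under cgroup directory "cg". */
static int cg_write(const char *cg, const char *file, const char *val)
{
    char path[256];
    int fd, ret;

    snprintf(path, sizeof(path), "%s/%s", cg, file);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ret = write(fd, val, strlen(val)) < 0 ? -1 : 0;
    close(fd);
    return ret;
}

int main(void)
{
    /* (a) cgroup v1: limit resize below usage, and force_empty */
    cg_write("/sys/fs/cgroup/memory/mygroup", "memory.limit_in_bytes", "104857600");
    cg_write("/sys/fs/cgroup/memory/mygroup", "memory.force_empty", "0");

    /* (b) cgroup v2: max, high and proactive reclaim */
    cg_write("/sys/fs/cgroup/mygroup", "memory.max", "104857600");
    cg_write("/sys/fs/cgroup/mygroup", "memory.high", "52428800");
    cg_write("/sys/fs/cgroup/mygroup", "memory.reclaim", "16777216");

    return 0;
}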

> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 27, 2022, 12:31 p.m. UTC | #19
On Mon 27-06-22 02:39:49, Yosry Ahmed wrote:
[...]
> (a) Do not count vmpressure for mem_cgroup_resize_max() and
> mem_cgroup_force_empty() in v1.

yes, unless you have a very good reason to change that. E.g. this has
been buggy and we have finally understood that. But I do not see any
indications so far.

> (b) Do not count vmpressure (consequently,
> mem_cgroup_under_socket_pressure()) in v2 where psi is not counted
> (writing to memory.max, memory.high, and memory.reclaim).

I can see clear arguments for memory.reclaim opt out for vmpressure
because we have established that this is not a measure to express a
memory pressure on the cgroup.
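
To make the "not a measure of memory pressure" point concrete: proactive
reclaim is driven by a userspace agent on its own schedule rather than by
allocations hitting a limit. A toy sketch of such an agent (the path, chunk
size, interval and back-off policy are all made up; the only real interface
used is the memory.reclaim write, and the sketch assumes a failed write means
the requested amount could not be reclaimed):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/fs/cgroup/mygroup/memory.reclaim"; /* example */
    const char *chunk = "16777216";     /* ~16M per probe */
    unsigned int interval = 60;         /* seconds between probes */

    for (;;) {
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
            perror("open memory.reclaim");
            return 1;
        }
        if (write(fd, chunk, strlen(chunk)) < 0)
            /* Could not reclaim the full chunk: probe less aggressively. */
            interval = interval < 600 ? interval * 2 : interval;
        else
            interval = 60;
        close(fd);
        sleep(interval);
    }
}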

Max/High are less clear to me, TBH. I do understand reasoning for PSI
exclusion because considering the calling process to be stalled and
non-productive is misleading. It just does its work so in a way it is
a productive time in the end. For the vmpressure, which measures how
hard/easy it is to reclaim memory, why should this be special for this
particular reclaim?

Again, an explanation of the effect on the socket pressure could give a
better picture. Say that somebody reduces the limit (hard/high) and it
takes quite some effort to shrink the consumption down. Should the
networking layer react to that in any way or should it wait for the
active allocation during that process to find that out?
Yosry Ahmed June 27, 2022, 5:03 p.m. UTC | #20
On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 27-06-22 02:39:49, Yosry Ahmed wrote:
> [...]
> > (a) Do not count vmpressure for mem_cgroup_resize_max() and
> > mem_cgroup_force_empty() in v1.
>
> yes, unless you have a very good reason to change that. E.g. this has
> been buggy and we have finally understood that. But I do not see any
> indications so far.

I don't have any bug reports. It makes sense that users do not expect
vmpressure notifications when they resize the limits below the current
usage, because it should be expected that reclaim will happen so
receiving notifications here is redundant, and may be incorrectly
perceived by a different user space thread as being under memory
pressure. But I get your point that what the user sees as memory
pressure or not could be different, and is probably already defined by
the current behavior anyway, whether it makes sense or not.

I can also see some userspace applications depending on this behavior
in some way, either by handling that limit resize notification in a
certain way or deliberately dropping it. Either way, making this
change could throw them off. I don't expect any userspace applications
to crash of course (because there are cases where they won't receive
notifications, e.g. scanned < vmpressure_win), but perhaps it's not
worth even risking misguiding them.
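
(For anyone following along, the mechanism referenced above: vmpressure only
emits an event once a full window of pages has been scanned, and then derives
the level from how much of the scanned window went unreclaimed. A simplified
userspace model of that logic, with the window and thresholds taken from my
reading of mm/vmpressure.c -- treat the exact numbers as illustrative:)

#include <stdio.h>

/* Simplified model of mm/vmpressure.c's level calculation, not kernel code. */
#define SWAP_CLUSTER_MAX    32UL
#define VMPRESSURE_WIN      (SWAP_CLUSTER_MAX * 16)   /* 512 pages */
#define LEVEL_MED           60    /* % unreclaimed -> "medium" */
#define LEVEL_CRITICAL      95    /* % unreclaimed -> "critical" */

static const char *vmpressure_level(unsigned long scanned, unsigned long reclaimed)
{
    unsigned long pressure;

    /* Below a window's worth of scanning, no event is generated at all. */
    if (scanned < VMPRESSURE_WIN)
        return "none (below window)";

    /* reclaimed can exceed scanned (THP etc.); clamp that to "low". */
    if (reclaimed >= scanned)
        return "low";

    pressure = 100 * (scanned - reclaimed) / scanned;
    if (pressure >= LEVEL_CRITICAL)
        return "critical";
    if (pressure >= LEVEL_MED)
        return "medium";
    return "low";
}

int main(void)
{
    printf("%s\n", vmpressure_level(256, 200));   /* below window */
    printf("%s\n", vmpressure_level(1024, 900));  /* easy reclaim */
    printf("%s\n", vmpressure_level(1024, 300));  /* harder       */
    printf("%s\n", vmpressure_level(1024, 10));   /* very hard    */
    return 0;
}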

So I agree that just because the current behavior doesn't make sense or
is inconsistent with other definitions of behavior, that alone doesn't
justify making a visible change for userspace. I will drop the v1
changes in the next version anyway.

Thanks!

>
> > (b) Do not count vmpressure (consequently,
> > mem_cgroup_under_socket_pressure()) in v2 where psi is not counted
> > (writing to memory.max, memory.high, and memory.reclaim).
>
> I can see clear arguments for memory.reclaim opt out for vmpressure
> because we have established that this is not a measure to express a
> memory pressure on the cgroup.
>
> Max/High are less clear to me, TBH. I do understand reasoning for PSI
> exclusion because considering the calling process to be stalled and
> non-productive is misleading. It just does its work so in a way it is
> a productive time in the end. For the vmpressure, which measures how
> hard/easy it is to reclaim memory, why should this be special for this
> particular reclaim?
>
> Again, an explanation of the effect on the socket pressure could give a
> better picture. Say that somebody reduces the limit (hard/high) and it
> takes quite some effort to shrink the consumption down. Should the
> networking layer react to that in any way or should it wait for the
> active allocation during that process to find that out?

I am out of my depth here. Any answer on my side would be purely
speculation at this point. Shakeel, can you help us here or tag some
networking people?
Thanks!

> --
> Michal Hocko
> SUSE Labs
Shakeel Butt June 30, 2022, 1:07 a.m. UTC | #21
On Mon, Jun 27, 2022 at 10:04 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote:
> >
[...]
> >
> > I can see clear arguments for memory.reclaim opt out for vmpressure
> > because we have established that this is not a measure to express a
> > memory pressure on the cgroup.
> >
> > Max/High are less clear to me, TBH. I do understand reasoning for PSI
> > exclusion because considering the calling process to be stalled and
> > non-productive is misleading. It just does its work so in a way it is
> > a productive time in the end. For the vmpressure, which measures how
> > hard/easy it is to reclaim memory, why should this be special for this
> > particular reclaim?
> >
> > Again, an explanation of the effect on the socket pressure could give a
> > better picture. Say that somebody reduces the limit (hard/high) and it
> > takes quite some effort to shrink the consumption down. Should the
> > networking layer react to that in any way or should it wait for the
> > active allocation during that process to find that out?
>
> I am out of my depth here. Any answer on my side would be purely
> speculation at this point. Shakeel, can you help us here or tag some
> networking people?

So, the effects of returning true from mem_cgroup_under_socket_pressure() are:

1. Reducing send and receive buffers of the current socket.
2. May drop packets on the rx path.
3. May throttle current thread on the tx path.

Now regarding the behavior from the reclaim due to reducing max or
high, I think the kernel should not ignore vmpressure. Please note
that unlike PSI which is associated with the current process,
vmpressure is associated with the target memcg. So, any reclaim on
that memcg due to real shortage of memory should not be ignored. That
reclaim can be global reclaim or limit reclaim of ancestor or itself
or reclaim due to lowering the limit of ancestor or itself.
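
To spell out where those effects come from (a sketch of my understanding of
the v2 path, not a quote of the code): a reclaim sample that shows more than
"low" difficulty arms memcg->socket_pressure roughly one second into the
future, and mem_cgroup_under_socket_pressure() reports pressure while that
clock is armed on the socket's memcg or any of its ancestors -- which is why
reclaim anywhere up the hierarchy is visible to the sockets below. A
userspace model of just that logic:

#include <stdio.h>
#include <time.h>

/*
 * Userspace model of the v2 socket-pressure plumbing (not kernel code):
 * a reclaim sample above "low" difficulty arms a one-second timer on the
 * target memcg, and the check walks the memcg and all of its ancestors.
 */
struct memcg {
    struct memcg *parent;
    time_t socket_pressure;     /* "under pressure until" timestamp */
};

static void vmpressure_sample(struct memcg *memcg, unsigned long scanned,
                              unsigned long reclaimed)
{
    /* 60% or more of the scanned window left unreclaimed -> above "low" */
    if (scanned && 100 * (scanned - reclaimed) / scanned >= 60)
        memcg->socket_pressure = time(NULL) + 1;
}

static int under_socket_pressure(struct memcg *memcg)
{
    do {
        if (time(NULL) < memcg->socket_pressure)
            return 1;
    } while ((memcg = memcg->parent));
    return 0;
}

int main(void)
{
    struct memcg root = { 0 };
    struct memcg parent = { .parent = &root };
    struct memcg child = { .parent = &parent };

    /* Hard reclaim against the parent is visible to the child's sockets. */
    vmpressure_sample(&parent, 512, 50);
    printf("child under pressure: %d\n", under_socket_pressure(&child));
    return 0;
}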
Yosry Ahmed June 30, 2022, 2:08 a.m. UTC | #22
On Wed, Jun 29, 2022 at 6:07 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Jun 27, 2022 at 10:04 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> [...]
> > >
> > > I can see clear arguments for memory.reclaim opt out for vmpressure
> > > because we have established that this is not a measure to express a
> > > memory pressure on the cgroup.
> > >
> > > Max/High are less clear to me, TBH. I do understand reasoning for PSI
> > > exclusion because considering the calling process to be stalled and
> > > non-productive is misleading. It just does its work so in a way it is
> > > a productive time in the end. For the vmpressure, which measures how
> > > hard/easy it is to reclaim memory, why should this be special for this
> > > particular reclaim?
> > >
> > > Again, an explanation of the effect on the socket pressure could give a
> > > better picture. Say that somebody reduces the limit (hard/high) and it
> > > takes quite some effort to shrink the consumption down. Should the
> > > networking layer react to that in any way or should it wait for the
> > > active allocation during that process to find that out?
> >
> > I am out of my depth here. Any answer on my side would be purely
> > speculation at this point. Shakeel, can you help us here or tag some
> > networking people?
>
> So, the effects of returning true from mem_cgroup_under_socket_pressure() are:
>
> 1. Reducing send and receive buffers of the current socket.
> 2. May drop packets on the rx path.
> 3. May throttle current thread on the tx path.
>
> Now regarding the behavior from the reclaim due to reducing max or
> high, I think the kernel should not ignore vmpressure. Please note
> that unlike PSI which is associated with the current process,
> vmpressure is associated with the target memcg. So, any reclaim on
> that memcg due to real shortage of memory should not be ignored. That
> reclaim can be global reclaim or limit reclaim of ancestor or itself
> or reclaim due to lowering the limit of ancestor or itself.

So it seems like we should only ignore vmpressure for proactive
reclaim (aka memory.reclaim).

Michal, let me know what you think here, I can drop psi and
limit-setting changes in v3 and basically just ignore vmpressure for
memory.reclaim (MEMCG_RECLAIM_PROACTIVE / sc->proactive instead of
MEMCG_RECLAIM_CONTROLLED / sc->controlled maybe).
Michal Hocko June 30, 2022, 8:22 a.m. UTC | #23
On Wed 29-06-22 19:08:42, Yosry Ahmed wrote:
> On Wed, Jun 29, 2022 at 6:07 PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Mon, Jun 27, 2022 at 10:04 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > [...]
> > > >
> > > > I can see clear arguments for memory.reclaim opt out for vmpressure
> > > > because we have established that this is not a measure to express a
> > > > memory pressure on the cgroup.
> > > >
> > > > Max/High are less clear to me, TBH. I do understand reasoning for PSI
> > > > exclusion because considering the calling process to be stalled and
> > > > non-productive is misleading. It just does its work so in a way it is
> > > > a productive time in the end. For the vmpressure, which measures how
> > > > hard/easy it is to reclaim memory, why should this be special for this
> > > > particular reclaim?
> > > >
> > > > Again, an explanation of the effect on the socket pressure could give a
> > > > better picture. Say that somebody reduces the limit (hard/high) and it
> > > > takes quite some effort to shrink the consumption down. Should the
> > > > networking layer react to that in any way or should it wait for the
> > > > active allocation during that process to find that out?
> > >
> > > I am out of my depth here. Any answer on my side would be purely
> > > speculation at this point. Shakeel, can you help us here or tag some
> > > networking people?
> >
> > So, the effects of returning true from mem_cgroup_under_socket_pressure() are:
> >
> > 1. Reducing send and receive buffers of the current socket.
> > 2. May drop packets on the rx path.
> > 3. May throttle current thread on the tx path.
> >
> > Now regarding the behavior from the reclaim due to reducing max or
> > high, I think the kernel should not ignore vmpressure. Please note
> > that unlike PSI which is associated with the current process,
> > vmpressure is associated with the target memcg. So, any reclaim on
> > that memcg due to real shortage of memory should not be ignored. That
> > reclaim can be global reclaim or limit reclaim of ancestor or itself
> > or reclaim due to lowering the limit of ancestor or itself.
> 
> So it seems like we should only ignore vmpressure for proactive
> reclaim (aka memory.reclaim).
> 
> Michal, let me know what you think here, I can drop psi and
> limit-setting changes in v3 and basically just ignore vmpressure for
> memory.reclaim (MEMCG_RECLAIM_PROACTIVE / sc->proactive instead of
> MEMCG_RECLAIM_CONTROLLED / sc->controlled maybe).

Yes, that makes much more sense to me.
diff mbox series

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0c0fed1b348f2..5a6766e417afe 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -411,10 +411,13 @@  extern void lru_cache_add_inactive_or_unevictable(struct page *page,
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
+
+#define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
+#define MEMCG_RECLAIM_CONTROLLED (1 << 2)
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  bool may_swap);
+						  unsigned int reclaim_options);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe64..a76bb7ae76f73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2319,20 +2319,16 @@  static unsigned long reclaim_high(struct mem_cgroup *memcg,
 				  gfp_t gfp_mask)
 {
 	unsigned long nr_reclaimed = 0;
+	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
 
 	do {
-		unsigned long pflags;
-
 		if (page_counter_read(&memcg->memory) <=
 		    READ_ONCE(memcg->memory.high))
 			continue;
-
 		memcg_memory_event(memcg, MEMCG_HIGH);
-
-		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
-							     gfp_mask, true);
-		psi_memstall_leave(&pflags);
+							     gfp_mask,
+							     reclaim_options);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
 
@@ -2576,9 +2572,8 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	unsigned long nr_reclaimed;
 	bool passed_oom = false;
-	bool may_swap = true;
+	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
 	bool drained = false;
-	unsigned long pflags;
 
 retry:
 	if (consume_stock(memcg, nr_pages))
@@ -2593,7 +2588,7 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		mem_over_limit = mem_cgroup_from_counter(counter, memory);
 	} else {
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
-		may_swap = false;
+		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
 	}
 
 	if (batch > nr_pages) {
@@ -2618,10 +2613,8 @@  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	memcg_memory_event(mem_over_limit, MEMCG_MAX);
 
-	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, may_swap);
-	psi_memstall_leave(&pflags);
+						    gfp_mask, reclaim_options);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		goto retry;
@@ -3369,7 +3362,9 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 	int ret;
 	bool limits_invariant;
 	struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
+	unsigned int reclaim_options = memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP;
 
+	reclaim_options |= MEMCG_RECLAIM_CONTROLLED;
 	do {
 		if (signal_pending(current)) {
 			ret = -EINTR;
@@ -3403,7 +3398,7 @@  static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1,
-					GFP_KERNEL, !memsw)) {
+					GFP_KERNEL, reclaim_options)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3502,6 +3497,8 @@  unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 {
 	int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED |
+		MEMCG_RECLAIM_MAY_SWAP;
 
 	/* we call try-to-free pages for make this cgroup empty */
 	lru_add_drain_all();
@@ -3513,7 +3510,8 @@  static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 		if (signal_pending(current))
 			return -EINTR;
 
-		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true))
+		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
+						  reclaim_options))
 			nr_retries--;
 	}
 
@@ -6215,6 +6213,8 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	bool drained = false;
 	unsigned long high;
+	unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED |
+		MEMCG_RECLAIM_MAY_SWAP;
 	int err;
 
 	buf = strstrip(buf);
@@ -6241,7 +6241,7 @@  static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-							 GFP_KERNEL, true);
+						GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6264,6 +6264,8 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 	unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
 	bool drained = false;
 	unsigned long max;
+	unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED |
+		MEMCG_RECLAIM_MAY_SWAP;
 	int err;
 
 	buf = strstrip(buf);
@@ -6290,7 +6292,7 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-							  GFP_KERNEL, true))
+						GFP_KERNEL, reclaim_options))
 				nr_reclaims--;
 			continue;
 		}
@@ -6419,6 +6421,8 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED |
+		MEMCG_RECLAIM_MAY_SWAP;
 	int err;
 
 	buf = strstrip(buf);
@@ -6442,7 +6446,7 @@  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, true);
+						GFP_KERNEL, reclaim_options);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d9a683e3a7d..6efe7660f7f78 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -102,6 +102,9 @@  struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
+	/* Reclaim is controlled by userspace */
+	unsigned int controlled:1;
+
 	/*
 	 * Cgroup memory below memory.low is protected as long as we
 	 * don't threaten to OOM. If any cgroup is reclaimed at
@@ -3125,9 +3128,10 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			    sc->priority);
 
 		/* Record the group's reclaim efficiency */
-		vmpressure(sc->gfp_mask, memcg, false,
-			   sc->nr_scanned - scanned,
-			   sc->nr_reclaimed - reclaimed);
+		if (!sc->controlled)
+			vmpressure(sc->gfp_mask, memcg, false,
+				   sc->nr_scanned - scanned,
+				   sc->nr_reclaimed - reclaimed);
 
 	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));
 }
@@ -3250,9 +3254,10 @@  static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	}
 
 	/* Record the subtree's reclaim efficiency */
-	vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
-		   sc->nr_scanned - nr_scanned,
-		   sc->nr_reclaimed - nr_reclaimed);
+	if (!sc->controlled)
+		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
+			   sc->nr_scanned - nr_scanned,
+			   sc->nr_reclaimed - nr_reclaimed);
 
 	if (sc->nr_reclaimed - nr_reclaimed)
 		reclaimable = true;
@@ -3534,8 +3539,9 @@  static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
 
 	do {
-		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
-				sc->priority);
+		if (!sc->controlled)
+			vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
+					sc->priority);
 		sc->nr_scanned = 0;
 		shrink_zones(zonelist, sc);
 
@@ -3751,6 +3757,7 @@  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.controlled = 0,
 	};
 
 	/*
@@ -3825,10 +3832,12 @@  unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   bool may_swap)
+					   unsigned int reclaim_options)
 {
 	unsigned long nr_reclaimed;
+	unsigned long pflags;
 	unsigned int noreclaim_flag;
+	bool controlled_reclaim = reclaim_options & MEMCG_RECLAIM_CONTROLLED;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
 		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
@@ -3838,7 +3847,8 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.priority = DEF_PRIORITY,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
-		.may_swap = may_swap,
+		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
+		.controlled = controlled_reclaim,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
@@ -3848,12 +3858,19 @@  unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
+
 	trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
+
+	if (!controlled_reclaim)
+		psi_memstall_enter(&pflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 
 	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
+	if (!controlled_reclaim)
+		psi_memstall_leave(&pflags);
+
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 	set_task_reclaim_state(current, NULL);
 
@@ -4095,6 +4112,7 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
 		.may_unmap = 1,
+		.controlled = 0,
 	};
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
@@ -4555,6 +4573,7 @@  unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 		.may_unmap = 1,
 		.may_swap = 1,
 		.hibernation_mode = 1,
+		.controlled = 0,
 	};
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 	unsigned long nr_reclaimed;
@@ -4707,6 +4726,7 @@  static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
+		.controlled = 0,
 	};
 	unsigned long pflags;