Message ID | 20220623000530.1194226-1-yosryahmed@google.com
---|---
State | New
Series | mm: vmpressure: don't count userspace-induced reclaim as memory pressure
On Thu, 23 Jun 2022 00:05:30 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim > as memory pressure") made sure that memory reclaim that is induced by > userspace (limit-setting, proactive reclaim, ..) is not counted as > memory pressure for the purposes of psi. > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers > from try_charge() and reclaim_high() wrap the call to > try_to_free_mem_cgroup_pages() with psi handlers. > > However, vmpressure is still counted in these cases where reclaim is > directly induced by userspace. This patch makes sure vmpressure is not > counted in those operations, in the same way as psi. Since vmpressure > calls need to happen deeper within the reclaim path, the same approach > could not be followed. Hence, a new "controlled" flag is added to struct > scan_control to flag a reclaim operation that is controlled by > userspace. This flag is set by limit-setting and proactive reclaim > operations, and is used to count vmpressure correctly. > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9 > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") > is effectively reverted and the same flag is used to control psi as > well. I'll await reviewer input on this, but I can always do trivia! > @@ -3502,6 +3497,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, > static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > { > int nr_retries = MAX_RECLAIM_RETRIES; > + unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED | > + MEMCG_RECLAIM_MAY_SWAP; If it doesn't fit, it's nicer to do unsigned int reclaim_options; ... reclaim_options = MEMCG_RECLAIM_CONTROLLED | MEMCG_RECLAIM_MAY_SWAP; (several places) > @@ -3751,6 +3757,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > .may_writepage = !laptop_mode, > .may_unmap = 1, > .may_swap = 1, > + .controlled = 0, > }; Let's just skip all these initializations to zero, let the compiler take care of it. > @@ -4095,6 +4112,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) > .gfp_mask = GFP_KERNEL, > .order = order, > .may_unmap = 1, > + .controlled = 0, > }; > > set_task_reclaim_state(current, &sc.reclaim_state); > @@ -4555,6 +4573,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) > .may_unmap = 1, > .may_swap = 1, > .hibernation_mode = 1, > + .controlled = 0, > }; > struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); > unsigned long nr_reclaimed; > @@ -4707,6 +4726,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), > .may_swap = 1, > .reclaim_idx = gfp_zone(gfp_mask), > + .controlled = 0, > }; > unsigned long pflags;
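A quick illustration of Andrew's second point: in C, any member omitted from a designated initializer is zero-initialized by the compiler, so the explicit ".controlled = 0" lines in the hunks above add nothing. The minimal, stand-alone sketch below demonstrates the rule; scan_control_demo is an illustrative stand-in, not the kernel's struct scan_control.

#include <assert.h>

/* Illustrative stand-in for the kernel's struct scan_control. */
struct scan_control_demo {
        unsigned int may_swap;
        unsigned int controlled;
};

int main(void)
{
        /*
         * Any field left out of a designated initializer is implicitly
         * zero-initialized, so ".controlled = 0" would be redundant here.
         */
        struct scan_control_demo sc = {
                .may_swap = 1,
        };

        assert(sc.controlled == 0);
        return 0;
}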
On Wed, Jun 22, 2022 at 5:16 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Thu, 23 Jun 2022 00:05:30 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > > > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim > > as memory pressure") made sure that memory reclaim that is induced by > > userspace (limit-setting, proactive reclaim, ..) is not counted as > > memory pressure for the purposes of psi. > > > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers > > from try_charge() and reclaim_high() wrap the call to > > try_to_free_mem_cgroup_pages() with psi handlers. > > > > However, vmpressure is still counted in these cases where reclaim is > > directly induced by userspace. This patch makes sure vmpressure is not > > counted in those operations, in the same way as psi. Since vmpressure > > calls need to happen deeper within the reclaim path, the same approach > > could not be followed. Hence, a new "controlled" flag is added to struct > > scan_control to flag a reclaim operation that is controlled by > > userspace. This flag is set by limit-setting and proactive reclaim > > operations, and is used to count vmpressure correctly. > > > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9 > > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") > > is effectively reverted and the same flag is used to control psi as > > well. > > I'll await reviewer input on this, but I can always do trivia! Thanks for taking a look so quickly, will address and send v2 soon! > > > @@ -3502,6 +3497,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, > > static int mem_cgroup_force_empty(struct mem_cgroup *memcg) > > { > > int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED | > > + MEMCG_RECLAIM_MAY_SWAP; > > If it doesn't fit, it's nicer to do > > unsigned int reclaim_options; > ... > > reclaim_options = MEMCG_RECLAIM_CONTROLLED | MEMCG_RECLAIM_MAY_SWAP; > > (several places) > > > @@ -3751,6 +3757,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > > .may_writepage = !laptop_mode, > > .may_unmap = 1, > > .may_swap = 1, > > + .controlled = 0, > > }; > > Let's just skip all these initializations to zero, let the compiler take > care of it. > > > @@ -4095,6 +4112,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) > > .gfp_mask = GFP_KERNEL, > > .order = order, > > .may_unmap = 1, > > + .controlled = 0, > > }; > > > > set_task_reclaim_state(current, &sc.reclaim_state); > > @@ -4555,6 +4573,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) > > .may_unmap = 1, > > .may_swap = 1, > > .hibernation_mode = 1, > > + .controlled = 0, > > }; > > struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); > > unsigned long nr_reclaimed; > > @@ -4707,6 +4726,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in > > .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), > > .may_swap = 1, > > .reclaim_idx = gfp_zone(gfp_mask), > > + .controlled = 0, > > }; > > unsigned long pflags; > >
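For context on the mechanism the commit message describes: a "controlled" flag in struct scan_control is set by limit-setting and proactive reclaim, and pressure accounting (vmpressure, and after the effective revert of e22c6ed90aa9 also psi) is skipped when it is set. The actual mm/vmscan.c hunks are not quoted in this thread, so the sketch below is only a hypothetical user-space model of that gating; scan_control_demo, account_vmpressure() and account_psi() are stand-in names, not the patch's code.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the gating described in the commit message. */
struct scan_control_demo {
        bool controlled;        /* reclaim was requested directly by userspace */
};

static void account_vmpressure(void) { puts("vmpressure counted"); }
static void account_psi(void)        { puts("psi counted"); }

static void reclaim_demo(const struct scan_control_demo *sc)
{
        /*
         * Only allocation-driven reclaim counts as memory pressure;
         * limit-setting and proactive reclaim set sc->controlled and are
         * skipped for both vmpressure and psi.
         */
        if (!sc->controlled) {
                account_vmpressure();
                account_psi();
        }
}

int main(void)
{
        struct scan_control_demo alloc_reclaim = { .controlled = false };
        struct scan_control_demo proactive_reclaim = { .controlled = true };

        reclaim_demo(&alloc_reclaim);           /* counted */
        reclaim_demo(&proactive_reclaim);       /* not counted */
        return 0;
}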
On Thu 23-06-22 00:05:30, Yosry Ahmed wrote: > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim > as memory pressure") made sure that memory reclaim that is induced by > userspace (limit-setting, proactive reclaim, ..) is not counted as > memory pressure for the purposes of psi. > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers > from try_charge() and reclaim_high() wrap the call to > try_to_free_mem_cgroup_pages() with psi handlers. > > However, vmpressure is still counted in these cases where reclaim is > directly induced by userspace. This patch makes sure vmpressure is not > counted in those operations, in the same way as psi. Since vmpressure > calls need to happen deeper within the reclaim path, the same approach > could not be followed. Hence, a new "controlled" flag is added to struct > scan_control to flag a reclaim operation that is controlled by > userspace. This flag is set by limit-setting and proactive reclaim > operations, and is used to count vmpressure correctly. > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9 > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") > is effectively reverted and the same flag is used to control psi as > well. Why do we need to add this to a legacy interface now? Are there any pre-existing users who realized this is bugging them? Please be more specific about the use case.
On Thu, Jun 23, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 23-06-22 00:05:30, Yosry Ahmed wrote: > > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim > > as memory pressure") made sure that memory reclaim that is induced by > > userspace (limit-setting, proactive reclaim, ..) is not counted as > > memory pressure for the purposes of psi. > > > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers > > from try_charge() and reclaim_high() wrap the call to > > try_to_free_mem_cgroup_pages() with psi handlers. > > > > However, vmpressure is still counted in these cases where reclaim is > > directly induced by userspace. This patch makes sure vmpressure is not > > counted in those operations, in the same way as psi. Since vmpressure > > calls need to happen deeper within the reclaim path, the same approach > > could not be followed. Hence, a new "controlled" flag is added to struct > > scan_control to flag a reclaim operation that is controlled by > > userspace. This flag is set by limit-setting and proactive reclaim > > operations, and is used to count vmpressure correctly. > > > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9 > > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") > > is effectively reverted and the same flag is used to control psi as > > well. > > Why do we need to add this is a legacy interface now? Are there any > pre-existing users who realized this is bugging them? Please be more > specific about the usecase. Sorry if I wasn't clear enough. Unfortunately we still have userspace workloads at Google that use vmpressure notifications. In our internal version of memory.reclaim that we recently upstreamed, we do not account vmpressure during proactive reclaim (similar to how psi is handled upstream). We want to make sure this behavior also exists in the upstream version so that consolidating them does not break our users who rely on vmpressure and will start seeing increased pressure due to proactive reclaim. > > -- > Michal Hocko > SUSE Labs
On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > On Thu, Jun 23, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 23-06-22 00:05:30, Yosry Ahmed wrote: > > > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim > > > as memory pressure") made sure that memory reclaim that is induced by > > > userspace (limit-setting, proactive reclaim, ..) is not counted as > > > memory pressure for the purposes of psi. > > > > > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers > > > from try_charge() and reclaim_high() wrap the call to > > > try_to_free_mem_cgroup_pages() with psi handlers. > > > > > > However, vmpressure is still counted in these cases where reclaim is > > > directly induced by userspace. This patch makes sure vmpressure is not > > > counted in those operations, in the same way as psi. Since vmpressure > > > calls need to happen deeper within the reclaim path, the same approach > > > could not be followed. Hence, a new "controlled" flag is added to struct > > > scan_control to flag a reclaim operation that is controlled by > > > userspace. This flag is set by limit-setting and proactive reclaim > > > operations, and is used to count vmpressure correctly. > > > > > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9 > > > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") > > > is effectively reverted and the same flag is used to control psi as > > > well. > > > > Why do we need to add this is a legacy interface now? Are there any > > pre-existing users who realized this is bugging them? Please be more > > specific about the usecase. > > Sorry if I wasn't clear enough. Unfortunately we still have userspace > workloads at Google that use vmpressure notifications. > > In our internal version of memory.reclaim that we recently upstreamed, > we do not account vmpressure during proactive reclaim (similar to how > psi is handled upstream). We want to make sure this behavior also > exists in the upstream version so that consolidating them does not > break our users who rely on vmpressure and will start seeing increased > pressure due to proactive reclaim. These are good reasons to have this patch in your tree. But why is this patch benefitial for the upstream kernel? It clearly adds some code and some special casing which will add a maintenance overhead.
On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > On Thu, Jun 23, 2022 at 1:05 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 23-06-22 00:05:30, Yosry Ahmed wrote: > > > > Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim > > > > as memory pressure") made sure that memory reclaim that is induced by > > > > userspace (limit-setting, proactive reclaim, ..) is not counted as > > > > memory pressure for the purposes of psi. > > > > > > > > Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers > > > > from try_charge() and reclaim_high() wrap the call to > > > > try_to_free_mem_cgroup_pages() with psi handlers. > > > > > > > > However, vmpressure is still counted in these cases where reclaim is > > > > directly induced by userspace. This patch makes sure vmpressure is not > > > > counted in those operations, in the same way as psi. Since vmpressure > > > > calls need to happen deeper within the reclaim path, the same approach > > > > could not be followed. Hence, a new "controlled" flag is added to struct > > > > scan_control to flag a reclaim operation that is controlled by > > > > userspace. This flag is set by limit-setting and proactive reclaim > > > > operations, and is used to count vmpressure correctly. > > > > > > > > To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9 > > > > ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") > > > > is effectively reverted and the same flag is used to control psi as > > > > well. > > > > > > Why do we need to add this is a legacy interface now? Are there any > > > pre-existing users who realized this is bugging them? Please be more > > > specific about the usecase. > > > > Sorry if I wasn't clear enough. Unfortunately we still have userspace > > workloads at Google that use vmpressure notifications. > > > > In our internal version of memory.reclaim that we recently upstreamed, > > we do not account vmpressure during proactive reclaim (similar to how > > psi is handled upstream). We want to make sure this behavior also > > exists in the upstream version so that consolidating them does not > > break our users who rely on vmpressure and will start seeing increased > > pressure due to proactive reclaim. > > These are good reasons to have this patch in your tree. But why is this > patch benefitial for the upstream kernel? It clearly adds some code and > some special casing which will add a maintenance overhead. It is not just Google, any existing vmpressure users will start seeing false pressure notifications with memory.reclaim. The main goal of the patch is to make sure memory.reclaim does not break pre-existing users of vmpressure, and doing it in a way that is consistent with psi makes sense. > -- > Michal Hocko > SUSE Labs
On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: [...] > > > In our internal version of memory.reclaim that we recently upstreamed, > > > we do not account vmpressure during proactive reclaim (similar to how > > > psi is handled upstream). We want to make sure this behavior also > > > exists in the upstream version so that consolidating them does not > > > break our users who rely on vmpressure and will start seeing increased > > > pressure due to proactive reclaim. > > > > These are good reasons to have this patch in your tree. But why is this > > patch benefitial for the upstream kernel? It clearly adds some code and > > some special casing which will add a maintenance overhead. > > It is not just Google, any existing vmpressure users will start seeing > false pressure notifications with memory.reclaim. The main goal of the > patch is to make sure memory.reclaim does not break pre-existing users > of vmpressure, and doing it in a way that is consistent with psi makes > sense. memory.reclaim is v2 only feature which doesn't have vmpressure interface. So I do not see how pre-existing users of the upstream kernel can see any breakage.
On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > [...] > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > psi is handled upstream). We want to make sure this behavior also > > > > exists in the upstream version so that consolidating them does not > > > > break our users who rely on vmpressure and will start seeing increased > > > > pressure due to proactive reclaim. > > > > > > These are good reasons to have this patch in your tree. But why is this > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > some special casing which will add a maintenance overhead. > > > > It is not just Google, any existing vmpressure users will start seeing > > false pressure notifications with memory.reclaim. The main goal of the > > patch is to make sure memory.reclaim does not break pre-existing users > > of vmpressure, and doing it in a way that is consistent with psi makes > > sense. > > memory.reclaim is v2 only feature which doesn't have vmpressure > interface. So I do not see how pre-existing users of the upstream kernel > can see any breakage. > Please note that vmpressure is still being used in v2 by the networking layer (see mem_cgroup_under_socket_pressure()) for detecting memory pressure. Though IMO we should deprecate vmpressure altogether.
On Thu, Jun 23, 2022 at 9:42 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > [...] > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > exists in the upstream version so that consolidating them does not > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > pressure due to proactive reclaim. > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > some special casing which will add a maintenance overhead. > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > false pressure notifications with memory.reclaim. The main goal of the > > > patch is to make sure memory.reclaim does not break pre-existing users > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > sense. > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > interface. So I do not see how pre-existing users of the upstream kernel > > can see any breakage. > > > > Please note that vmpressure is still being used in v2 by the > networking layer (see mem_cgroup_under_socket_pressure()) for > detecting memory pressure. > > Though IMO we should deprecate vmpressure altogether. Thanks Shakeel for mentioning that, I was just about to. Although I agree vmpressure should be deprecated at some point, the current state is that memory.reclaim will give incorrect vmpressure signals. IMO psi and vmpressure (though legacy) both signify memory pressure and should both be consistent as to what is being accounted for.
On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > [...] > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > exists in the upstream version so that consolidating them does not > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > pressure due to proactive reclaim. > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > some special casing which will add a maintenance overhead. > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > false pressure notifications with memory.reclaim. The main goal of the > > > patch is to make sure memory.reclaim does not break pre-existing users > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > sense. > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > interface. So I do not see how pre-existing users of the upstream kernel > > can see any breakage. > > > > Please note that vmpressure is still being used in v2 by the > networking layer (see mem_cgroup_under_socket_pressure()) for > detecting memory pressure. I have missed this. It is hidden quite good. I thought that v2 is completely vmpressure free. I have to admit that the effect of mem_cgroup_under_socket_pressure is not really clear to me. Not to mention whether it should or shouldn't be triggered for the user triggered memory reclaim. So this would really need some explanation. > Though IMO we should deprecate vmpressure altogether. Yes it should be really limited to v1. But as I've said the effect on mem_cgroup_under_socket_pressure is not really clear to me. It really seems the v2 support has been introduced deliberately.
On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > [...] > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > exists in the upstream version so that consolidating them does not > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > pressure due to proactive reclaim. > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > some special casing which will add a maintenance overhead. > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > sense. > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > interface. So I do not see how pre-existing users of the upstream kernel > > > can see any breakage. > > > > > > > Please note that vmpressure is still being used in v2 by the > > networking layer (see mem_cgroup_under_socket_pressure()) for > > detecting memory pressure. > > I have missed this. It is hidden quite good. I thought that v2 is > completely vmpressure free. I have to admit that the effect of > mem_cgroup_under_socket_pressure is not really clear to me. Not to > mention whether it should or shouldn't be triggered for the user > triggered memory reclaim. So this would really need some explanation. vmpressure was tied into socket pressure by 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure"). A quick look at the commit log and the code suggests that this is used all over the socket and tcp code to throttles the memory consumption of the networking layer if we are under pressure. However, for proactive reclaim like memory.reclaim, the target is to probe the memcg for cold memory. Reclaiming such memory should not have a visible effect on the workload performance. I don't think that any network throttling side effects are correct here. > > > Though IMO we should deprecate vmpressure altogether. > > Yes it should be really limited to v1. But as I've said the effect on > mem_cgroup_under_socket_pressure is not really clear to me. It really > seems the v2 support has been introduced deliberately. > > -- > Michal Hocko > SUSE Labs
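To make the socket-pressure side effect concrete: since 8e8ae645249b, vmpressure's per-memcg path asserts socket pressure for roughly one second when reclaim efficiency is poor, and the networking layer consults it via mem_cgroup_under_socket_pressure() to curb further memory consumption. The sketch below is a simplified user-space model of that behaviour, not the kernel code; the efficiency threshold and the jiffies/HZ stand-ins are illustrative assumptions.

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the kernel's clock; illustrative only. */
static unsigned long jiffies;
#define HZ 100

struct memcg_demo {
        unsigned long socket_pressure;  /* time until which pressure stays asserted */
};

/* Rough model of vmpressure()'s per-memcg socket-pressure path. */
static void vmpressure_demo(struct memcg_demo *memcg, unsigned long scanned,
                            unsigned long reclaimed)
{
        /* Stand-in threshold; the real code derives a vmpressure level from the ratio. */
        if (scanned && reclaimed * 2 < scanned)
                memcg->socket_pressure = jiffies + HZ;  /* assert pressure for ~1s */
}

/* Rough model of what mem_cgroup_under_socket_pressure() reports to networking. */
static bool under_socket_pressure_demo(const struct memcg_demo *memcg)
{
        return jiffies < memcg->socket_pressure;
}

int main(void)
{
        struct memcg_demo memcg = { 0 };

        vmpressure_demo(&memcg, 512, 16);       /* poor reclaim efficiency */
        printf("throttle sockets: %d\n", under_socket_pressure_demo(&memcg));

        jiffies += 2 * HZ;                      /* pressure window expires */
        printf("throttle sockets: %d\n", under_socket_pressure_demo(&memcg));
        return 0;
}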
On Thu, Jun 23, 2022 at 10:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > > [...] > > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > > exists in the upstream version so that consolidating them does not > > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > > pressure due to proactive reclaim. > > > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > > some special casing which will add a maintenance overhead. > > > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > > sense. > > > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > > interface. So I do not see how pre-existing users of the upstream kernel > > > > can see any breakage. > > > > > > > > > > Please note that vmpressure is still being used in v2 by the > > > networking layer (see mem_cgroup_under_socket_pressure()) for > > > detecting memory pressure. > > > > I have missed this. It is hidden quite good. I thought that v2 is > > completely vmpressure free. I have to admit that the effect of > > mem_cgroup_under_socket_pressure is not really clear to me. Not to > > mention whether it should or shouldn't be triggered for the user > > triggered memory reclaim. So this would really need some explanation. > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm: > memcontrol: hook up vmpressure to socket pressure"). A quick look at > the commit log and the code suggests that this is used all over the > socket and tcp code to throttles the memory consumption of the > networking layer if we are under pressure. > > However, for proactive reclaim like memory.reclaim, the target is to > probe the memcg for cold memory. Reclaiming such memory should not > have a visible effect on the workload performance. I don't think that > any network throttling side effects are correct here. IIUC, this change is fixing two mechanisms during userspace-induced memory pressure: 1. psi accounting, which I think is not controversial and makes sense to me; 2. vmpressure signal, which is a "kinda" obsolete interface and might be viewed as controversial. I would suggest splitting the patch into two, first to fix psi accounting and second to fix vmpressure signal. This way the first one (probably the bigger of the two) can be reviewed and accepted easily while debates continue on the second one. > > > > > > Though IMO we should deprecate vmpressure altogether. > > > > Yes it should be really limited to v1. 
But as I've said the effect on > > mem_cgroup_under_socket_pressure is not really clear to me. It really > > seems the v2 support has been introduced deliberately. > > > > -- > > Michal Hocko > > SUSE Labs
On Fri, Jun 24, 2022 at 3:10 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Jun 23, 2022 at 10:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > > > [...] > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > > > exists in the upstream version so that consolidating them does not > > > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > > > pressure due to proactive reclaim. > > > > > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > > > some special casing which will add a maintenance overhead. > > > > > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > > > sense. > > > > > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > > > interface. So I do not see how pre-existing users of the upstream kernel > > > > > can see any breakage. > > > > > > > > > > > > > Please note that vmpressure is still being used in v2 by the > > > > networking layer (see mem_cgroup_under_socket_pressure()) for > > > > detecting memory pressure. > > > > > > I have missed this. It is hidden quite good. I thought that v2 is > > > completely vmpressure free. I have to admit that the effect of > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to > > > mention whether it should or shouldn't be triggered for the user > > > triggered memory reclaim. So this would really need some explanation. > > > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm: > > memcontrol: hook up vmpressure to socket pressure"). A quick look at > > the commit log and the code suggests that this is used all over the > > socket and tcp code to throttles the memory consumption of the > > networking layer if we are under pressure. > > > > However, for proactive reclaim like memory.reclaim, the target is to > > probe the memcg for cold memory. Reclaiming such memory should not > > have a visible effect on the workload performance. I don't think that > > any network throttling side effects are correct here. > > IIUC, this change is fixing two mechanisms during userspace-induced > memory pressure: > 1. psi accounting, which I think is not controversial and makes sense to me; > 2. vmpressure signal, which is a "kinda" obsolete interface and might > be viewed as controversial. > I would suggest splitting the patch into two, first to fix psi > accounting and second to fix vmpressure signal. 
This way the first one > (probably the bigger of the two) can be reviewed and accepted easily > while debates continue on the second one. This change should be NOP for psi. psi was already fixed by e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") by Johannes a while ago. This patch does the same for vmpressure, but in a different way, as the same approach of e22c6ed90aa9 cannot be used. The changes you are seeing in this patch for psi are basically reverting e22c6ed90aa9 and using the newly introduced flag that handles vmpressure to handle psi as well, to avoid having two separate ways to address accounting memory pressure during userspace-induced reclaim. > > > > > > > > > > Though IMO we should deprecate vmpressure altogether. > > > > > > Yes it should be really limited to v1. But as I've said the effect on > > > mem_cgroup_under_socket_pressure is not really clear to me. It really > > > seems the v2 support has been introduced deliberately. > > > > > > -- > > > Michal Hocko > > > SUSE Labs
On Fri, Jun 24, 2022 at 3:14 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Jun 24, 2022 at 3:10 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Thu, Jun 23, 2022 at 10:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > > > > [...] > > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > > > > exists in the upstream version so that consolidating them does not > > > > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > > > > pressure due to proactive reclaim. > > > > > > > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > > > > some special casing which will add a maintenance overhead. > > > > > > > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > > > > sense. > > > > > > > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > > > > interface. So I do not see how pre-existing users of the upstream kernel > > > > > > can see any breakage. > > > > > > > > > > > > > > > > Please note that vmpressure is still being used in v2 by the > > > > > networking layer (see mem_cgroup_under_socket_pressure()) for > > > > > detecting memory pressure. > > > > > > > > I have missed this. It is hidden quite good. I thought that v2 is > > > > completely vmpressure free. I have to admit that the effect of > > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to > > > > mention whether it should or shouldn't be triggered for the user > > > > triggered memory reclaim. So this would really need some explanation. > > > > > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm: > > > memcontrol: hook up vmpressure to socket pressure"). A quick look at > > > the commit log and the code suggests that this is used all over the > > > socket and tcp code to throttles the memory consumption of the > > > networking layer if we are under pressure. > > > > > > However, for proactive reclaim like memory.reclaim, the target is to > > > probe the memcg for cold memory. Reclaiming such memory should not > > > have a visible effect on the workload performance. I don't think that > > > any network throttling side effects are correct here. > > > > IIUC, this change is fixing two mechanisms during userspace-induced > > memory pressure: > > 1. psi accounting, which I think is not controversial and makes sense to me; > > 2. vmpressure signal, which is a "kinda" obsolete interface and might > > be viewed as controversial. 
> > I would suggest splitting the patch into two, first to fix psi > > accounting and second to fix vmpressure signal. This way the first one > > (probably the bigger of the two) can be reviewed and accepted easily > > while debates continue on the second one. > > This change should be NOP for psi. psi was already fixed by > e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim > as memory pressure") by Johannes a while ago. This patch does the same > for vmpressure, but in a different way, as the same approach of > e22c6ed90aa9 cannot be used. > > The changes you are seeing in this patch for psi are basically > reverting e22c6ed90aa9 and using the newly introduced flag that > handles vmpressure to handle psi as well, to avoid having two separate > ways to address accounting memory pressure during userspace-induced > reclaim. Ah, I see. Thanks for clarifying that. > > > > > > > > > > > > > > > Though IMO we should deprecate vmpressure altogether. > > > > > > > > Yes it should be really limited to v1. But as I've said the effect on > > > > mem_cgroup_under_socket_pressure is not really clear to me. It really > > > > seems the v2 support has been introduced deliberately. > > > > > > > > -- > > > > Michal Hocko > > > > SUSE Labs
On Thu 23-06-22 10:26:11, Yosry Ahmed wrote: > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > > [...] > > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > > exists in the upstream version so that consolidating them does not > > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > > pressure due to proactive reclaim. > > > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > > some special casing which will add a maintenance overhead. > > > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > > sense. > > > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > > interface. So I do not see how pre-existing users of the upstream kernel > > > > can see any breakage. > > > > > > > > > > Please note that vmpressure is still being used in v2 by the > > > networking layer (see mem_cgroup_under_socket_pressure()) for > > > detecting memory pressure. > > > > I have missed this. It is hidden quite good. I thought that v2 is > > completely vmpressure free. I have to admit that the effect of > > mem_cgroup_under_socket_pressure is not really clear to me. Not to > > mention whether it should or shouldn't be triggered for the user > > triggered memory reclaim. So this would really need some explanation. > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm: > memcontrol: hook up vmpressure to socket pressure"). A quick look at > the commit log and the code suggests that this is used all over the > socket and tcp code to throttles the memory consumption of the > networking layer if we are under pressure. > > However, for proactive reclaim like memory.reclaim, the target is to > probe the memcg for cold memory. Reclaiming such memory should not > have a visible effect on the workload performance. I don't think that > any network throttling side effects are correct here. Please describe the user visible effects of this change. IIUC this is changing the vmpressure semantic for pre-existing users (v1 when setting the hard limit for example) and it really should be explained why this is good for them after those years. I do not see any actual bug being described explicitly so please make sure this is all properly documented.
On Mon, Jun 27, 2022 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 23-06-22 10:26:11, Yosry Ahmed wrote: > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > > > [...] > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > > > exists in the upstream version so that consolidating them does not > > > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > > > pressure due to proactive reclaim. > > > > > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > > > some special casing which will add a maintenance overhead. > > > > > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > > > sense. > > > > > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > > > interface. So I do not see how pre-existing users of the upstream kernel > > > > > can see any breakage. > > > > > > > > > > > > > Please note that vmpressure is still being used in v2 by the > > > > networking layer (see mem_cgroup_under_socket_pressure()) for > > > > detecting memory pressure. > > > > > > I have missed this. It is hidden quite good. I thought that v2 is > > > completely vmpressure free. I have to admit that the effect of > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to > > > mention whether it should or shouldn't be triggered for the user > > > triggered memory reclaim. So this would really need some explanation. > > > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm: > > memcontrol: hook up vmpressure to socket pressure"). A quick look at > > the commit log and the code suggests that this is used all over the > > socket and tcp code to throttles the memory consumption of the > > networking layer if we are under pressure. > > > > However, for proactive reclaim like memory.reclaim, the target is to > > probe the memcg for cold memory. Reclaiming such memory should not > > have a visible effect on the workload performance. I don't think that > > any network throttling side effects are correct here. > > Please describe the user visible effects of this change. IIUC this is > changing the vmpressure semantic for pre-existing users (v1 when setting > the hard limit for example) and it really should be explained why > this is good for them after those years. I do not see any actual bug > being described explicitly so please make sure this is all properly > documented. 
In cgroup v1, user-induced reclaim that is caused by limit-setting (or memory.reclaim for systems that choose to expose it in cgroup v1) will no longer cause vmpressure notifications, which makes the vmpressure behavior consistent with the current psi behavior. In cgroup v2, user-induced reclaim (limit-setting, memory.reclaim, ..) would currently cause the networking layer to perceive the memcg as being under memory pressure, reducing memory consumption and possibly causing throttling. This patch makes the networking layer only perceive the memcg as being under pressure when the "pressure" is caused by increased memory usage, not limit-setting or proactive reclaim, which also makes the definition of memcg memory pressure consistent with psi today. In short, the purpose of this patch is to unify the definition of memcg memory pressure across psi and vmpressure (which indirectly also defines the definition of memcg memory pressure for the networking layer). If this sounds good to you, I can add this explanation to the commit log, and possibly anywhere you see appropriate in the code/docs. > > -- > Michal Hocko > SUSE Labs
On Mon 27-06-22 01:39:46, Yosry Ahmed wrote: > On Mon, Jun 27, 2022 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Thu 23-06-22 10:26:11, Yosry Ahmed wrote: > > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > > > > [...] > > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > > > > exists in the upstream version so that consolidating them does not > > > > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > > > > pressure due to proactive reclaim. > > > > > > > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > > > > some special casing which will add a maintenance overhead. > > > > > > > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > > > > sense. > > > > > > > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > > > > interface. So I do not see how pre-existing users of the upstream kernel > > > > > > can see any breakage. > > > > > > > > > > > > > > > > Please note that vmpressure is still being used in v2 by the > > > > > networking layer (see mem_cgroup_under_socket_pressure()) for > > > > > detecting memory pressure. > > > > > > > > I have missed this. It is hidden quite good. I thought that v2 is > > > > completely vmpressure free. I have to admit that the effect of > > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to > > > > mention whether it should or shouldn't be triggered for the user > > > > triggered memory reclaim. So this would really need some explanation. > > > > > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm: > > > memcontrol: hook up vmpressure to socket pressure"). A quick look at > > > the commit log and the code suggests that this is used all over the > > > socket and tcp code to throttles the memory consumption of the > > > networking layer if we are under pressure. > > > > > > However, for proactive reclaim like memory.reclaim, the target is to > > > probe the memcg for cold memory. Reclaiming such memory should not > > > have a visible effect on the workload performance. I don't think that > > > any network throttling side effects are correct here. > > > > Please describe the user visible effects of this change. IIUC this is > > changing the vmpressure semantic for pre-existing users (v1 when setting > > the hard limit for example) and it really should be explained why > > this is good for them after those years. 
I do not see any actual bug > > being described explicitly so please make sure this is all properly > > documented. > > In cgroup v1, user-induced reclaim that is caused by limit-setting (or > memory.reclaim for systems that choose to expose it in cgroup v1) will > no longer cause vmpressure notifications, which makes the vmpressure > behavior consistent with the current psi behavior. Yes it makes the behavior consistent with PSI. But is this what existing users really want or need? This is a user visible long term behavior change for a legacy interface and there should be a very good reason to change that. > In cgroup v2, user-induced reclaim (limit-setting, memory.reclaim, ..) > would currently cause the networking layer to perceive the memcg as > being under memory pressure, reducing memory consumption and possibly > causing throttling. This patch makes the networking layer only > perceive the memcg as being under pressure when the "pressure" is > caused by increased memory usage, not limit-setting or proactive > reclaim, which also makes the definition of memcg memory pressure > consistent with psi today. I do understand the argument about the pro-active reclaim. memory.reclaim is a new interface and it a) makes sense to exclude it from different memory pressure notification interfaces and b) there are unlikely too many user applications depending on the exact behavior so changes are still rather low on the risk scale. > In short, the purpose of this patch is to unify the definition of > memcg memory pressure across psi and vmpressure (which indirectly also > defines the definition of memcg memory pressure for the networking > layer). If this sounds good to you, I can add this explanation to the > commit log, and possibly anywhere you see appropriate in the > code/docs. The consistency on its own sounds like a very weak argument to change a long term behavior. I do not really see any serious arguments or evaluation what kind of fallout this change can have on old applications that are still sticking with v1. After it has been made clear that the vmpressure is still used for the pro-active reclaim in v2 I do agree that this is likely something we want to have addressed. But I wouldn't touch v1 semantics as this doesn't really buy much and it can potentially break existing users.
On Mon, Jun 27, 2022 at 2:20 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 27-06-22 01:39:46, Yosry Ahmed wrote: > > On Mon, Jun 27, 2022 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Thu 23-06-22 10:26:11, Yosry Ahmed wrote: > > > > On Thu, Jun 23, 2022 at 10:04 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > On Thu 23-06-22 09:42:43, Shakeel Butt wrote: > > > > > > On Thu, Jun 23, 2022 at 9:37 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > On Thu 23-06-22 09:22:35, Yosry Ahmed wrote: > > > > > > > > On Thu, Jun 23, 2022 at 2:43 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > > > > > > > > > > > On Thu 23-06-22 01:35:59, Yosry Ahmed wrote: > > > > > > > [...] > > > > > > > > > > In our internal version of memory.reclaim that we recently upstreamed, > > > > > > > > > > we do not account vmpressure during proactive reclaim (similar to how > > > > > > > > > > psi is handled upstream). We want to make sure this behavior also > > > > > > > > > > exists in the upstream version so that consolidating them does not > > > > > > > > > > break our users who rely on vmpressure and will start seeing increased > > > > > > > > > > pressure due to proactive reclaim. > > > > > > > > > > > > > > > > > > These are good reasons to have this patch in your tree. But why is this > > > > > > > > > patch benefitial for the upstream kernel? It clearly adds some code and > > > > > > > > > some special casing which will add a maintenance overhead. > > > > > > > > > > > > > > > > It is not just Google, any existing vmpressure users will start seeing > > > > > > > > false pressure notifications with memory.reclaim. The main goal of the > > > > > > > > patch is to make sure memory.reclaim does not break pre-existing users > > > > > > > > of vmpressure, and doing it in a way that is consistent with psi makes > > > > > > > > sense. > > > > > > > > > > > > > > memory.reclaim is v2 only feature which doesn't have vmpressure > > > > > > > interface. So I do not see how pre-existing users of the upstream kernel > > > > > > > can see any breakage. > > > > > > > > > > > > > > > > > > > Please note that vmpressure is still being used in v2 by the > > > > > > networking layer (see mem_cgroup_under_socket_pressure()) for > > > > > > detecting memory pressure. > > > > > > > > > > I have missed this. It is hidden quite good. I thought that v2 is > > > > > completely vmpressure free. I have to admit that the effect of > > > > > mem_cgroup_under_socket_pressure is not really clear to me. Not to > > > > > mention whether it should or shouldn't be triggered for the user > > > > > triggered memory reclaim. So this would really need some explanation. > > > > > > > > vmpressure was tied into socket pressure by 8e8ae645249b ("mm: > > > > memcontrol: hook up vmpressure to socket pressure"). A quick look at > > > > the commit log and the code suggests that this is used all over the > > > > socket and tcp code to throttles the memory consumption of the > > > > networking layer if we are under pressure. > > > > > > > > However, for proactive reclaim like memory.reclaim, the target is to > > > > probe the memcg for cold memory. Reclaiming such memory should not > > > > have a visible effect on the workload performance. I don't think that > > > > any network throttling side effects are correct here. > > > > > > Please describe the user visible effects of this change. 
IIUC this is > > > changing the vmpressure semantic for pre-existing users (v1 when setting > > > the hard limit for example) and it really should be explained why > > > this is good for them after those years. I do not see any actual bug > > > being described explicitly so please make sure this is all properly > > > documented. > > > > In cgroup v1, user-induced reclaim that is caused by limit-setting (or > > memory.reclaim for systems that choose to expose it in cgroup v1) will > > no longer cause vmpressure notifications, which makes the vmpressure > > behavior consistent with the current psi behavior. > > Yes it makes the behavior consistent with PSI. But is this what existing > users really want or need? This is a user visible long term behavior > change for a legacy interface and there should be a very good reason to > change that. > > > In cgroup v2, user-induced reclaim (limit-setting, memory.reclaim, ..) > > would currently cause the networking layer to perceive the memcg as > > being under memory pressure, reducing memory consumption and possibly > > causing throttling. This patch makes the networking layer only > > perceive the memcg as being under pressure when the "pressure" is > > caused by increased memory usage, not limit-setting or proactive > > reclaim, which also makes the definition of memcg memory pressure > > consistent with psi today. > > I do understand the argument about the pro-active reclaim. > memory.reclaim is a new interface and it a) makes sense to exclude it > from different memory pressure notification interfaces and b) there are > unlikely too many user applications depending on the exact behavior so > changes are still rather low on the risk scale. > > > In short, the purpose of this patch is to unify the definition of > > memcg memory pressure across psi and vmpressure (which indirectly also > > defines the definition of memcg memory pressure for the networking > > layer). If this sounds good to you, I can add this explanation to the > > commit log, and possibly anywhere you see appropriate in the > > code/docs. > > The consistency on its own sounds like a very weak argument to change a > long term behavior. I do not really see any serious arguments or > evaluation what kind of fallout this change can have on old applications > that are still sticking with v1. > > After it has been made clear that the vmpressure is still used for the > pro-active reclaim in v2 I do agree that this is likely something we > want to have addressed. But I wouldn't touch v1 semantics as this > doesn't really buy much and it can potentially break existing users. > Understood, and fair enough. There are 3 behavioral changes in this patch. (a) Do not count vmpressure for mem_cgroup_resize_max() and mem_cgroup_force_empty() in v1. (b) Do not count vmpressure (consequently, mem_cgroup_under_socket_pressure()) in v2 where psi is not counted (writing to memory.max, memory.high, and memory.reclaim). Do you want us to drop (a) and keep (b) ? or do you want to further break down (b) to only limit the change to proactive reclaim through memory.reclaim (IOW keep socket pressure on limit-setting although it is not considered pressure in terms of psi) ? > -- > Michal Hocko > SUSE Labs
On Mon 27-06-22 02:39:49, Yosry Ahmed wrote: [...] > (a) Do not count vmpressure for mem_cgroup_resize_max() and > mem_cgroup_force_empty() in v1. yes, unless you have a very good reason to change that. E.g. this has been buggy and we have finally understood that. But I do not see any indications so far. > (b) Do not count vmpressure (consequently, > mem_cgroup_under_socket_pressure()) in v2 where psi is not counted > (writing to memory.max, memory.high, and memory.reclaim). I can see clear arguments for memory.reclaim opt out for vmpressure because we have established that this is not a measure to express a memory pressure on the cgroup. Max/High are less clear to me, TBH. I do understand reasoning for PSI exclusion because considering the calling process to be stalled and non-productive is misleading. It just does its work so in a way it is a productive time in the end. For the vmpressure, which measures how hard/easy it is to reclaim memory, why should this be special for this particular reclaim? Again, an explanation of the effect on the socket pressure could give a better picture. Say that somebody reduces the limit (hard/high) and it takes quite some effort to shrink the consumption down. Should the networking layer react to that in any way or should it wait for the active allocation during that process to find that out?
On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 27-06-22 02:39:49, Yosry Ahmed wrote: > [...] > > (a) Do not count vmpressure for mem_cgroup_resize_max() and > > mem_cgroup_force_empty() in v1. > > yes, unless you have a very good reason to change that. E.g. this has > been buggy and we have finally understood that. But I do not see any > indications so far. I don't have any bug reports. It makes sense that users do not expect vmpressure notifications when they resize the limits below the current usage, because it should be expected that reclaim will happen, so receiving notifications here is redundant and may be incorrectly perceived by a different user space thread as being under memory pressure. But I get your point that what the user sees as memory pressure or not could be different, and is probably already defined by the current behavior anyway, whether it makes sense or not. I can also see some userspace applications depending on this behavior in some way, either by handling that limit resize notification in a certain way or deliberately dropping it. Either way, making this change could throw them off. I don't expect any userspace applications to crash of course (because there are cases where they won't receive notifications, e.g. scanned < vmpressure_win), but perhaps it's not worth even risking misguiding them. So I agree that just because it doesn't make sense or is inconsistent with other definitions of behavior doesn't mean we can make a visible change for userspace. I will drop the v1 changes in the next version anyway. Thanks! > > > (b) Do not count vmpressure (consequently, > > mem_cgroup_under_socket_pressure()) in v2 where psi is not counted > > (writing to memory.max, memory.high, and memory.reclaim). > > I can see clear arguments for memory.reclaim opt out for vmpressure > because we have established that this is not a measure to express a > memory pressure on the cgroup. > > Max/High are less clear to me, TBH. I do understand reasoning for PSI > exclusion because considering the calling process to be stalled and > non-productive is misleading. It just does its work so in a way it is > a productive time in the end. For the vmpressure, which measures how > hard/easy it is to reclaim memory why this should special for this > particular reclaim? > > Again, an explanation of the effect on the socket pressure could give a > better picture. Say that I somebody reduces the limit (hard/high) and it > takes quite some effort to shrink the consumption down. Should the > networking layer react to that in any way or should it wait for the > active allocation during that process to find that out? I am out of my depth here. Any answer on my side would be purely speculation at this point. Shakeel, can you help us here or tag some networking people? Thanks! > -- > Michal Hocko > SUSE Labs
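As background for the "scanned < vmpressure_win" remark above: vmpressure only computes and reports a level once a full window of scanned pages has accumulated, so a notification is never guaranteed for any individual reclaim pass. The toy calculation below models that behavior in userspace; the window size, thresholds, and formula follow my reading of mm/vmpressure.c and should be treated as approximate rather than authoritative.

/*
 * Userspace model of the vmpressure windowing/level logic, for
 * illustration only. Constants mirror mm/vmpressure.c as I recall
 * them (window = 16 * SWAP_CLUSTER_MAX pages, medium at 60%,
 * critical at 95%); the kernel implementation is the reference.
 */
#include <stdio.h>

#define SWAP_CLUSTER_MAX	32UL
#define VMPRESSURE_WIN		(SWAP_CLUSTER_MAX * 16)	/* 512 pages */

static const char *vmpressure_level(unsigned long scanned,
				    unsigned long reclaimed)
{
	unsigned long pressure;

	if (scanned < VMPRESSURE_WIN)
		return "no event (window not filled)";
	if (reclaimed >= scanned)	/* guard against underflow in the model */
		return "low";
	/* "pressure" is the fraction of scanned pages we failed to reclaim. */
	pressure = 100 - (100 * reclaimed / scanned);
	if (pressure >= 95)
		return "critical";
	if (pressure >= 60)
		return "medium";
	return "low";
}

int main(void)
{
	printf("%s\n", vmpressure_level(300, 250));	/* below the window */
	printf("%s\n", vmpressure_level(1024, 900));	/* easy reclaim */
	printf("%s\n", vmpressure_level(1024, 300));	/* medium pressure */
	printf("%s\n", vmpressure_level(1024, 10));	/* critical */
	return 0;
}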
On Mon, Jun 27, 2022 at 10:04 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote: > > [...] > > > > I can see clear arguments for memory.reclaim opt out for vmpressure > > because we have established that this is not a measure to express a > > memory pressure on the cgroup. > > > > Max/High are less clear to me, TBH. I do understand reasoning for PSI > > exclusion because considering the calling process to be stalled and > > non-productive is misleading. It just does its work so in a way it is > > a productive time in the end. For the vmpressure, which measures how > > hard/easy it is to reclaim memory why this should special for this > > particular reclaim? > > > > Again, an explanation of the effect on the socket pressure could give a > > better picture. Say that I somebody reduces the limit (hard/high) and it > > takes quite some effort to shrink the consumption down. Should the > > networking layer react to that in any way or should it wait for the > > active allocation during that process to find that out? > > I am out of my depth here. Any answer on my side would be purely > speculation at this point. Shakeel, can you help us here or tag some > networking people? So, the effects of returning true from mem_cgroup_under_socket_pressure() are: 1. Reducing send and receive buffers of the current socket. 2. May drop packets on the rx path. 3. May throttle the current thread on the tx path. Now, regarding the reclaim behavior due to reducing max or high, I think the kernel should not ignore vmpressure. Please note that unlike PSI, which is associated with the current process, vmpressure is associated with the target memcg. So, any reclaim on that memcg due to real shortage of memory should not be ignored. That reclaim can be global reclaim, limit reclaim of an ancestor or of the memcg itself, or reclaim due to lowering the limit of an ancestor or of the memcg itself.
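To make the mechanism behind those three effects concrete: on the default hierarchy the socket code does not talk to vmpressure listeners at all. Instead, vmpressure arms a short "socket pressure" window on the memcg when reclaim looks inefficient, and mem_cgroup_under_socket_pressure() simply checks whether that window is still open. The toy model below illustrates that flow; the names and types are simplified stand-ins rather than the kernel's, and the one-second window is my recollection of the hysteresis used upstream.

/*
 * Toy model of how vmpressure feeds mem_cgroup_under_socket_pressure()
 * on cgroup v2, as I understand it: reclaim that finds memory hard to
 * reclaim arms a ~1s "socket pressure" window on the memcg, and the
 * networking side consults that window when growing socket buffers.
 * This is an illustration of the mechanism being discussed, not
 * kernel code.
 */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct memcg_model {
	time_t socket_pressure;	/* window end, like memcg->socket_pressure */
};

/* Reclaim side, mirroring what vmpressure() does when level > low. */
static void record_vmpressure(struct memcg_model *memcg, bool above_low)
{
	if (above_low)
		memcg->socket_pressure = time(NULL) + 1;	/* ~1s window */
}

/* Socket side, mirroring mem_cgroup_under_socket_pressure(). */
static bool under_socket_pressure(const struct memcg_model *memcg)
{
	return time(NULL) < memcg->socket_pressure;
}

int main(void)
{
	struct memcg_model memcg = { 0 };

	printf("before reclaim: %d\n", under_socket_pressure(&memcg));
	record_vmpressure(&memcg, true);	/* hard reclaim just happened */
	printf("right after:    %d\n", under_socket_pressure(&memcg));
	return 0;
}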
On Wed, Jun 29, 2022 at 6:07 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Mon, Jun 27, 2022 at 10:04 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote: > > > > [...] > > > > > > I can see clear arguments for memory.reclaim opt out for vmpressure > > > because we have established that this is not a measure to express a > > > memory pressure on the cgroup. > > > > > > Max/High are less clear to me, TBH. I do understand reasoning for PSI > > > exclusion because considering the calling process to be stalled and > > > non-productive is misleading. It just does its work so in a way it is > > > a productive time in the end. For the vmpressure, which measures how > > > hard/easy it is to reclaim memory why this should special for this > > > particular reclaim? > > > > > > Again, an explanation of the effect on the socket pressure could give a > > > better picture. Say that I somebody reduces the limit (hard/high) and it > > > takes quite some effort to shrink the consumption down. Should the > > > networking layer react to that in any way or should it wait for the > > > active allocation during that process to find that out? > > > > I am out of my depth here. Any answer on my side would be purely > > speculation at this point. Shakeel, can you help us here or tag some > > networking people? > > So, the effect of returning true from mem_cgroup_under_socket_pressure() are: > > 1. Reducing send and receive buffers of the current socket. > 2. May drop packets on the rx path. > 3. May throttle current thread on the tx path. > > Now regarding the behavior from the reclaim due to reducing max or > high, I think the kernel should not ignore vmpressure. Please note > that unlike PSI which is associated with the current process, > vmpressure is associated with the target memcg. So, any reclaim on > that memcg due to real shortage of memory should not be ignored. That > reclaim can be global reclaim or limit reclaim of ancestor or itself > or reclaim due to lowering the limit of ancestor or itself. So it seems like we should only ignore vmpressure for proactive reclaim (aka memory.reclaim). Michal, let me know what you think here, I can drop psi and limit-setting changes in v3 and basically just ignore vmpressure for memory.reclaim (MEMCG_RECLAIM_PROACTIVE / sc->proactive instead of MEMCG_RECLAIM_CONTROLLED / sc->controlled maybe).
On Wed 29-06-22 19:08:42, Yosry Ahmed wrote: > On Wed, Jun 29, 2022 at 6:07 PM Shakeel Butt <shakeelb@google.com> wrote: > > > > On Mon, Jun 27, 2022 at 10:04 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Mon, Jun 27, 2022 at 5:31 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > [...] > > > > > > > > I can see clear arguments for memory.reclaim opt out for vmpressure > > > > because we have established that this is not a measure to express a > > > > memory pressure on the cgroup. > > > > > > > > Max/High are less clear to me, TBH. I do understand reasoning for PSI > > > > exclusion because considering the calling process to be stalled and > > > > non-productive is misleading. It just does its work so in a way it is > > > > a productive time in the end. For the vmpressure, which measures how > > > > hard/easy it is to reclaim memory why this should special for this > > > > particular reclaim? > > > > > > > > Again, an explanation of the effect on the socket pressure could give a > > > > better picture. Say that I somebody reduces the limit (hard/high) and it > > > > takes quite some effort to shrink the consumption down. Should the > > > > networking layer react to that in any way or should it wait for the > > > > active allocation during that process to find that out? > > > > > > I am out of my depth here. Any answer on my side would be purely > > > speculation at this point. Shakeel, can you help us here or tag some > > > networking people? > > > > So, the effect of returning true from mem_cgroup_under_socket_pressure() are: > > > > 1. Reducing send and receive buffers of the current socket. > > 2. May drop packets on the rx path. > > 3. May throttle current thread on the tx path. > > > > Now regarding the behavior from the reclaim due to reducing max or > > high, I think the kernel should not ignore vmpressure. Please note > > that unlike PSI which is associated with the current process, > > vmpressure is associated with the target memcg. So, any reclaim on > > that memcg due to real shortage of memory should not be ignored. That > > reclaim can be global reclaim or limit reclaim of ancestor or itself > > or reclaim due to lowering the limit of ancestor or itself. > > So it seems like we should only ignore vmpressure for proactive > reclaim (aka memory.reclaim). > > Michal, let me know what you think here, I can drop psi and > limit-setting changes in v3 and basically just ignore vmpressure for > memory.reclaim (MEMCG_RECLAIM_PROACTIVE / sc->proactive instead of > MEMCG_RECLAIM_CONTROLLED / sc->controlled maybe). Yes, that makes much more sense to me.
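For reference, here is a rough sketch of what the narrowed change agreed on above might look like, using the MEMCG_RECLAIM_PROACTIVE / sc->proactive names suggested earlier in the thread. This is an illustration of the idea, not the actual v3 patch; flag values and exact call sites are guesses.

/* Sketch only: flag proactive reclaim and skip vmpressure just for it. */

/* include/linux/swap.h */
#define MEMCG_RECLAIM_MAY_SWAP		(1 << 1)
#define MEMCG_RECLAIM_PROACTIVE		(1 << 2)

/* mm/memcontrol.c: memory_reclaim() would be the only caller setting it. */
	reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
	reclaimed = try_to_free_mem_cgroup_pages(memcg,
					nr_to_reclaim - nr_reclaimed,
					GFP_KERNEL, reclaim_options);

/* mm/vmscan.c: carry the flag in scan_control ... */
struct scan_control {
	/* ... */
	unsigned int proactive:1;	/* proactive reclaim (memory.reclaim) */
	/* ... */
};

/*
 * ... and gate the vmpressure calls on it, leaving psi and all other
 * reclaim paths (limit-setting, kswapd, direct reclaim) untouched.
 */
	if (!sc->proactive)
		vmpressure(sc->gfp_mask, memcg, false,
			   sc->nr_scanned - scanned,
			   sc->nr_reclaimed - reclaimed);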
diff --git a/include/linux/swap.h b/include/linux/swap.h index 0c0fed1b348f2..5a6766e417afe 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -411,10 +411,13 @@ extern void lru_cache_add_inactive_or_unevictable(struct page *page, extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); + +#define MEMCG_RECLAIM_MAY_SWAP (1 << 1) +#define MEMCG_RECLAIM_CONTROLLED (1 << 2) extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - bool may_swap); + unsigned int reclaim_options); extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, pg_data_t *pgdat, diff --git a/mm/memcontrol.c b/mm/memcontrol.c index abec50f31fe64..a76bb7ae76f73 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -2319,20 +2319,16 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg, gfp_t gfp_mask) { unsigned long nr_reclaimed = 0; + unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP; do { - unsigned long pflags; - if (page_counter_read(&memcg->memory) <= READ_ONCE(memcg->memory.high)) continue; - memcg_memory_event(memcg, MEMCG_HIGH); - - psi_memstall_enter(&pflags); nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages, - gfp_mask, true); - psi_memstall_leave(&pflags); + gfp_mask, + reclaim_options); } while ((memcg = parent_mem_cgroup(memcg)) && !mem_cgroup_is_root(memcg)); @@ -2576,9 +2572,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, struct page_counter *counter; unsigned long nr_reclaimed; bool passed_oom = false; - bool may_swap = true; + unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP; bool drained = false; - unsigned long pflags; retry: if (consume_stock(memcg, nr_pages)) @@ -2593,7 +2588,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, mem_over_limit = mem_cgroup_from_counter(counter, memory); } else { mem_over_limit = mem_cgroup_from_counter(counter, memsw); - may_swap = false; + reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP; } if (batch > nr_pages) { @@ -2618,10 +2613,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, memcg_memory_event(mem_over_limit, MEMCG_MAX); - psi_memstall_enter(&pflags); nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages, - gfp_mask, may_swap); - psi_memstall_leave(&pflags); + gfp_mask, reclaim_options); if (mem_cgroup_margin(mem_over_limit) >= nr_pages) goto retry; @@ -3369,7 +3362,9 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, int ret; bool limits_invariant; struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory; + unsigned int reclaim_options = memsw ? 
0 : MEMCG_RECLAIM_MAY_SWAP; + reclaim_options |= MEMCG_RECLAIM_CONTROLLED; do { if (signal_pending(current)) { ret = -EINTR; @@ -3403,7 +3398,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg, } if (!try_to_free_mem_cgroup_pages(memcg, 1, - GFP_KERNEL, !memsw)) { + GFP_KERNEL, reclaim_options)) { ret = -EBUSY; break; } @@ -3502,6 +3497,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order, static int mem_cgroup_force_empty(struct mem_cgroup *memcg) { int nr_retries = MAX_RECLAIM_RETRIES; + unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED | + MEMCG_RECLAIM_MAY_SWAP; /* we call try-to-free pages for make this cgroup empty */ lru_add_drain_all(); @@ -3513,7 +3510,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg) if (signal_pending(current)) return -EINTR; - if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true)) + if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, + reclaim_options)) nr_retries--; } @@ -6215,6 +6213,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, unsigned int nr_retries = MAX_RECLAIM_RETRIES; bool drained = false; unsigned long high; + unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED | + MEMCG_RECLAIM_MAY_SWAP; int err; buf = strstrip(buf); @@ -6241,7 +6241,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, } reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high, - GFP_KERNEL, true); + GFP_KERNEL, reclaim_options); if (!reclaimed && !nr_retries--) break; @@ -6264,6 +6264,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, unsigned int nr_reclaims = MAX_RECLAIM_RETRIES; bool drained = false; unsigned long max; + unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED | + MEMCG_RECLAIM_MAY_SWAP; int err; buf = strstrip(buf); @@ -6290,7 +6292,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (nr_reclaims) { if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max, - GFP_KERNEL, true)) + GFP_KERNEL, reclaim_options)) nr_reclaims--; continue; } @@ -6419,6 +6421,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); unsigned int nr_retries = MAX_RECLAIM_RETRIES; unsigned long nr_to_reclaim, nr_reclaimed = 0; + unsigned int reclaim_options = MEMCG_RECLAIM_CONTROLLED | + MEMCG_RECLAIM_MAY_SWAP; int err; buf = strstrip(buf); @@ -6442,7 +6446,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim - nr_reclaimed, - GFP_KERNEL, true); + GFP_KERNEL, reclaim_options); if (!reclaimed && !nr_retries--) return -EAGAIN; diff --git a/mm/vmscan.c b/mm/vmscan.c index f7d9a683e3a7d..6efe7660f7f78 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -102,6 +102,9 @@ struct scan_control { /* Can pages be swapped as part of reclaim? */ unsigned int may_swap:1; + /* Reclaim is controlled by userspace */ + unsigned int controlled:1; + /* * Cgroup memory below memory.low is protected as long as we * don't threaten to OOM. 
If any cgroup is reclaimed at @@ -3125,9 +3128,10 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) sc->priority); /* Record the group's reclaim efficiency */ - vmpressure(sc->gfp_mask, memcg, false, - sc->nr_scanned - scanned, - sc->nr_reclaimed - reclaimed); + if (!sc->controlled) + vmpressure(sc->gfp_mask, memcg, false, + sc->nr_scanned - scanned, + sc->nr_reclaimed - reclaimed); } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); } @@ -3250,9 +3254,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) } /* Record the subtree's reclaim efficiency */ - vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, - sc->nr_scanned - nr_scanned, - sc->nr_reclaimed - nr_reclaimed); + if (!sc->controlled) + vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true, + sc->nr_scanned - nr_scanned, + sc->nr_reclaimed - nr_reclaimed); if (sc->nr_reclaimed - nr_reclaimed) reclaimable = true; @@ -3534,8 +3539,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, __count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1); do { - vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, - sc->priority); + if (!sc->controlled) + vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup, + sc->priority); sc->nr_scanned = 0; shrink_zones(zonelist, sc); @@ -3751,6 +3757,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .may_writepage = !laptop_mode, .may_unmap = 1, .may_swap = 1, + .controlled = 0, }; /* @@ -3825,10 +3832,12 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg, unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, - bool may_swap) + unsigned int reclaim_options) { unsigned long nr_reclaimed; + unsigned long pflags; unsigned int noreclaim_flag; + bool controlled_reclaim = reclaim_options & MEMCG_RECLAIM_CONTROLLED; struct scan_control sc = { .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) | @@ -3838,7 +3847,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, .priority = DEF_PRIORITY, .may_writepage = !laptop_mode, .may_unmap = 1, - .may_swap = may_swap, + .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), + .controlled = controlled_reclaim, }; /* * Traverse the ZONELIST_FALLBACK zonelist of the current node to put @@ -3848,12 +3858,19 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); set_task_reclaim_state(current, &sc.reclaim_state); + trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask); + + if (!controlled_reclaim) + psi_memstall_enter(&pflags); noreclaim_flag = memalloc_noreclaim_save(); nr_reclaimed = do_try_to_free_pages(zonelist, &sc); memalloc_noreclaim_restore(noreclaim_flag); + if (!controlled_reclaim) + psi_memstall_leave(&pflags); + trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); set_task_reclaim_state(current, NULL); @@ -4095,6 +4112,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx) .gfp_mask = GFP_KERNEL, .order = order, .may_unmap = 1, + .controlled = 0, }; set_task_reclaim_state(current, &sc.reclaim_state); @@ -4555,6 +4573,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) .may_unmap = 1, .may_swap = 1, .hibernation_mode = 1, + .controlled = 0, }; struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); unsigned long nr_reclaimed; @@ -4707,6 +4726,7 @@ static int __node_reclaim(struct 
pglist_data *pgdat, gfp_t gfp_mask, unsigned in .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP), .may_swap = 1, .reclaim_idx = gfp_zone(gfp_mask), + .controlled = 0, }; unsigned long pflags;
Commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") made sure that memory reclaim that is induced by userspace (limit-setting, proactive reclaim, ..) is not counted as memory pressure for the purposes of psi.

Instead of counting psi inside try_to_free_mem_cgroup_pages(), callers from try_charge() and reclaim_high() wrap the call to try_to_free_mem_cgroup_pages() with psi handlers.

However, vmpressure is still counted in these cases where reclaim is directly induced by userspace. This patch makes sure vmpressure is not counted in those operations, in the same way as psi. Since vmpressure calls need to happen deeper within the reclaim path, the same approach could not be followed. Hence, a new "controlled" flag is added to struct scan_control to flag a reclaim operation that is controlled by userspace. This flag is set by limit-setting and proactive reclaim operations, and is used to count vmpressure correctly.

To prevent future divergence of psi and vmpressure, commit e22c6ed90aa9 ("mm: memcontrol: don't count limit-setting reclaim as memory pressure") is effectively reverted and the same flag is used to control psi as well.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/swap.h |  5 ++++-
 mm/memcontrol.c      | 40 ++++++++++++++++++++++------------------
 mm/vmscan.c          | 40 ++++++++++++++++++++++++++++++----------
 3 files changed, 56 insertions(+), 29 deletions(-)
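On a kernel that has the memory.reclaim interface, the proactive-reclaim path discussed above can be exercised from userspace by writing a byte count to that file; the write fails (e.g. with EAGAIN) if the requested amount could not be reclaimed. A minimal sketch follows, with an assumed cgroup path.

/*
 * Trigger proactive ("controlled") reclaim on a cgroup v2 memcg by
 * writing a byte count to memory.reclaim. The path is an example;
 * adjust it to a real cgroup on a kernel that provides the file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/demo/memory.reclaim"; /* assumed */
	const char *amount = "104857600";	/* try to reclaim 100 MiB */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open memory.reclaim");
		return 1;
	}
	/* The write returns an error if the target could not be met. */
	if (write(fd, amount, strlen(amount)) < 0)
		perror("memory.reclaim");
	else
		puts("requested reclaim completed");
	close(fd);
	return 0;
}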