
[resend] memcg: introduce per-memcg reclaim interface

Message ID	20220331084151.2600229-1-yosryahmed@google.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
Date: Thu, 31 Mar 2022 08:41:51 +0000
Message-Id: <20220331084151.2600229-1-yosryahmed@google.com>
Mime-Version: 1.0
Subject: [PATCH resend] memcg: introduce per-memcg reclaim interface
From: Yosry Ahmed <yosryahmed@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>, Michal Hocko <mhocko@kernel.org>, Shakeel Butt <shakeelb@google.com>, Andrew Morton <akpm@linux-foundation.org>, David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>, Zefan Li <lizefan.x@bytedance.com>, Roman Gushchin <roman.gushchin@linux.dev>, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Jonathan Corbet <corbet@lwn.net>, Yu Zhao <yuzhao@google.com>, Dave Hansen <dave.hansen@linux.intel.com>, Wei Xu <weixugc@google.com>, Greg Thelen <gthelen@google.com>, Yosry Ahmed <yosryahmed@google.com>
Content-Type: text/plain; charset="UTF-8"
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Yosry Ahmed March 31, 2022, 8:41 a.m. UTC

From: Shakeel Butt <shakeelb@google.com>

Introduce a memcg interface to trigger memory reclaim on a memory cgroup.

Use case: Proactive Reclaim
---------------------------

A userspace proactive reclaimer can continuously probe the memcg to
reclaim a small amount of memory. This gives more accurate and
up-to-date workingset estimation as the LRUs are continuously
sorted and can potentially provide more deterministic memory
overcommit behavior. The memory overcommit controller can provide
more proactive response to the changing behavior of the running
applications instead of being reactive.

A userspace reclaimer's purpose in this case is not to completely replace
kswapd or direct reclaim; it is to proactively identify memory savings
opportunities and reclaim some amount of cold pages, as set by the policy,
to free up memory for more demanding jobs or for scheduling new jobs.

A user space proactive reclaimer is used in Google data centers.
Additionally, Meta's TMO paper recently referenced a very similar
interface used for user space proactive reclaim:
https://dl.acm.org/doi/pdf/10.1145/3503222.3507731

Benefits of a user space reclaimer:
-----------------------------------

1) More flexible on who should be charged for the cpu of the memory
reclaim. For proactive reclaim, it makes more sense to be centralized.

2) More flexible on dedicating the resources (like cpu). The memory
overcommit controller can balance the cost between the cpu usage and
the memory reclaimed.

3) Provides a way for applications to keep their LRUs sorted, so that
better reclaim candidates are selected under memory pressure. This also
gives a more accurate and up-to-date notion of the working set for an
application.

Why is memory.high not enough?
------------------------------

- memory.high can be used to trigger reclaim in a memcg and can
  potentially be used for proactive reclaim.
  However there is a big downside in using memory.high. It can potentially
  introduce high reclaim stalls in the target application as the
  allocations from the processes or the threads of the application can hit
  the temporary memory.high limit.

- Userspace proactive reclaimers usually use feedback loops to decide
  how much memory to proactively reclaim from a workload. The metrics
  used for this are usually either refaults or PSI, and these metrics
  will become messy if the application gets throttled by hitting the
  high limit.

- memory.high is a stateful interface: if the userspace proactive
  reclaimer crashes for any reason while triggering reclaim, it can leave
  the application in a bad state.

- If a workload is rapidly expanding, setting memory.high to proactively
  reclaim memory can result in actually reclaiming more memory than
  intended.

The benefits of such an interface and the shortcomings of the existing
interfaces were further discussed in this RFC thread:
https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
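
To make the feedback-loop idea above concrete, here is a minimal sketch of
how a userspace reclaimer might pick the size of its next reclaim request
from an observed pressure metric (for example a PSI average or a refault
rate). Everything in it is an assumption made for illustration: the
struct, the function name, and the halve-on-pressure / double-otherwise
policy are hypothetical and are not taken from this patch or from any
production reclaimer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy knobs; none of these names come from the patch. */
struct reclaim_policy {
	double pressure_target;      /* highest acceptable pressure reading */
	unsigned long long min_step; /* smallest request worth issuing, bytes */
	unsigned long long max_step; /* largest single request, bytes */
};

/*
 * Decide how many bytes to request in the next round.
 *
 * If the workload showed more pressure than the target, halve the
 * previous request, and skip the round (return 0) once the result drops
 * below min_step. Otherwise double the previous request, clamped to
 * [min_step, max_step].
 */
unsigned long long next_reclaim_step(const struct reclaim_policy *p,
				     unsigned long long prev_step,
				     double observed_pressure)
{
	unsigned long long step;

	assert(p != NULL && p->min_step <= p->max_step);

	if (observed_pressure > p->pressure_target) {
		step = prev_step / 2;
		if (step > p->max_step)
			step = p->max_step;
		return step < p->min_step ? 0 : step;
	}

	/* Avoid overflow when doubling: saturate at max_step instead. */
	if (prev_step > p->max_step / 2)
		step = p->max_step;
	else
		step = prev_step * 2;

	return step < p->min_step ? p->min_step : step;
}
```

The point of the sketch is only that the request size is driven by a
metric sampled from the workload itself, which is why throttling caused
by memory.high would distort the loop's input.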

Interface:
----------

Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
trigger reclaim in the target memory cgroup.
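
As a rough illustration of how a string such as "10M" turns into a
reclaim target: the handler in this patch passes the written string to
page_counter_memparse(), which, to the best of my understanding, accepts
a size suffix and converts the byte count into whole pages. The sketch
below is a userspace approximation of that behaviour, not the kernel
code. The helper names are hypothetical, only the K/M/G suffixes are
handled, and rejecting empty input is a choice made here for simplicity.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a size such as "4096", "512K" or "10M"
 * into bytes. Returns 0 on success and -EINVAL on malformed input or
 * overflow.
 */
int parse_size_bytes(const char *s, unsigned long long *out)
{
	char *end;
	unsigned long long val;
	unsigned int shift = 0;

	assert(s != NULL && out != NULL);

	/* Require a leading digit so signs and whitespace are rejected. */
	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 0);
	if (end == s || errno == ERANGE)
		return -EINVAL;

	switch (*end) {
	case 'G': case 'g': shift = 30; end++; break;
	case 'M': case 'm': shift = 20; end++; break;
	case 'K': case 'k': shift = 10; end++; break;
	default: break;
	}

	/* Anything left after the optional suffix is an error. */
	if (*end != '\0')
		return -EINVAL;
	if (shift && val > (~0ULL >> shift))
		return -EINVAL;

	*out = val << shift;
	return 0;
}

/* Convert bytes to whole pages, rounding down. */
unsigned long long bytes_to_pages(unsigned long long bytes,
				  unsigned long long page_size)
{
	assert(page_size != 0);
	return bytes / page_size;
}
```

Under these assumptions, "10M" parses to 10485760 bytes, which is 2560
pages with a 4096-byte page size.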


Possible Extensions:
--------------------

- This interface can be extended with an additional parameter or flags
  to allow specifying one or more types of memory to reclaim from (e.g.
  file, anon, ..).

- The interface can also be extended with a node mask to reclaim from
  specific nodes. This has use cases for reclaim-based demotion in memory
  tiering systems.

- A similar per-node interface can also be added to support proactive
  reclaim and reclaim-based demotion in systems without memcg.

For now, let's keep things simple by adding the basic functionality.

[yosryahmed@google.com: refreshed to current master, updated commit
message based on recent discussions and use cases]
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
 mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

Comments

Roman Gushchin March 31, 2022, 5:25 p.m. UTC | #1

On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> From: Shakeel Butt <shakeelb@google.com>
> 
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> 
> Use case: Proactive Reclaim
> ---------------------------
> 
> A userspace proactive reclaimer can continuously probe the memcg to
> reclaim a small amount of memory. This gives more accurate and
> up-to-date workingset estimation as the LRUs are continuously
> sorted and can potentially provide more deterministic memory
> overcommit behavior. The memory overcommit controller can provide
> more proactive response to the changing behavior of the running
> applications instead of being reactive.
> 
> A userspace reclaimer's purpose in this case is not a complete replacement
> for kswapd or direct reclaim, it is to proactively identify memory savings
> opportunities and reclaim some amount of cold pages set by the policy
> to free up the memory for more demanding jobs or scheduling new jobs.
> 
> A user space proactive reclaimer is used in Google data centers.
> Additionally, Meta's TMO paper recently referenced a very similar
> interface used for user space proactive reclaim:
> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> 
> Benefits of a user space reclaimer:
> -----------------------------------
> 
> 1) More flexible on who should be charged for the cpu of the memory
> reclaim. For proactive reclaim, it makes more sense to be centralized.
> 
> 2) More flexible on dedicating the resources (like cpu). The memory
> overcommit controller can balance the cost between the cpu usage and
> the memory reclaimed.
> 
> 3) Provides a way to the applications to keep their LRUs sorted, so,
> under memory pressure better reclaim candidates are selected. This also
> gives more accurate and uptodate notion of working set for an
> application.
> 
> Why memory.high is not enough?
> ------------------------------
> 
> - memory.high can be used to trigger reclaim in a memcg and can
>   potentially be used for proactive reclaim.
>   However there is a big downside in using memory.high. It can potentially
>   introduce high reclaim stalls in the target application as the
>   allocations from the processes or the threads of the application can hit
>   the temporary memory.high limit.
> 
> - Userspace proactive reclaimers usually use feedback loops to decide
>   how much memory to proactively reclaim from a workload. The metrics
>   used for this are usually either refaults or PSI, and these metrics
>   will become messy if the application gets throttled by hitting the
>   high limit.
> 
> - memory.high is a stateful interface, if the userspace proactive
>   reclaimer crashes for any reason while triggering reclaim it can leave
>   the application in a bad state.
> 
> - If a workload is rapidly expanding, setting memory.high to proactively
>   reclaim memory can result in actually reclaiming more memory than
>   intended.
> 
> The benefits of such interface and shortcomings of existing interface
> were further discussed in this RFC thread:
> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/

Hello!

I'm totally up for the proposed feature! It makes total sense and has
proven to be useful, so let's add it.

> 
> Interface:
> ----------
> 
> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> trigger reclaim in the target memory cgroup.
> 
> 
> Possible Extensions:
> --------------------
> 
> - This interface can be extended with an additional parameter or flags
>   to allow specifying one or more types of memory to reclaim from (e.g.
>   file, anon, ..).
> 
> - The interface can also be extended with a node mask to reclaim from
>   specific nodes. This has use cases for reclaim-based demotion in memory
>   tiering systems.
> 
> - A similar per-node interface can also be added to support proactive
>   reclaim and reclaim-based demotion in systems without memcg.

Maybe an option to specify a timeout? That might simplify the userspace part.
Also, please please add a test to selftests/cgroup/memcg tests.
It will also provide an example on how the userspace can use the feature.
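
For illustration only, a userspace caller of the proposed file could look
roughly like the sketch below. It is not the requested selftest. The
function names are hypothetical, and the only thing taken from the patch
is the file name memory.reclaim and the fact that it accepts a byte count
written as text.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write a decimal byte count to an already-open descriptor.
 * Returns 0 on success or a negative errno value.
 */
int write_reclaim_request(int fd, unsigned long long bytes)
{
	char buf[32];
	int len;
	ssize_t ret;

	assert(fd >= 0);

	len = snprintf(buf, sizeof(buf), "%llu", bytes);
	if (len < 0 || (size_t)len >= sizeof(buf))
		return -EINVAL;

	ret = write(fd, buf, (size_t)len);
	if (ret < 0)
		return -errno;

	/* Treat a short write as an I/O error. */
	return ret == len ? 0 : -EIO;
}

/*
 * Hypothetical convenience wrapper: open <cgroup_dir>/memory.reclaim
 * and ask the kernel to reclaim the given number of bytes.
 */
int reclaim_from_cgroup(const char *cgroup_dir, unsigned long long bytes)
{
	char path[4096];
	int fd, len, ret;

	assert(cgroup_dir != NULL);

	len = snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup_dir);
	if (len < 0 || (size_t)len >= sizeof(path))
		return -ENAMETOOLONG;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write_reclaim_request(fd, bytes);
	close(fd);
	return ret;
}
```

With the patch as posted, the handler returns nbytes whenever the input
parses, so a successful write here does not tell the caller how much
memory was actually reclaimed.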

> 
> For now, let's keep things simple by adding the basic functionality.

What I'm worried about is how we are going to extend it. What do you see
the interface looking like with 2-3 of the extensions from the list above?
All these extensions look very reasonable to me, so we'll likely have to
implement them soon. So let's think about the extensibility now.

I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
In the end, such a feature might make sense on the system level too.
Yes, there is the drop_caches sysctl, but it's too radical for many cases.

> 
> [yosryahmed@google.com: refreshed to current master, updated commit
> message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
>  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69d7a6983f78..925aaabb2247 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
>  	high limit is used and monitored properly, this limit's
>  	utility is limited to providing the final safety net.
>  
> +  memory.reclaim
> +	A write-only file which exists on non-root cgroups.
> +
> +	This is a simple interface to trigger memory reclaim in the
> +	target cgroup. Write the number of bytes to reclaim to this
> +	file and the kernel will try to reclaim that much memory.
> +	Please note that the kernel can over or under reclaim from
> +	the target cgroup.
> +
>    memory.oom.group
>  	A read-write single value file which exists on non-root
>  	cgroups.  The default value is "0".
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 725f76723220..994849fab7df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}
> +
> +	return nbytes;
> +}
> +
>  static struct cftype memory_files[] = {
>  	{
>  		.name = "current",
> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
>  		.seq_show = memory_oom_group_show,
>  		.write = memory_oom_group_write,
>  	},
> +	{
> +		.name = "reclaim",
> +		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> +		.write = memory_reclaim,

Btw, why not on root?

Johannes Weiner March 31, 2022, 7:25 p.m. UTC | #2

On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> From: Shakeel Butt <shakeelb@google.com>
> 
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> 
> Use case: Proactive Reclaim
> ---------------------------
> 
> A userspace proactive reclaimer can continuously probe the memcg to
> reclaim a small amount of memory. This gives more accurate and
> up-to-date workingset estimation as the LRUs are continuously
> sorted and can potentially provide more deterministic memory
> overcommit behavior. The memory overcommit controller can provide
> more proactive response to the changing behavior of the running
> applications instead of being reactive.
> 
> A userspace reclaimer's purpose in this case is not a complete replacement
> for kswapd or direct reclaim, it is to proactively identify memory savings
> opportunities and reclaim some amount of cold pages set by the policy
> to free up the memory for more demanding jobs or scheduling new jobs.
> 
> A user space proactive reclaimer is used in Google data centers.
> Additionally, Meta's TMO paper recently referenced a very similar
> interface used for user space proactive reclaim:
> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> 
> Benefits of a user space reclaimer:
> -----------------------------------
> 
> 1) More flexible on who should be charged for the cpu of the memory
> reclaim. For proactive reclaim, it makes more sense to be centralized.
> 
> 2) More flexible on dedicating the resources (like cpu). The memory
> overcommit controller can balance the cost between the cpu usage and
> the memory reclaimed.
> 
> 3) Provides a way to the applications to keep their LRUs sorted, so,
> under memory pressure better reclaim candidates are selected. This also
> gives more accurate and uptodate notion of working set for an
> application.
> 
> Why memory.high is not enough?
> ------------------------------
> 
> - memory.high can be used to trigger reclaim in a memcg and can
>   potentially be used for proactive reclaim.
>   However there is a big downside in using memory.high. It can potentially
>   introduce high reclaim stalls in the target application as the
>   allocations from the processes or the threads of the application can hit
>   the temporary memory.high limit.
> 
> - Userspace proactive reclaimers usually use feedback loops to decide
>   how much memory to proactively reclaim from a workload. The metrics
>   used for this are usually either refaults or PSI, and these metrics
>   will become messy if the application gets throttled by hitting the
>   high limit.
> 
> - memory.high is a stateful interface, if the userspace proactive
>   reclaimer crashes for any reason while triggering reclaim it can leave
>   the application in a bad state.
> 
> - If a workload is rapidly expanding, setting memory.high to proactively
>   reclaim memory can result in actually reclaiming more memory than
>   intended.
> 
> The benefits of such interface and shortcomings of existing interface
> were further discussed in this RFC thread:
> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> 
> Interface:
> ----------
> 
> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> trigger reclaim in the target memory cgroup.
> 
> 
> Possible Extensions:
> --------------------
> 
> - This interface can be extended with an additional parameter or flags
>   to allow specifying one or more types of memory to reclaim from (e.g.
>   file, anon, ..).
> 
> - The interface can also be extended with a node mask to reclaim from
>   specific nodes. This has use cases for reclaim-based demotion in memory
>   tiering systems.
> 
> - A similar per-node interface can also be added to support proactive
>   reclaim and reclaim-based demotion in systems without memcg.
> 
> For now, let's keep things simple by adding the basic functionality.
> 
> [yosryahmed@google.com: refreshed to current master, updated commit
> message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for compiling all the history and arguments around this change!

Andrew Morton April 1, 2022, 12:33 a.m. UTC | #3

On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:

> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>  
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}

Is there any way in which this can be provoked into triggering the
softlockup detector?

Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel? 
Would additional flexibility be gained by letting userspace handle
retrying?

Chen Wandun April 1, 2022, 3:05 a.m. UTC | #4

On 2022/3/31 16:41, Yosry Ahmed wrote:
> From: Shakeel Butt <shakeelb@google.com>
>
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
>
> Use case: Proactive Reclaim
> ---------------------------
>
> A userspace proactive reclaimer can continuously probe the memcg to
> reclaim a small amount of memory. This gives more accurate and
> up-to-date workingset estimation as the LRUs are continuously
> sorted and can potentially provide more deterministic memory
> overcommit behavior. The memory overcommit controller can provide
> more proactive response to the changing behavior of the running
> applications instead of being reactive.
>
> A userspace reclaimer's purpose in this case is not a complete replacement
> for kswapd or direct reclaim, it is to proactively identify memory savings
> opportunities and reclaim some amount of cold pages set by the policy
> to free up the memory for more demanding jobs or scheduling new jobs.
>
> A user space proactive reclaimer is used in Google data centers.
> Additionally, Meta's TMO paper recently referenced a very similar
> interface used for user space proactive reclaim:
> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
>
> Benefits of a user space reclaimer:
> -----------------------------------
>
> 1) More flexible on who should be charged for the cpu of the memory
> reclaim. For proactive reclaim, it makes more sense to be centralized.
>
> 2) More flexible on dedicating the resources (like cpu). The memory
> overcommit controller can balance the cost between the cpu usage and
> the memory reclaimed.
>
> 3) Provides a way to the applications to keep their LRUs sorted, so,
> under memory pressure better reclaim candidates are selected. This also
> gives more accurate and uptodate notion of working set for an
> application.
>
> Why memory.high is not enough?
> ------------------------------
>
> - memory.high can be used to trigger reclaim in a memcg and can
>    potentially be used for proactive reclaim.
>    However there is a big downside in using memory.high. It can potentially
>    introduce high reclaim stalls in the target application as the
>    allocations from the processes or the threads of the application can hit
>    the temporary memory.high limit.
>
> - Userspace proactive reclaimers usually use feedback loops to decide
>    how much memory to proactively reclaim from a workload. The metrics
>    used for this are usually either refaults or PSI, and these metrics
>    will become messy if the application gets throttled by hitting the
>    high limit.
>
> - memory.high is a stateful interface, if the userspace proactive
>    reclaimer crashes for any reason while triggering reclaim it can leave
>    the application in a bad state.
>
> - If a workload is rapidly expanding, setting memory.high to proactively
>    reclaim memory can result in actually reclaiming more memory than
>    intended.
>
> The benefits of such interface and shortcomings of existing interface
> were further discussed in this RFC thread:
> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
>
> Interface:
> ----------
>
> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> trigger reclaim in the target memory cgroup.
>
>
> Possible Extensions:
> --------------------
>
> - This interface can be extended with an additional parameter or flags
>    to allow specifying one or more types of memory to reclaim from (e.g.
>    file, anon, ..).
>
> - The interface can also be extended with a node mask to reclaim from
>    specific nodes. This has use cases for reclaim-based demotion in memory
>    tiering systems.
>
> - A similar per-node interface can also be added to support proactive
>    reclaim and reclaim-based demotion in systems without memcg.
>
> For now, let's keep things simple by adding the basic functionality.
>
> [yosryahmed@google.com: refreshed to current master, updated commit
> message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>   Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
>   mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
>   2 files changed, 46 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69d7a6983f78..925aaabb2247 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
>   	high limit is used and monitored properly, this limit's
>   	utility is limited to providing the final safety net.
>   
> +  memory.reclaim
> +	A write-only file which exists on non-root cgroups.
> +
> +	This is a simple interface to trigger memory reclaim in the
> +	target cgroup. Write the number of bytes to reclaim to this
> +	file and the kernel will try to reclaim that much memory.
> +	Please note that the kernel can over or under reclaim from
> +	the target cgroup.
> +
>     memory.oom.group
>   	A read-write single value file which exists on non-root
>   	cgroups.  The default value is "0".
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 725f76723220..994849fab7df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>   	return nbytes;
>   }
>   
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);
In some scenarios there is a lot of page cache and we only want to
reclaim page cache. How about adding a may_swap option?
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}
> +
> +	return nbytes;
> +}
> +
>   static struct cftype memory_files[] = {
>   	{
>   		.name = "current",
> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
>   		.seq_show = memory_oom_group_show,
>   		.write = memory_oom_group_write,
>   	},
> +	{
> +		.name = "reclaim",
> +		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> +		.write = memory_reclaim,
> +	},
>   	{ }	/* terminate */
>   };
>

Wei Xu April 1, 2022, 3:38 a.m. UTC | #5

On Thu, Mar 31, 2022 at 5:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >       return nbytes;
> >  }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +                           size_t nbytes, loff_t off)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +     int err;
> > +
> > +     buf = strstrip(buf);
> > +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +     if (err)
> > +             return err;
> > +
> > +     while (nr_reclaimed < nr_to_reclaim) {
> > +             unsigned long reclaimed;
> > +
> > +             if (signal_pending(current))
> > +                     break;
> > +
> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                             nr_to_reclaim - nr_reclaimed,
> > +                                             GFP_KERNEL, true);
> > +
> > +             if (!reclaimed && !nr_retries--)
> > +                     break;
> > +
> > +             nr_reclaimed += reclaimed;
> > +     }
>
> Is there any way in which this can be provoked into triggering the
> softlockup detector?

memory.reclaim is similar to memory.high w.r.t. reclaiming memory,
except that memory.reclaim is stateless, while the kernel remembers
the state set by memory.high.  So memory.reclaim should not bring in
any new risks of triggering soft lockup, if any.

> Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel?
> Would additional flexibility be gained by letting userspace handle
> retrying?

I agree it is better to retry from the userspace.
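
As a sketch of what retrying from userspace could look like, the loop
below mirrors the structure of the in-kernel loop quoted above, with the
retry budget and the reclaim step supplied by the caller. All names are
hypothetical. In particular, the callback is assumed to report how many
bytes one attempt reclaimed (for example by comparing memory.current
before and after a write to memory.reclaim), which the posted interface
does not report directly. The fake_pool type is only a stand-in so the
loop can be exercised without a cgroup.

```c
#include <assert.h>
#include <stddef.h>

/* One reclaim attempt: try to free up to `want` bytes, return bytes freed. */
typedef unsigned long long (*reclaim_attempt_fn)(void *ctx,
						 unsigned long long want);

/*
 * Keep calling `attempt` until `target` bytes are reclaimed or the
 * zero-progress budget is used up. As in the quoted kernel loop, the
 * budget is only consumed by attempts that reclaim nothing and is not
 * reset on progress. Returns the total number of bytes reclaimed.
 */
unsigned long long reclaim_with_retries(reclaim_attempt_fn attempt, void *ctx,
					unsigned long long target,
					unsigned int max_retries)
{
	unsigned long long done = 0;
	unsigned int retries = max_retries;

	assert(attempt != NULL);

	while (done < target) {
		unsigned long long got = attempt(ctx, target - done);

		if (got == 0) {
			if (retries == 0)
				break;
			retries--;
			continue;
		}
		done += got;
	}
	return done;
}

/* Stand-in for a real cgroup: a pool with a per-call reclaim cap. */
struct fake_pool {
	unsigned long long reclaimable;
	unsigned long long per_call_cap;
	unsigned int calls;
};

unsigned long long fake_attempt(void *ctx, unsigned long long want)
{
	struct fake_pool *pool = ctx;
	unsigned long long got = want;

	pool->calls++;
	if (got > pool->per_call_cap)
		got = pool->per_call_cap;
	if (got > pool->reclaimable)
		got = pool->reclaimable;
	pool->reclaimable -= got;
	return got;
}
```

Keeping the loop in userspace lets the caller choose its own retry
budget, add sleeps or a deadline between attempts, or stop early based
on pressure metrics, which is the flexibility the question above is
asking about.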

Wei Xu April 1, 2022, 4:05 a.m. UTC | #6

On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> From: Shakeel Butt <shakeelb@google.com>
>
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
>
> Use case: Proactive Reclaim
> ---------------------------
>
> A userspace proactive reclaimer can continuously probe the memcg to
> reclaim a small amount of memory. This gives more accurate and
> up-to-date workingset estimation as the LRUs are continuously
> sorted and can potentially provide more deterministic memory
> overcommit behavior. The memory overcommit controller can provide
> more proactive response to the changing behavior of the running
> applications instead of being reactive.
>
> A userspace reclaimer's purpose in this case is not a complete replacement
> for kswapd or direct reclaim, it is to proactively identify memory savings
> opportunities and reclaim some amount of cold pages set by the policy
> to free up the memory for more demanding jobs or scheduling new jobs.
>
> A user space proactive reclaimer is used in Google data centers.
> Additionally, Meta's TMO paper recently referenced a very similar
> interface used for user space proactive reclaim:
> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
>
> Benefits of a user space reclaimer:
> -----------------------------------
>
> 1) More flexible on who should be charged for the cpu of the memory
> reclaim. For proactive reclaim, it makes more sense to be centralized.
>
> 2) More flexible on dedicating the resources (like cpu). The memory
> overcommit controller can balance the cost between the cpu usage and
> the memory reclaimed.
>
> 3) Provides a way to the applications to keep their LRUs sorted, so,
> under memory pressure better reclaim candidates are selected. This also
> gives more accurate and uptodate notion of working set for an
> application.
>
> Why memory.high is not enough?
> ------------------------------
>
> - memory.high can be used to trigger reclaim in a memcg and can
>   potentially be used for proactive reclaim.
>   However there is a big downside in using memory.high. It can potentially
>   introduce high reclaim stalls in the target application as the
>   allocations from the processes or the threads of the application can hit
>   the temporary memory.high limit.
>
> - Userspace proactive reclaimers usually use feedback loops to decide
>   how much memory to proactively reclaim from a workload. The metrics
>   used for this are usually either refaults or PSI, and these metrics
>   will become messy if the application gets throttled by hitting the
>   high limit.
>
> - memory.high is a stateful interface, if the userspace proactive
>   reclaimer crashes for any reason while triggering reclaim it can leave
>   the application in a bad state.
>
> - If a workload is rapidly expanding, setting memory.high to proactively
>   reclaim memory can result in actually reclaiming more memory than
>   intended.
>
> The benefits of such interface and shortcomings of existing interface
> were further discussed in this RFC thread:
> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
>
> Interface:
> ----------
>
> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> trigger reclaim in the target memory cgroup.
>
>
> Possible Extensions:
> --------------------
>
> - This interface can be extended with an additional parameter or flags
>   to allow specifying one or more types of memory to reclaim from (e.g.
>   file, anon, ..).
>
> - The interface can also be extended with a node mask to reclaim from
>   specific nodes. This has use cases for reclaim-based demotion in memory
>   tiering systems.
>
> - A similar per-node interface can also be added to support proactive
>   reclaim and reclaim-based demotion in systems without memcg.
>
> For now, let's keep things simple by adding the basic functionality.
>
> [yosryahmed@google.com: refreshed to current master, updated commit
> message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
>  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69d7a6983f78..925aaabb2247 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
>         high limit is used and monitored properly, this limit's
>         utility is limited to providing the final safety net.
>
> +  memory.reclaim
> +       A write-only file which exists on non-root cgroups.
> +
> +       This is a simple interface to trigger memory reclaim in the
> +       target cgroup. Write the number of bytes to reclaim to this
> +       file and the kernel will try to reclaim that much memory.
> +       Please note that the kernel can over or under reclaim from
> +       the target cgroup.
> +
>    memory.oom.group
>         A read-write single value file which exists on non-root
>         cgroups.  The default value is "0".
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 725f76723220..994849fab7df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +                             size_t nbytes, loff_t off)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +       unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +       int err;
> +
> +       buf = strstrip(buf);
> +       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +       if (err)
> +               return err;
> +
> +       while (nr_reclaimed < nr_to_reclaim) {
> +               unsigned long reclaimed;
> +
> +               if (signal_pending(current))
> +                       break;
> +
> +               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +                                               nr_to_reclaim - nr_reclaimed,
> +                                               GFP_KERNEL, true);
> +
> +               if (!reclaimed && !nr_retries--)
> +                       break;
> +
> +               nr_reclaimed += reclaimed;
> +       }
> +
> +       return nbytes;

It is better to return an error code (e.g. -EBUSY) when
memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory,
except if the cgroup memory usage is already 0.  We can also return
-EINVAL if nr_to_reclaim is too large (e.g. > limit).
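
To make the suggestion concrete, here is a minimal userspace model of
that return policy. It is a sketch, not kernel code: struct fake_memcg
and reclaim_step() are hypothetical stand-ins for the memcg and for
try_to_free_mem_cgroup_pages(), units are pages rather than bytes, and
the retry budget of 16 is an assumption standing in for
MAX_RECLAIM_RETRIES.

```c
#include <errno.h>

/* Hypothetical stand-in for a memcg; the caller keeps reclaimable <= usage. */
struct fake_memcg {
	unsigned long usage;       /* pages currently charged */
	unsigned long limit;       /* hard limit, in pages */
	unsigned long reclaimable; /* pages that reclaim can actually free */
};

#define MODEL_MAX_RETRIES 16 /* assumed value, for illustration only */

/* Stand-in for try_to_free_mem_cgroup_pages(): frees up to 'want' pages. */
static unsigned long reclaim_step(struct fake_memcg *cg, unsigned long want)
{
	unsigned long got = want < cg->reclaimable ? want : cg->reclaimable;

	cg->reclaimable -= got;
	cg->usage -= got;
	return got;
}

/*
 * Returns 0 on success, -EINVAL if the request exceeds the limit, and
 * -EBUSY if the target was not reached while usage is still non-zero.
 */
static int model_memory_reclaim(struct fake_memcg *cg,
				unsigned long nr_to_reclaim)
{
	unsigned int nr_retries = MODEL_MAX_RETRIES;
	unsigned long nr_reclaimed = 0;

	if (nr_to_reclaim > cg->limit)
		return -EINVAL;

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		reclaimed = reclaim_step(cg, nr_to_reclaim - nr_reclaimed);
		if (!reclaimed && !nr_retries--)
			break;

		nr_reclaimed += reclaimed;
	}

	if (nr_reclaimed < nr_to_reclaim && cg->usage > 0)
		return -EBUSY;

	return 0;
}
```

With this policy, a writer can tell a fully satisfied request apart from
one that gave up early, without having to sample memory.current around
the write.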

> +}
> +
>  static struct cftype memory_files[] = {
>         {
>                 .name = "current",
> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
>                 .seq_show = memory_oom_group_show,
>                 .write = memory_oom_group_write,
>         },
> +       {
> +               .name = "reclaim",
> +               .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> +               .write = memory_reclaim,
> +       },
>         { }     /* terminate */
>  };
>
> --
> 2.35.1.1021.g381101b075-goog
>

Wei Xu April 1, 2022, 6:01 a.m. UTC | #7

On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >   potentially be used for proactive reclaim.
> >   However there is a big downside in using memory.high. It can potentially
> >   introduce high reclaim stalls in the target application as the
> >   allocations from the processes or the threads of the application can hit
> >   the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >   how much memory to proactively reclaim from a workload. The metrics
> >   used for this are usually either refaults or PSI, and these metrics
> >   will become messy if the application gets throttled by hitting the
> >   high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >   reclaimer crashes for any reason while triggering reclaim it can leave
> >   the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >   reclaim memory can result in actually reclaiming more memory than
> >   intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
>
> Hello!
>
> I'm totally up for the proposed feature! It makes total sense and is proved
> to be useful, let's add it.
>
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >   file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
>
> Maybe an option to specify a timeout? That might simplify the userspace part.

A timeout is a good idea.  I think it can be added as an extension,
similar to other extensions.

> Also, please please add a test to selftests/cgroup/memcg tests.
> It will also provide an example on how the userspace can use the feature.

+1

> >
> > For now, let's keep things simple by adding the basic functionality.
>
> What I'm worried about is how we gonna extend it? How do you see the interface
> with 2-3 extensions from the list above? All these extensions look very
> reasonable to me, so we'll likely have to implement them soon. So let's think
> about the extensibility now.

For the first two extensions (flags and nodemask), they can be
implemented as additional positional arguments of memory.reclaim.

The non-memcg use cases will need a different interface, which can be
either a sysfs file or a syscall.

> I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> In the end, such a feature might make sense on the system level too.
> Yes, there is the drop_caches sysctl, but it's too radical for many cases.

A sys_reclaim() syscall is a good proposal for non-memcg use cases.  But
for memcg-based proactive reclaim, memory.reclaim is the more natural
interface: it is not common for a syscall to take a cgroup as an argument.

> >
> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> >  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..925aaabb2247 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> >       high limit is used and monitored properly, this limit's
> >       utility is limited to providing the final safety net.
> >
> > +  memory.reclaim
> > +     A write-only file which exists on non-root cgroups.
> > +
> > +     This is a simple interface to trigger memory reclaim in the
> > +     target cgroup. Write the number of bytes to reclaim to this
> > +     file and the kernel will try to reclaim that much memory.
> > +     Please note that the kernel can over or under reclaim from
> > +     the target cgroup.
> > +
> >    memory.oom.group
> >       A read-write single value file which exists on non-root
> >       cgroups.  The default value is "0".
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 725f76723220..994849fab7df 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >       return nbytes;
> >  }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +                           size_t nbytes, loff_t off)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +     int err;
> > +
> > +     buf = strstrip(buf);
> > +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +     if (err)
> > +             return err;
> > +
> > +     while (nr_reclaimed < nr_to_reclaim) {
> > +             unsigned long reclaimed;
> > +
> > +             if (signal_pending(current))
> > +                     break;
> > +
> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                             nr_to_reclaim - nr_reclaimed,
> > +                                             GFP_KERNEL, true);
> > +
> > +             if (!reclaimed && !nr_retries--)
> > +                     break;
> > +
> > +             nr_reclaimed += reclaimed;
> > +     }
> > +
> > +     return nbytes;
> > +}
> > +
> >  static struct cftype memory_files[] = {
> >       {
> >               .name = "current",
> > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
> >               .seq_show = memory_oom_group_show,
> >               .write = memory_oom_group_write,
> >       },
> > +     {
> > +             .name = "reclaim",
> > +             .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > +             .write = memory_reclaim,
>
> Btw, why not on root?

Vaibhav Jain April 1, 2022, 8:39 a.m. UTC | #8

Yosry Ahmed <yosryahmed@google.com> writes:
> From: Shakeel Butt <shakeelb@google.com>
>
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
<snip>

> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;

I think there should be a cond_resched() in this loop before
try_to_free_mem_cgroup_pages() to give reclaim a better chance of
succeeding early.

<snip>

Yosry Ahmed April 1, 2022, 9:11 a.m. UTC | #9

On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >   potentially be used for proactive reclaim.
> >   However there is a big downside in using memory.high. It can potentially
> >   introduce high reclaim stalls in the target application as the
> >   allocations from the processes or the threads of the application can hit
> >   the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >   how much memory to proactively reclaim from a workload. The metrics
> >   used for this are usually either refaults or PSI, and these metrics
> >   will become messy if the application gets throttled by hitting the
> >   high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >   reclaimer crashes for any reason while triggering reclaim it can leave
> >   the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >   reclaim memory can result in actually reclaiming more memory than
> >   intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
>
> Hello!
>
> I'm totally up for the proposed feature! It makes total sense and is proved
> to be useful, let's add it.
>
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >   file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
>
> Maybe an option to specify a timeout? That might simplify the userspace part.
> Also, please please add a test to selftests/cgroup/memcg tests.
> It will also provide an example on how the userspace can use the feature.
>

Hi Roman, thanks for taking the time to review this!

A timeout can be a good extension. I will add it to the list of
possible extensions in the commit message of the next version.

I will add a test in v2, thanks!

>
> >
> > For now, let's keep things simple by adding the basic functionality.
>
> What I'm worried about is how we gonna extend it? How do you see the interface
> with 2-3 extensions from the list above? All these extensions look very
> reasonable to me, so we'll likely have to implement them soon. So let's think
> about the extensibility now.
>

My idea is to have these extensions as optional positional arguments
(like Wei suggested), so that the interface does not get too
complicated for users who don't care about tuning these options. If
this is the case then I think there is nothing to worry about.
Otherwise, if you think some of these options make sense to be a
required argument instead, we can rethink the initial interface.
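
To illustrate what I mean by optional positional arguments, here is a
rough userspace sketch of a parser for "<size> [type]". Everything in it
is hypothetical and only for discussion: the "file"/"anon" keywords, the
enum and struct names, and parse_size(), which is a simplified stand-in
for the kernel's size parsing (decimal only, K/M/G suffixes, no overflow
checks). A node mask could follow as a third optional argument in the
same way.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical memory types a caller might select. */
enum reclaim_type { RECLAIM_ANY, RECLAIM_FILE, RECLAIM_ANON };

struct reclaim_req {
	unsigned long long bytes;
	enum reclaim_type type;
};

/* Parse "<number>[K|M|G]" into bytes; *end points past the consumed text. */
static int parse_size(const char *s, char **end, unsigned long long *bytes)
{
	unsigned long long v;

	if (!isdigit((unsigned char)*s))
		return -EINVAL;

	errno = 0;
	v = strtoull(s, end, 10);
	if (errno)
		return -EINVAL;

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}

	*bytes = v;
	return 0;
}

/* True if p holds exactly 'word', optionally followed by a newline. */
static int word_is(const char *p, const char *word)
{
	size_t n = strlen(word);

	return !strncmp(p, word, n) && (p[n] == '\0' || p[n] == '\n');
}

/* Parse "<size> [file|anon]"; the type defaults to RECLAIM_ANY. */
static int parse_reclaim(const char *buf, struct reclaim_req *req)
{
	char *p;
	int err = parse_size(buf, &p, &req->bytes);

	if (err)
		return err;

	req->type = RECLAIM_ANY;
	if (*p == '\0' || *p == '\n')
		return 0;
	if (*p != ' ')
		return -EINVAL;
	while (*p == ' ')
		p++;
	if (*p == '\0' || *p == '\n')
		return 0;

	if (word_is(p, "file"))
		req->type = RECLAIM_FILE;
	else if (word_is(p, "anon"))
		req->type = RECLAIM_ANON;
	else
		return -EINVAL;

	return 0;
}
```

A plain "10M" keeps working unchanged under this scheme, which is the
property I care about: users who do not tune anything never see the
extra arguments.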

> I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> In the end, such a feature might make sense on the system level too.
> Yes, there is the drop_caches sysctl, but it's too radical for many cases.
>

I think in the RFC discussion there was consensus to add both a
per-memcg knob, as well as per-node / per-system knobs (through sysfs
or syscalls) later. Wei also points out that it's not common for a
syscall to have a cgroup argument.

> >
> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> >  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..925aaabb2247 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> >       high limit is used and monitored properly, this limit's
> >       utility is limited to providing the final safety net.
> >
> > +  memory.reclaim
> > +     A write-only file which exists on non-root cgroups.
> > +
> > +     This is a simple interface to trigger memory reclaim in the
> > +     target cgroup. Write the number of bytes to reclaim to this
> > +     file and the kernel will try to reclaim that much memory.
> > +     Please note that the kernel can over or under reclaim from
> > +     the target cgroup.
> > +
> >    memory.oom.group
> >       A read-write single value file which exists on non-root
> >       cgroups.  The default value is "0".
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 725f76723220..994849fab7df 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >       return nbytes;
> >  }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +                           size_t nbytes, loff_t off)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +     int err;
> > +
> > +     buf = strstrip(buf);
> > +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +     if (err)
> > +             return err;
> > +
> > +     while (nr_reclaimed < nr_to_reclaim) {
> > +             unsigned long reclaimed;
> > +
> > +             if (signal_pending(current))
> > +                     break;
> > +
> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                             nr_to_reclaim - nr_reclaimed,
> > +                                             GFP_KERNEL, true);
> > +
> > +             if (!reclaimed && !nr_retries--)
> > +                     break;
> > +
> > +             nr_reclaimed += reclaimed;
> > +     }
> > +
> > +     return nbytes;
> > +}
> > +
> >  static struct cftype memory_files[] = {
> >       {
> >               .name = "current",
> > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
> >               .seq_show = memory_oom_group_show,
> >               .write = memory_oom_group_write,
> >       },
> > +     {
> > +             .name = "reclaim",
> > +             .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > +             .write = memory_reclaim,
>
> Btw, why not on root?

Yosry Ahmed April 1, 2022, 9:15 a.m. UTC | #10

On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >   potentially be used for proactive reclaim.
> >   However there is a big downside in using memory.high. It can potentially
> >   introduce high reclaim stalls in the target application as the
> >   allocations from the processes or the threads of the application can hit
> >   the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >   how much memory to proactively reclaim from a workload. The metrics
> >   used for this are usually either refaults or PSI, and these metrics
> >   will become messy if the application gets throttled by hitting the
> >   high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >   reclaimer crashes for any reason while triggering reclaim it can leave
> >   the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >   reclaim memory can result in actually reclaiming more memory than
> >   intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
>
> Hello!
>
> I'm totally up for the proposed feature! It makes total sense and is proved
> to be useful, let's add it.
>
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >   file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
>
> Maybe an option to specify a timeout? That might simplify the userspace part.
> Also, please please add a test to selftests/cgroup/memcg tests.
> It will also provide an example on how the userspace can use the feature.
>
> >
> > For now, let's keep things simple by adding the basic functionality.
>
> What I'm worried about is how we gonna extend it? How do you see the interface
> with 2-3 extensions from the list above? All these extensions look very
> reasonable to me, so we'll likely have to implement them soon. So let's think
> about the extensibility now.
>
> I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> In the end, such a feature might make sense on the system level too.
> Yes, there is the drop_caches sysctl, but it's too radical for many cases.
>
> >
> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> >  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..925aaabb2247 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> >       high limit is used and monitored properly, this limit's
> >       utility is limited to providing the final safety net.
> >
> > +  memory.reclaim
> > +     A write-only file which exists on non-root cgroups.
> > +
> > +     This is a simple interface to trigger memory reclaim in the
> > +     target cgroup. Write the number of bytes to reclaim to this
> > +     file and the kernel will try to reclaim that much memory.
> > +     Please note that the kernel can over or under reclaim from
> > +     the target cgroup.
> > +
> >    memory.oom.group
> >       A read-write single value file which exists on non-root
> >       cgroups.  The default value is "0".
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 725f76723220..994849fab7df 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >       return nbytes;
> >  }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +                           size_t nbytes, loff_t off)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +     int err;
> > +
> > +     buf = strstrip(buf);
> > +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +     if (err)
> > +             return err;
> > +
> > +     while (nr_reclaimed < nr_to_reclaim) {
> > +             unsigned long reclaimed;
> > +
> > +             if (signal_pending(current))
> > +                     break;
> > +
> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                             nr_to_reclaim - nr_reclaimed,
> > +                                             GFP_KERNEL, true);
> > +
> > +             if (!reclaimed && !nr_retries--)
> > +                     break;
> > +
> > +             nr_reclaimed += reclaimed;
> > +     }
> > +
> > +     return nbytes;
> > +}
> > +
> >  static struct cftype memory_files[] = {
> >       {
> >               .name = "current",
> > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
> >               .seq_show = memory_oom_group_show,
> >               .write = memory_oom_group_write,
> >       },
> > +     {
> > +             .name = "reclaim",
> > +             .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > +             .write = memory_reclaim,
>
> Btw, why not on root?

I missed the root question in my first reply. I think this was
originally modeled after the memory.high interface, but I don't know
if there are other reasons. Shakeel would know better.

AFAIK this should work naturally on root as well, but I think it makes
more sense then to use a global interface (hopefully introduced soon)?
I don't have a strong opinion here, so let me know what you prefer for v2.

Yosry Ahmed April 1, 2022, 9:17 a.m. UTC | #11

On Thu, Mar 31, 2022 at 8:38 PM Wei Xu <weixugc@google.com> wrote:
>
> On Thu, Mar 31, 2022 at 5:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> > >       return nbytes;
> > >  }
> > >
> > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > > +                           size_t nbytes, loff_t off)
> > > +{
> > > +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > > +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > > +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > +     int err;
> > > +
> > > +     buf = strstrip(buf);
> > > +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > > +     if (err)
> > > +             return err;
> > > +
> > > +     while (nr_reclaimed < nr_to_reclaim) {
> > > +             unsigned long reclaimed;
> > > +
> > > +             if (signal_pending(current))
> > > +                     break;
> > > +
> > > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > +                                             nr_to_reclaim - nr_reclaimed,
> > > +                                             GFP_KERNEL, true);
> > > +
> > > +             if (!reclaimed && !nr_retries--)
> > > +                     break;
> > > +
> > > +             nr_reclaimed += reclaimed;
> > > +     }
> >
> > Is there any way in which this can be provoked into triggering the
> > softlockup detector?
>
> memory.reclaim is similar to memory.high w.r.t. reclaiming memory,
> except that memory.reclaim is stateless, while the kernel remembers
> the state set by memory.high.  So memory.reclaim should not bring in
> any new risks of triggering soft lockup, if any.
>
> > Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel?
> > Would additional flexibility be gained by letting userspace handle
> > retrying?
>
> I agree it is better to retry from the userspace.

Thanks Andrew and Wei for looking at this. IIUC the
MAX_RECLAIM_RETRIES loop was modeled after the loop in memory.high as
well. Is there a reason why it should be different here?
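
To make the retry question concrete, here is a minimal userspace model of the loop quoted above. It is a sketch under stated assumptions, not kernel code: `reclaim_fn` and the two stubs are hypothetical stand-ins for try_to_free_mem_cgroup_pages(), MAX_RECLAIM_RETRIES is assumed to be 16, and the signal_pending() check is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for this model; the kernel defines its own constant. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for try_to_free_mem_cgroup_pages(). */
typedef unsigned long (*reclaim_fn)(unsigned long nr_pages, void *ctx);

struct reclaim_stats {
	unsigned long nr_reclaimed;	/* total progress reported by the callback */
	unsigned int nr_calls;		/* how many times the callback was invoked */
};

/* Userspace model of the loop in the patch (signal handling omitted). */
static struct reclaim_stats model_reclaim_loop(unsigned long nr_to_reclaim,
					       reclaim_fn reclaim, void *ctx)
{
	struct reclaim_stats st = { 0, 0 };
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;

	assert(reclaim != NULL);

	while (st.nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		reclaimed = reclaim(nr_to_reclaim - st.nr_reclaimed, ctx);
		st.nr_calls++;

		/* Same exit test as the patch: only zero-progress passes
		 * consume a retry, and the budget is never refilled. */
		if (!reclaimed && !nr_retries--)
			break;

		st.nr_reclaimed += reclaimed;
	}
	return st;
}

/* Stub: never makes progress. */
static unsigned long reclaim_nothing(unsigned long nr_pages, void *ctx)
{
	(void)nr_pages;
	(void)ctx;
	return 0;
}

/* Stub: frees at most 32 pages per call. */
static unsigned long reclaim_batch(unsigned long nr_pages, void *ctx)
{
	(void)ctx;
	return nr_pages < 32 ? nr_pages : 32;
}
```

In this model a callback that never makes progress is invoked MAX_RECLAIM_RETRIES + 1 times before the loop gives up, while a callback that frees 32 pages per call satisfies a 100-page request in four calls. Moving the retry policy to userspace would amount to running a loop like this around individual writes to memory.reclaim.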

Yosry Ahmed April 1, 2022, 9:20 a.m. UTC | #12

On Thu, Mar 31, 2022 at 8:05 PM Chen Wandun <chenwandun@huawei.com> wrote:
>
>
>
> On 2022/3/31 16:41, Yosry Ahmed wrote:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >    potentially be used for proactive reclaim.
> >    However there is a big downside in using memory.high. It can potentially
> >    introduce high reclaim stalls in the target application as the
> >    allocations from the processes or the threads of the application can hit
> >    the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >    how much memory to proactively reclaim from a workload. The metrics
> >    used for this are usually either refaults or PSI, and these metrics
> >    will become messy if the application gets throttled by hitting the
> >    high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >    reclaimer crashes for any reason while triggering reclaim it can leave
> >    the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >    reclaim memory can result in actually reclaiming more memory than
> >    intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >    to allow specifying one or more types of memory to reclaim from (e.g.
> >    file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >    specific nodes. This has use cases for reclaim-based demotion in memory
> > >    tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >    reclaim and reclaim-based demotion in systems without memcg.
> >
> > For now, let's keep things simple by adding the basic functionality.
> >
> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >   Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> >   mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> >   2 files changed, 46 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..925aaabb2247 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> >       high limit is used and monitored properly, this limit's
> >       utility is limited to providing the final safety net.
> >
> > +  memory.reclaim
> > +     A write-only file which exists on non-root cgroups.
> > +
> > +     This is a simple interface to trigger memory reclaim in the
> > +     target cgroup. Write the number of bytes to reclaim to this
> > +     file and the kernel will try to reclaim that much memory.
> > +     Please note that the kernel can over or under reclaim from
> > +     the target cgroup.
> > +
> >     memory.oom.group
> >       A read-write single value file which exists on non-root
> >       cgroups.  The default value is "0".
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 725f76723220..994849fab7df 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >       return nbytes;
> >   }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +                           size_t nbytes, loff_t off)
> > +{
> > +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +     int err;
> > +
> > +     buf = strstrip(buf);
> > +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +     if (err)
> > +             return err;
> > +
> > +     while (nr_reclaimed < nr_to_reclaim) {
> > +             unsigned long reclaimed;
> > +
> > +             if (signal_pending(current))
> > +                     break;
> > +
> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                             nr_to_reclaim - nr_reclaimed,
> > +                                             GFP_KERNEL, true);
> In some scenarios there is a lot of page cache, and we only want to
> reclaim page cache.
> How about adding a may_swap option?

Thanks for taking a look at this!

The first listed extension is an argument/flags to specify the type of
memory that we want to reclaim, I think this covers this use case, or
am I missing something?

> > +
> > +             if (!reclaimed && !nr_retries--)
> > +                     break;
> > +
> > +             nr_reclaimed += reclaimed;
> > +     }
> > +
> > +     return nbytes;
> > +}
> > +
> >   static struct cftype memory_files[] = {
> >       {
> >               .name = "current",
> > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
> >               .seq_show = memory_oom_group_show,
> >               .write = memory_oom_group_write,
> >       },
> > +     {
> > +             .name = "reclaim",
> > +             .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > +             .write = memory_reclaim,
> > +     },
> >       { }     /* terminate */
> >   };
> >
>

Yosry Ahmed April 1, 2022, 9:22 a.m. UTC | #13

On Thu, Mar 31, 2022 at 9:05 PM Wei Xu <weixugc@google.com> wrote:
>
> On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >   potentially be used for proactive reclaim.
> >   However there is a big downside in using memory.high. It can potentially
> >   introduce high reclaim stalls in the target application as the
> >   allocations from the processes or the threads of the application can hit
> >   the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >   how much memory to proactively reclaim from a workload. The metrics
> >   used for this are usually either refaults or PSI, and these metrics
> >   will become messy if the application gets throttled by hitting the
> >   high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >   reclaimer crashes for any reason while triggering reclaim it can leave
> >   the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >   reclaim memory can result in actually reclaiming more memory than
> >   intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >   file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> > >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
> >
> > For now, let's keep things simple by adding the basic functionality.
> >
> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> >  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..925aaabb2247 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> >         high limit is used and monitored properly, this limit's
> >         utility is limited to providing the final safety net.
> >
> > +  memory.reclaim
> > +       A write-only file which exists on non-root cgroups.
> > +
> > +       This is a simple interface to trigger memory reclaim in the
> > +       target cgroup. Write the number of bytes to reclaim to this
> > +       file and the kernel will try to reclaim that much memory.
> > +       Please note that the kernel can over or under reclaim from
> > +       the target cgroup.
> > +
> >    memory.oom.group
> >         A read-write single value file which exists on non-root
> >         cgroups.  The default value is "0".
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 725f76723220..994849fab7df 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >         return nbytes;
> >  }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +                             size_t nbytes, loff_t off)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +       unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +       int err;
> > +
> > +       buf = strstrip(buf);
> > +       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +       if (err)
> > +               return err;
> > +
> > +       while (nr_reclaimed < nr_to_reclaim) {
> > +               unsigned long reclaimed;
> > +
> > +               if (signal_pending(current))
> > +                       break;
> > +
> > +               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                               nr_to_reclaim - nr_reclaimed,
> > +                                               GFP_KERNEL, true);
> > +
> > +               if (!reclaimed && !nr_retries--)
> > +                       break;
> > +
> > +               nr_reclaimed += reclaimed;
> > +       }
> > +
> > +       return nbytes;
>
> It is better to return an error code (e.g. -EBUSY) when
> memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory,
> except if the cgroup memory usage is already 0.  We can also return
> -EINVAL if nr_to_reclaim is too large (e.g. > limit).

IIUC this interface is modeled after memory.high, which returns nbytes
as well. If you think it's better here to do this instead of
maintaining consistency with memory.high we can certainly do this for
v2.
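
For reference, a minimal sketch of how a userspace caller would observe either behavior. It assumes only the standard open(2)/write(2) calls; `request_reclaim` is a hypothetical helper name, and the file path depends on where cgroup v2 is mounted. With the patch as posted, the write returns nbytes whether or not the target was met; with the error-code suggestion, the same caller would see a negative errno such as -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a reclaim request such as "10M" to a memory.reclaim file.
 * Returns 0 if the whole string was accepted, or a negative errno.
 * (Hypothetical helper for illustration; not part of the patch.)
 */
static int request_reclaim(const char *path, const char *amount)
{
	size_t len;
	ssize_t ret;
	int fd, err = 0;

	assert(path != NULL && amount != NULL);
	len = strlen(amount);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write(fd, amount, len);
	if (ret < 0)
		err = -errno;		/* e.g. an error reported by the kernel */
	else if ((size_t)ret != len)
		err = -EIO;		/* short write, treated as a failure here */

	close(fd);
	return err;
}
```

A reclaimer could call `request_reclaim("/sys/fs/cgroup/<group>/memory.reclaim", "10M")`, with the cgroup path filled in for its own hierarchy, and back off or retry depending on the returned code.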

>
> > +}
> > +
> >  static struct cftype memory_files[] = {
> >         {
> >                 .name = "current",
> > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
> >                 .seq_show = memory_oom_group_show,
> >                 .write = memory_oom_group_write,
> >         },
> > +       {
> > +               .name = "reclaim",
> > +               .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > +               .write = memory_reclaim,
> > +       },
> >         { }     /* terminate */
> >  };
> >
> > --
> > 2.35.1.1021.g381101b075-goog
> >

Yosry Ahmed April 1, 2022, 9:23 a.m. UTC | #14

On Fri, Apr 1, 2022 at 1:39 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>
>
> Yosry Ahmed <yosryahmed@google.com> writes:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> <snip>
>
> > +
> > +     while (nr_reclaimed < nr_to_reclaim) {
> > +             unsigned long reclaimed;
> > +
> > +             if (signal_pending(current))
> > +                     break;
> > +
> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                             nr_to_reclaim - nr_reclaimed,
> > +                                             GFP_KERNEL, true);
> > +
> > +             if (!reclaimed && !nr_retries--)
> > +                     break;
> > +
> > +             nr_reclaimed += reclaimed;
>
> I think there should be a cond_resched() in this loop before
> try_to_free_mem_cgroup_pages() to have better chances of reclaim
> succeeding early.
>
Thanks for taking the time to look at this!

I believe this loop is modeled after the loop in memory_high_write()
for the memory.high interface. Is there a reason why it should be
needed here but not there?

> <snip>
>
> --
> Cheers
> ~ Vaibhav

Chen Wandun April 1, 2022, 9:48 a.m. UTC | #15

On 2022/4/1 17:20, Yosry Ahmed wrote:
> On Thu, Mar 31, 2022 at 8:05 PM Chen Wandun <chenwandun@huawei.com> wrote:
>>
>>
>> On 2022/3/31 16:41, Yosry Ahmed wrote:
>>> From: Shakeel Butt <shakeelb@google.com>
>>>
>>> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
>>>
>>> Use case: Proactive Reclaim
>>> ---------------------------
>>>
>>> A userspace proactive reclaimer can continuously probe the memcg to
>>> reclaim a small amount of memory. This gives more accurate and
>>> up-to-date workingset estimation as the LRUs are continuously
>>> sorted and can potentially provide more deterministic memory
>>> overcommit behavior. The memory overcommit controller can provide
>>> more proactive response to the changing behavior of the running
>>> applications instead of being reactive.
>>>
>>> A userspace reclaimer's purpose in this case is not a complete replacement
>>> for kswapd or direct reclaim, it is to proactively identify memory savings
>>> opportunities and reclaim some amount of cold pages set by the policy
>>> to free up the memory for more demanding jobs or scheduling new jobs.
>>>
>>> A user space proactive reclaimer is used in Google data centers.
>>> Additionally, Meta's TMO paper recently referenced a very similar
>>> interface used for user space proactive reclaim:
>>> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
>>>
>>> Benefits of a user space reclaimer:
>>> -----------------------------------
>>>
>>> 1) More flexible on who should be charged for the cpu of the memory
>>> reclaim. For proactive reclaim, it makes more sense to be centralized.
>>>
>>> 2) More flexible on dedicating the resources (like cpu). The memory
>>> overcommit controller can balance the cost between the cpu usage and
>>> the memory reclaimed.
>>>
>>> 3) Provides a way to the applications to keep their LRUs sorted, so,
>>> under memory pressure better reclaim candidates are selected. This also
>>> gives more accurate and uptodate notion of working set for an
>>> application.
>>>
>>> Why memory.high is not enough?
>>> ------------------------------
>>>
>>> - memory.high can be used to trigger reclaim in a memcg and can
>>>     potentially be used for proactive reclaim.
>>>     However there is a big downside in using memory.high. It can potentially
>>>     introduce high reclaim stalls in the target application as the
>>>     allocations from the processes or the threads of the application can hit
>>>     the temporary memory.high limit.
>>>
>>> - Userspace proactive reclaimers usually use feedback loops to decide
>>>     how much memory to proactively reclaim from a workload. The metrics
>>>     used for this are usually either refaults or PSI, and these metrics
>>>     will become messy if the application gets throttled by hitting the
>>>     high limit.
>>>
>>> - memory.high is a stateful interface, if the userspace proactive
>>>     reclaimer crashes for any reason while triggering reclaim it can leave
>>>     the application in a bad state.
>>>
>>> - If a workload is rapidly expanding, setting memory.high to proactively
>>>     reclaim memory can result in actually reclaiming more memory than
>>>     intended.
>>>
>>> The benefits of such interface and shortcomings of existing interface
>>> were further discussed in this RFC thread:
>>> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
>>>
>>> Interface:
>>> ----------
>>>
>>> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
>>> trigger reclaim in the target memory cgroup.
>>>
>>>
>>> Possible Extensions:
>>> --------------------
>>>
>>> - This interface can be extended with an additional parameter or flags
>>>     to allow specifying one or more types of memory to reclaim from (e.g.
>>>     file, anon, ..).
>>>
>>> - The interface can also be extended with a node mask to reclaim from
>>>     specific nodes. This has use cases for reclaim-based demotion in memory
>>>     tiering systems.
>>>
>>> - A similar per-node interface can also be added to support proactive
>>>     reclaim and reclaim-based demotion in systems without memcg.
>>>
>>> For now, let's keep things simple by adding the basic functionality.
>>>
>>> [yosryahmed@google.com: refreshed to current master, updated commit
>>> message based on recent discussions and use cases]
>>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>>> ---
>>>    Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
>>>    mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
>>>    2 files changed, 46 insertions(+)
>>>
>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>> index 69d7a6983f78..925aaabb2247 100644
>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
>>>        high limit is used and monitored properly, this limit's
>>>        utility is limited to providing the final safety net.
>>>
>>> +  memory.reclaim
>>> +     A write-only file which exists on non-root cgroups.
>>> +
>>> +     This is a simple interface to trigger memory reclaim in the
>>> +     target cgroup. Write the number of bytes to reclaim to this
>>> +     file and the kernel will try to reclaim that much memory.
>>> +     Please note that the kernel can over or under reclaim from
>>> +     the target cgroup.
>>> +
>>>      memory.oom.group
>>>        A read-write single value file which exists on non-root
>>>        cgroups.  The default value is "0".
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 725f76723220..994849fab7df 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>>>        return nbytes;
>>>    }
>>>
>>> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>>> +                           size_t nbytes, loff_t off)
>>> +{
>>> +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>>> +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>>> +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
>>> +     int err;
>>> +
>>> +     buf = strstrip(buf);
>>> +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
>>> +     if (err)
>>> +             return err;
>>> +
>>> +     while (nr_reclaimed < nr_to_reclaim) {
>>> +             unsigned long reclaimed;
>>> +
>>> +             if (signal_pending(current))
>>> +                     break;
>>> +
>>> +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
>>> +                                             nr_to_reclaim - nr_reclaimed,
>>> +                                             GFP_KERNEL, true);
>> In some scenarios there is a lot of page cache, and we only want to
>> reclaim page cache.
>> How about adding a may_swap option?
> Thanks for taking a look at this!
>
> The first listed extension is an argument/flags to specify the type of
Do you mean nbytes in memory_reclaim? It decides the amount of memory
to reclaim.

One more argument, such as may_swap, can be added to memory_reclaim and
passed to try_to_free_mem_cgroup_pages in place of the
default "true".

Thanks.

> memory that we want to reclaim, I think this covers this use case, or
> am I missing something?
>
>>> +
>>> +             if (!reclaimed && !nr_retries--)
>>> +                     break;
>>> +
>>> +             nr_reclaimed += reclaimed;
>>> +     }
>>> +
>>> +     return nbytes;
>>> +}
>>> +
>>>    static struct cftype memory_files[] = {
>>>        {
>>>                .name = "current",
>>> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
>>>                .seq_show = memory_oom_group_show,
>>>                .write = memory_oom_group_write,
>>>        },
>>> +     {
>>> +             .name = "reclaim",
>>> +             .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
>>> +             .write = memory_reclaim,
>>> +     },
>>>        { }     /* terminate */
>>>    };
>>>
> .

Yosry Ahmed April 1, 2022, 10:02 a.m. UTC | #16

On Fri, Apr 1, 2022 at 2:49 AM Chen Wandun <chenwandun@huawei.com> wrote:
>
>
>
> On 2022/4/1 17:20, Yosry Ahmed wrote:
> > On Thu, Mar 31, 2022 at 8:05 PM Chen Wandun <chenwandun@huawei.com> wrote:
> >>
> >>
> >> On 2022/3/31 16:41, Yosry Ahmed wrote:
> >>> From: Shakeel Butt <shakeelb@google.com>
> >>>
> >>> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >>>
> >>> Use case: Proactive Reclaim
> >>> ---------------------------
> >>>
> >>> A userspace proactive reclaimer can continuously probe the memcg to
> >>> reclaim a small amount of memory. This gives more accurate and
> >>> up-to-date workingset estimation as the LRUs are continuously
> >>> sorted and can potentially provide more deterministic memory
> >>> overcommit behavior. The memory overcommit controller can provide
> >>> more proactive response to the changing behavior of the running
> >>> applications instead of being reactive.
> >>>
> >>> A userspace reclaimer's purpose in this case is not a complete replacement
> >>> for kswapd or direct reclaim, it is to proactively identify memory savings
> >>> opportunities and reclaim some amount of cold pages set by the policy
> >>> to free up the memory for more demanding jobs or scheduling new jobs.
> >>>
> >>> A user space proactive reclaimer is used in Google data centers.
> >>> Additionally, Meta's TMO paper recently referenced a very similar
> >>> interface used for user space proactive reclaim:
> >>> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >>>
> >>> Benefits of a user space reclaimer:
> >>> -----------------------------------
> >>>
> >>> 1) More flexible on who should be charged for the cpu of the memory
> >>> reclaim. For proactive reclaim, it makes more sense to be centralized.
> >>>
> >>> 2) More flexible on dedicating the resources (like cpu). The memory
> >>> overcommit controller can balance the cost between the cpu usage and
> >>> the memory reclaimed.
> >>>
> >>> 3) Provides a way to the applications to keep their LRUs sorted, so,
> >>> under memory pressure better reclaim candidates are selected. This also
> >>> gives more accurate and uptodate notion of working set for an
> >>> application.
> >>>
> >>> Why memory.high is not enough?
> >>> ------------------------------
> >>>
> >>> - memory.high can be used to trigger reclaim in a memcg and can
> >>>     potentially be used for proactive reclaim.
> >>>     However there is a big downside in using memory.high. It can potentially
> >>>     introduce high reclaim stalls in the target application as the
> >>>     allocations from the processes or the threads of the application can hit
> >>>     the temporary memory.high limit.
> >>>
> >>> - Userspace proactive reclaimers usually use feedback loops to decide
> >>>     how much memory to proactively reclaim from a workload. The metrics
> >>>     used for this are usually either refaults or PSI, and these metrics
> >>>     will become messy if the application gets throttled by hitting the
> >>>     high limit.
> >>>
> >>> - memory.high is a stateful interface, if the userspace proactive
> >>>     reclaimer crashes for any reason while triggering reclaim it can leave
> >>>     the application in a bad state.
> >>>
> >>> - If a workload is rapidly expanding, setting memory.high to proactively
> >>>     reclaim memory can result in actually reclaiming more memory than
> >>>     intended.
> >>>
> >>> The benefits of such interface and shortcomings of existing interface
> >>> were further discussed in this RFC thread:
> >>> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> >>>
> >>> Interface:
> >>> ----------
> >>>
> >>> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> >>> trigger reclaim in the target memory cgroup.
> >>>
> >>>
> >>> Possible Extensions:
> >>> --------------------
> >>>
> >>> - This interface can be extended with an additional parameter or flags
> >>>     to allow specifying one or more types of memory to reclaim from (e.g.
> >>>     file, anon, ..).
> >>>
> >>> - The interface can also be extended with a node mask to reclaim from
> >>>     specific nodes. This has use cases for reclaim-based demotion in memory
> > >>>     tiering systems.
> >>>
> >>> - A similar per-node interface can also be added to support proactive
> >>>     reclaim and reclaim-based demotion in systems without memcg.
> >>>
> >>> For now, let's keep things simple by adding the basic functionality.
> >>>
> >>> [yosryahmed@google.com: refreshed to current master, updated commit
> >>> message based on recent discussions and use cases]
> >>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >>> ---
> >>>    Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> >>>    mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> >>>    2 files changed, 46 insertions(+)
> >>>
> >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >>> index 69d7a6983f78..925aaabb2247 100644
> >>> --- a/Documentation/admin-guide/cgroup-v2.rst
> >>> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >>> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> >>>        high limit is used and monitored properly, this limit's
> >>>        utility is limited to providing the final safety net.
> >>>
> >>> +  memory.reclaim
> >>> +     A write-only file which exists on non-root cgroups.
> >>> +
> >>> +     This is a simple interface to trigger memory reclaim in the
> >>> +     target cgroup. Write the number of bytes to reclaim to this
> >>> +     file and the kernel will try to reclaim that much memory.
> >>> +     Please note that the kernel can over or under reclaim from
> >>> +     the target cgroup.
> >>> +
> >>>      memory.oom.group
> >>>        A read-write single value file which exists on non-root
> >>>        cgroups.  The default value is "0".
> >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>> index 725f76723220..994849fab7df 100644
> >>> --- a/mm/memcontrol.c
> >>> +++ b/mm/memcontrol.c
> >>> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >>>        return nbytes;
> >>>    }
> >>>
> >>> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >>> +                           size_t nbytes, loff_t off)
> >>> +{
> >>> +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> >>> +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> >>> +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> >>> +     int err;
> >>> +
> >>> +     buf = strstrip(buf);
> >>> +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> >>> +     if (err)
> >>> +             return err;
> >>> +
> >>> +     while (nr_reclaimed < nr_to_reclaim) {
> >>> +             unsigned long reclaimed;
> >>> +
> >>> +             if (signal_pending(current))
> >>> +                     break;
> >>> +
> >>> +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> >>> +                                             nr_to_reclaim - nr_reclaimed,
> >>> +                                             GFP_KERNEL, true);
> >> In some scenario there are lots of page cache,  and we only want to
> >> reclaim page cache,
> >> how about add may_swap option?
> > Thanks for taking a look at this!
> >
> > The first listed extension is an argument/flags to specify the type of
> do you mean nbytes in  memory_reclaim? it decide the amount of memory
> to reclaim.
>
> one more argument such as may_swap can be add into memory_reclaim, and
> pass this argument to try_to_free_mem_cgroup_pages in order to replace the
> default "true"
>
> Thanks.

I agree about the need for a may_swap or similar argument. In the
commit message I list some possible extensions to this interface, and
the first one is to add an argument to specify the type of memory we
want to reclaim using the interface (anon, file, ..), which I think
covers this use case. I just think we should add this in a separate
patch as an extension.

>
> > memory that we want to reclaim, I think this covers this use case, or
> > am I missing something?
> >
> >>> +
> >>> +             if (!reclaimed && !nr_retries--)
> >>> +                     break;
> >>> +
> >>> +             nr_reclaimed += reclaimed;
> >>> +     }
> >>> +
> >>> +     return nbytes;
> >>> +}
> >>> +
> >>>    static struct cftype memory_files[] = {
> >>>        {
> >>>                .name = "current",
> >>> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
> >>>                .seq_show = memory_oom_group_show,
> >>>                .write = memory_oom_group_write,
> >>>        },
> >>> +     {
> >>> +             .name = "reclaim",
> >>> +             .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> >>> +             .write = memory_reclaim,
> >>> +     },
> >>>        { }     /* terminate */
> >>>    };
> >>>
> > .
>

Michal Hocko April 1, 2022, 1:03 p.m. UTC | #17

On Fri 01-04-22 02:17:28, Yosry Ahmed wrote:
> On Thu, Mar 31, 2022 at 8:38 PM Wei Xu <weixugc@google.com> wrote:
> >
> > On Thu, Mar 31, 2022 at 5:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> > > >       return nbytes;
> > > >  }
> > > >
> > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > > > +                           size_t nbytes, loff_t off)
> > > > +{
> > > > +     struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > > > +     unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > > > +     unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > > +     int err;
> > > > +
> > > > +     buf = strstrip(buf);
> > > > +     err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > > > +     if (err)
> > > > +             return err;
> > > > +
> > > > +     while (nr_reclaimed < nr_to_reclaim) {
> > > > +             unsigned long reclaimed;
> > > > +
> > > > +             if (signal_pending(current))
> > > > +                     break;
> > > > +
> > > > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > > +                                             nr_to_reclaim - nr_reclaimed,
> > > > +                                             GFP_KERNEL, true);
> > > > +
> > > > +             if (!reclaimed && !nr_retries--)
> > > > +                     break;
> > > > +
> > > > +             nr_reclaimed += reclaimed;
> > > > +     }
> > >
> > > Is there any way in which this can be provoked into triggering the
> > > softlockup detector?
> >
> > memory.reclaim is similar to memory.high w.r.t. reclaiming memory,
> > except that memory.reclaim is stateless, while the kernel remembers
> > the state set by memory.high.  So memory.reclaim should not bring in
> > any new risks of triggering soft lockup, if any.

Memory reclaim already calls cond_resched() even if there is nothing
reclaimable; see shrink_node_memcgs().

> > > Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel?
> > > Would additional flexibility be gained by letting userspace handle
> > > retrying?
> >
> > I agree it is better to retry from the userspace.
> 
> Thanks Andrew and Wei for looking at this. IIUC the
> MAX_RECLAIM_RETRIES loop was modeled after the loop in memory.high as
> well. Is there a reason why it should be different here?

No, I would go with the same approach other interfaces use. I am not a
great fan of MAX_RECLAIM_RETRIES - especially when we have a bail out on
signals - but if we are to change this then let's do it consistently.
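To make the retry semantics under discussion concrete, here is a minimal userspace model of the loop in the patch. It is only a sketch: reclaim_fn is a hypothetical stand-in for try_to_free_mem_cgroup_pages(), the value 16 for MAX_RECLAIM_RETRIES is assumed rather than taken from this thread, and the signal_pending() check is left out.

```c
#include <stddef.h>

/* Assumed value for illustration; the kernel defines its own constant. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for try_to_free_mem_cgroup_pages(). */
typedef unsigned long (*reclaim_fn)(unsigned long nr_pages, void *ctx);

struct reclaim_result {
	unsigned long nr_reclaimed;	/* pages reclaimed in total */
	unsigned int nr_calls;		/* reclaim attempts made */
};

/* Same control flow as the loop in memory_reclaim(), minus signals. */
static struct reclaim_result reclaim_loop(unsigned long nr_to_reclaim,
					  reclaim_fn fn, void *ctx)
{
	struct reclaim_result res = { 0, 0 };
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;

	while (res.nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		reclaimed = fn(nr_to_reclaim - res.nr_reclaimed, ctx);
		res.nr_calls++;

		/* Only zero-progress attempts consume the retry budget. */
		if (!reclaimed && !nr_retries--)
			break;

		res.nr_reclaimed += reclaimed;
	}
	return res;
}

/* Example callback: a cgroup with nothing reclaimable. */
static unsigned long reclaim_nothing(unsigned long nr_pages, void *ctx)
{
	(void)nr_pages;
	(void)ctx;
	return 0;
}

/* Example callback: at most 32 pages freed per attempt. */
static unsigned long reclaim_up_to_32(unsigned long nr_pages, void *ctx)
{
	(void)ctx;
	return nr_pages < 32 ? nr_pages : 32;
}
```

In this model the retry budget is never refilled after a successful attempt, and a request that makes no progress at all gives up after MAX_RECLAIM_RETRIES + 1 attempts.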

Michal Hocko April 1, 2022, 1:49 p.m. UTC | #18

On Thu 31-03-22 10:25:23, Roman Gushchin wrote:
> On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
[...]
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
> 
> Maybe an option to specify a timeout? That might simplify the userspace part.

What do you mean by timeout here? Isn't
timeout $N echo $RECLAIM > ....

enough?

Michal Hocko April 1, 2022, 1:54 p.m. UTC | #19

On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> From: Shakeel Butt <shakeelb@google.com>
> 
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> 
> Use case: Proactive Reclaim
> ---------------------------
> 
> A userspace proactive reclaimer can continuously probe the memcg to
> reclaim a small amount of memory. This gives more accurate and
> up-to-date workingset estimation as the LRUs are continuously
> sorted and can potentially provide more deterministic memory
> overcommit behavior. The memory overcommit controller can provide
> more proactive response to the changing behavior of the running
> applications instead of being reactive.
> 
> A userspace reclaimer's purpose in this case is not a complete replacement
> for kswapd or direct reclaim, it is to proactively identify memory savings
> opportunities and reclaim some amount of cold pages set by the policy
> to free up the memory for more demanding jobs or scheduling new jobs.
> 
> A user space proactive reclaimer is used in Google data centers.
> Additionally, Meta's TMO paper recently referenced a very similar
> interface used for user space proactive reclaim:
> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> 
> Benefits of a user space reclaimer:
> -----------------------------------
> 
> 1) More flexible on who should be charged for the cpu of the memory
> reclaim. For proactive reclaim, it makes more sense to be centralized.
> 
> 2) More flexible on dedicating the resources (like cpu). The memory
> overcommit controller can balance the cost between the cpu usage and
> the memory reclaimed.
> 
> 3) Provides a way to the applications to keep their LRUs sorted, so,
> under memory pressure better reclaim candidates are selected. This also
> gives more accurate and uptodate notion of working set for an
> application.
> 
> Why memory.high is not enough?
> ------------------------------
> 
> - memory.high can be used to trigger reclaim in a memcg and can
>   potentially be used for proactive reclaim.
>   However there is a big downside in using memory.high. It can potentially
>   introduce high reclaim stalls in the target application as the
>   allocations from the processes or the threads of the application can hit
>   the temporary memory.high limit.
> 
> - Userspace proactive reclaimers usually use feedback loops to decide
>   how much memory to proactively reclaim from a workload. The metrics
>   used for this are usually either refaults or PSI, and these metrics
>   will become messy if the application gets throttled by hitting the
>   high limit.
> 
> - memory.high is a stateful interface, if the userspace proactive
>   reclaimer crashes for any reason while triggering reclaim it can leave
>   the application in a bad state.
> 
> - If a workload is rapidly expanding, setting memory.high to proactively
>   reclaim memory can result in actually reclaiming more memory than
>   intended.
> 
> The benefits of such interface and shortcomings of existing interface
> were further discussed in this RFC thread:
> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> 
> Interface:
> ----------
> 
> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> trigger reclaim in the target memory cgroup.
> 
> 
> Possible Extensions:
> --------------------
> 
> - This interface can be extended with an additional parameter or flags
>   to allow specifying one or more types of memory to reclaim from (e.g.
>   file, anon, ..).
> 
> - The interface can also be extended with a node mask to reclaim from
>   specific nodes. This has use cases for reclaim-based demotion in memory
>   tiering systems.
> 
> - A similar per-node interface can also be added to support proactive
>   reclaim and reclaim-based demotion in systems without memcg.
> 
> For now, let's keep things simple by adding the basic functionality.

Yes, I am for simplicity, and this really looks like a bare minimum
interface. But it is not really clear how you want to add flags on
top of it.

I am not really sure we really need a node aware interface for memcg.
The global reclaim interface will likely need a different node because
we do not want to make this CONFIG_MEMCG constrained.
 
> [yosryahmed@google.com: refreshed to current master, updated commit
> message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

All that being said, I haven't been a great fan of explicit reclaim
triggered from userspace, but I do recognize that the limitations of the
existing interfaces are just too restrictive.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

Johannes Weiner April 1, 2022, 3:22 p.m. UTC | #20

On Thu, Mar 31, 2022 at 09:05:15PM -0700, Wei Xu wrote:
> On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >   potentially be used for proactive reclaim.
> >   However there is a big downside in using memory.high. It can potentially
> >   introduce high reclaim stalls in the target application as the
> >   allocations from the processes or the threads of the application can hit
> >   the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >   how much memory to proactively reclaim from a workload. The metrics
> >   used for this are usually either refaults or PSI, and these metrics
> >   will become messy if the application gets throttled by hitting the
> >   high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >   reclaimer crashes for any reason while triggering reclaim it can leave
> >   the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >   reclaim memory can result in actually reclaiming more memory than
> >   intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >   file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
> >
> > For now, let's keep things simple by adding the basic functionality.
> >
> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> >  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> >  2 files changed, 46 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..925aaabb2247 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> >         high limit is used and monitored properly, this limit's
> >         utility is limited to providing the final safety net.
> >
> > +  memory.reclaim
> > +       A write-only file which exists on non-root cgroups.
> > +
> > +       This is a simple interface to trigger memory reclaim in the
> > +       target cgroup. Write the number of bytes to reclaim to this
> > +       file and the kernel will try to reclaim that much memory.
> > +       Please note that the kernel can over or under reclaim from
> > +       the target cgroup.
> > +
> >    memory.oom.group
> >         A read-write single value file which exists on non-root
> >         cgroups.  The default value is "0".
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 725f76723220..994849fab7df 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >         return nbytes;
> >  }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +                             size_t nbytes, loff_t off)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +       unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +       int err;
> > +
> > +       buf = strstrip(buf);
> > +       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +       if (err)
> > +               return err;
> > +
> > +       while (nr_reclaimed < nr_to_reclaim) {
> > +               unsigned long reclaimed;
> > +
> > +               if (signal_pending(current))
> > +                       break;
> > +
> > +               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +                                               nr_to_reclaim - nr_reclaimed,
> > +                                               GFP_KERNEL, true);
> > +
> > +               if (!reclaimed && !nr_retries--)
> > +                       break;
> > +
> > +               nr_reclaimed += reclaimed;
> > +       }
> > +
> > +       return nbytes;
> 
> It is better to return an error code (e.g. -EBUSY) when
> memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory,
> except if the cgroup memory usage is already 0.  We can also return
> -EINVAL if nr_to_reclaim is too large (e.g. > limit).

For -EBUSY, are you thinking of a specific usecase where that would
come in handy? I'm not really opposed to it, but couldn't convince
myself of the practical benefits of it, either.

Keep in mind that MAX_RECLAIM_RETRIES failed reclaim attempts usually
constitute an OOM situation: memory.max will issue kills and
memory.high will begin crippling throttling. In what scenario would
you want to keep reclaiming a workload that is considered OOM?

Certainly, proactive reclaim that wants to purge only the cold tail of
the workload wouldn't retry. Meta's version of this patch actually
does return -EAGAIN on reclaim failure, but the userspace daemon
doesn't do anything with it, so I didn't bring it up.

For -EINVAL, I tend to lean more toward disagreeing. We've been trying
to avoid arbitrary dependencies between control knobs in cgroup2, just
because it exposes us to race conditions and adds complications to the
interface. For example, it *usually* doesn't make sense to set limits
to 0, or set local limits and protections higher than the parent. But
we allow it anyway, to avoid creating well-intended linting rules that
could interfere with somebody's unforeseen, legitimate usecase.
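As a rough illustration of the userspace side of this error-code discussion, the sketch below writes a request and reports the result as a negative errno value. The names request_reclaim() and should_back_off() are hypothetical, and the policy shown (treat -EAGAIN or -EBUSY as "stop until the next interval") is only one possible reading of the proactive-reclaim case described above. The patch as posted returns nbytes whether or not the target was met.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write an amount such as "10M" to a memory.reclaim-style file.
 * Returns 0 on success or a negative errno value on failure.
 */
static int request_reclaim(const char *path, const char *amount)
{
	size_t len = strlen(amount);
	ssize_t ret;
	int fd, err = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ret = write(fd, amount, len);
	if (ret < 0)
		err = -errno;		/* e.g. -EAGAIN in a variant that reports failure */
	else if ((size_t)ret != len)
		err = -EIO;		/* short write; not expected for this file */

	close(fd);
	return err;
}

/*
 * Hypothetical policy: a proactive reclaimer that only wants the cold
 * tail does not retry on reclaim failure, it waits for its next interval.
 */
static int should_back_off(int err)
{
	return err == -EAGAIN || err == -EBUSY;
}
```

A caller would pass the path of the target cgroup's memory.reclaim file; any path works for experimenting with the error handling.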

Shakeel Butt April 1, 2022, 3:41 p.m. UTC | #21

On Fri, Apr 1, 2022 at 2:16 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> > > +     {
> > > +             .name = "reclaim",
> > > +             .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > > +             .write = memory_reclaim,
> >
> > Btw, why not on root?
>
> I missed the root question in my first reply. I think this was
> originally modeled after the memory.high interface, but I don't know
> if there are other reasons. Shakeel would know better.
>
> AFAIK this should work naturally on root as well, but I think it makes
> more sense then to use a global interface (hopefully introduced soon)?
> I don't have an opinion here let me know what you prefer for v2.

We will follow the psi example which is exposed for root as well as
for system level in procfs but both of these (for memory.reclaim) are
planned as the followup feature.

Wei Xu April 1, 2022, 4:56 p.m. UTC | #22

On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >   potentially be used for proactive reclaim.
> >   However there is a big downside in using memory.high. It can potentially
> >   introduce high reclaim stalls in the target application as the
> >   allocations from the processes or the threads of the application can hit
> >   the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >   how much memory to proactively reclaim from a workload. The metrics
> >   used for this are usually either refaults or PSI, and these metrics
> >   will become messy if the application gets throttled by hitting the
> >   high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >   reclaimer crashes for any reason while triggering reclaim it can leave
> >   the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >   reclaim memory can result in actually reclaiming more memory than
> >   intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >   file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> > >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
> >
> > For now, let's keep things simple by adding the basic functionality.
>
> Yes, I am for simplicity, and this really looks like a bare minimum
> interface. But it is not really clear how you want to add flags on
> top of it.
>
> I am not really sure we really need a node aware interface for memcg.
> The global reclaim interface will likely need a different node because
> we do not want to make this CONFIG_MEMCG constrained.

A nodemask argument for memory.reclaim can be useful for memory
tiering between NUMA nodes with different performance.  Similar to
proactive reclaim, it can allow a userspace daemon to drive
memcg-based proactive demotion via the reclaim-based demotion
mechanism in the kernel.

> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> All that being said, I haven't been a great fan of explicit reclaim
> triggered from userspace, but I do recognize that the limitations of the
> existing interfaces are just too restrictive.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

Roman Gushchin April 1, 2022, 4:58 p.m. UTC | #23

On Fri, Apr 01, 2022 at 03:49:19PM +0200, Michal Hocko wrote:
> On Thu 31-03-22 10:25:23, Roman Gushchin wrote:
> > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> [...]
> > > - A similar per-node interface can also be added to support proactive
> > >   reclaim and reclaim-based demotion in systems without memcg.
> > 
> > Maybe an option to specify a timeout? That might simplify the userspace part.
> 
> What do you mean by timeout here? Isn't
> timeout $N echo $RECLAIM > ....
> 
> enough?

It's nice and simple when it's a bash script, but when it's a complex
application trying to do the same, it quickly becomes less simple and
likely will require a dedicated thread to avoid blocking the main app
for too long and a mechanism to unblock it by timer/when the need arises.

In my experience, using such semi-blocking interfaces correctly is tricky
(semi- because it's not clearly defined how much time the syscall can take
and whether it makes sense to wait longer).
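For comparison with the shell one-liner, a minimal single-threaded C sketch of the same idea follows. It relies on the signal_pending() check in the posted loop: a SIGALRM handler installed without SA_RESTART is assumed to be enough to make the kernel leave the reclaim loop. The function name reclaim_with_timeout() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static void on_alarm(int sig)
{
	(void)sig;	/* nothing to do: delivery alone interrupts the write */
}

/*
 * Write `amount` to `path`, arranging for SIGALRM after `seconds`.
 * Returns the number of bytes written or a negative errno value.
 */
static ssize_t reclaim_with_timeout(const char *path, const char *amount,
				    unsigned int seconds)
{
	struct sigaction sa, old;
	ssize_t ret;
	int fd;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;	/* no SA_RESTART: do not restart write() */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, &old) < 0)
		return -errno;

	fd = open(path, O_WRONLY);
	if (fd < 0) {
		ret = -errno;
		goto restore;
	}

	alarm(seconds);
	ret = write(fd, amount, strlen(amount));
	if (ret < 0)
		ret = -errno;
	alarm(0);			/* cancel a timer that did not fire */
	close(fd);
restore:
	sigaction(SIGALRM, &old, NULL);
	return ret;
}
```

Even this small version shows the cost: alarm() and the SIGALRM disposition are process-wide, so a multi-threaded application would need a dedicated thread or a per-thread timer instead.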

Roman Gushchin April 1, 2022, 6:39 p.m. UTC | #24

On Fri, Apr 01, 2022 at 02:11:51AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > > From: Shakeel Butt <shakeelb@google.com>
> > >
> > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> > >
> > > Use case: Proactive Reclaim
> > > ---------------------------
> > >
> > > A userspace proactive reclaimer can continuously probe the memcg to
> > > reclaim a small amount of memory. This gives more accurate and
> > > up-to-date workingset estimation as the LRUs are continuously
> > > sorted and can potentially provide more deterministic memory
> > > overcommit behavior. The memory overcommit controller can provide
> > > more proactive response to the changing behavior of the running
> > > applications instead of being reactive.
> > >
> > > A userspace reclaimer's purpose in this case is not a complete replacement
> > > for kswapd or direct reclaim, it is to proactively identify memory savings
> > > opportunities and reclaim some amount of cold pages set by the policy
> > > to free up the memory for more demanding jobs or scheduling new jobs.
> > >
> > > A user space proactive reclaimer is used in Google data centers.
> > > Additionally, Meta's TMO paper recently referenced a very similar
> > > interface used for user space proactive reclaim:
> > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> > >
> > > Benefits of a user space reclaimer:
> > > -----------------------------------
> > >
> > > 1) More flexible on who should be charged for the cpu of the memory
> > > reclaim. For proactive reclaim, it makes more sense to be centralized.
> > >
> > > 2) More flexible on dedicating the resources (like cpu). The memory
> > > overcommit controller can balance the cost between the cpu usage and
> > > the memory reclaimed.
> > >
> > > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > > under memory pressure better reclaim candidates are selected. This also
> > > gives more accurate and uptodate notion of working set for an
> > > application.
> > >
> > > Why memory.high is not enough?
> > > ------------------------------
> > >
> > > - memory.high can be used to trigger reclaim in a memcg and can
> > >   potentially be used for proactive reclaim.
> > >   However there is a big downside in using memory.high. It can potentially
> > >   introduce high reclaim stalls in the target application as the
> > >   allocations from the processes or the threads of the application can hit
> > >   the temporary memory.high limit.
> > >
> > > - Userspace proactive reclaimers usually use feedback loops to decide
> > >   how much memory to proactively reclaim from a workload. The metrics
> > >   used for this are usually either refaults or PSI, and these metrics
> > >   will become messy if the application gets throttled by hitting the
> > >   high limit.
> > >
> > > - memory.high is a stateful interface, if the userspace proactive
> > >   reclaimer crashes for any reason while triggering reclaim it can leave
> > >   the application in a bad state.
> > >
> > > - If a workload is rapidly expanding, setting memory.high to proactively
> > >   reclaim memory can result in actually reclaiming more memory than
> > >   intended.
> > >
> > > The benefits of such interface and shortcomings of existing interface
> > > were further discussed in this RFC thread:
> > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> >
> > Hello!
> >
> > I'm totally up for the proposed feature! It makes total sense and is proved
> > to be useful, let's add it.
> >
> > >
> > > Interface:
> > > ----------
> > >
> > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > > trigger reclaim in the target memory cgroup.
> > >
> > >
> > > Possible Extensions:
> > > --------------------
> > >
> > > - This interface can be extended with an additional parameter or flags
> > >   to allow specifying one or more types of memory to reclaim from (e.g.
> > >   file, anon, ..).
> > >
> > > - The interface can also be extended with a node mask to reclaim from
> > >   specific nodes. This has use cases for reclaim-based demotion in memory
> > > >   tiering systems.
> > >
> > > - A similar per-node interface can also be added to support proactive
> > >   reclaim and reclaim-based demotion in systems without memcg.
> >
> > Maybe an option to specify a timeout? That might simplify the userspace part.
> > Also, please please add a test to selftests/cgroup/memcg tests.
> > It will also provide an example on how the userspace can use the feature.
> >
> 
> Hi Roman, thanks for taking the time to review this!
> 
> A timeout can be a good extension, I will add it to the commit message
> in the next version in possible extensions.
> 
> I will add a test in v2, thanks!

Great, thank you!

> 
> >
> > >
> > > For now, let's keep things simple by adding the basic functionality.
> >
> > What I'm worried about is how we gonna extend it? How do you see the interface
> > with 2-3 extensions from the list above? All these extensions look very
> > reasonable to me, so we'll likely have to implement them soon. So let's think
> > about the extensibility now.
> >
> 
> My idea is to have these extensions as optional positional arguments
> (like Wei suggested), so that the interface does not get too
> complicated for users who don't care about tuning these options. If
> this is the case then I think there is nothing to worry about.
> Otherwise, if you think some of these options make sense to be a
> required argument instead, we can rethink the initial interface.

The interface you're proposing is not really extensible, so we'll likely need to
introduce a new interface like memory.reclaim_ext very soon. Why not create
an extensible API from scratch?

I'm looking at the cgroup v2 documentation, which describes the various
interface file formats, and given the number of potential optional arguments
the best option seems to be nested keyed (please refer to the Interface Files
section).

E.g. the format can be:
echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim

We can say that now we don't support any keyed arguments, but they can be
added in the future.

Basically you don't even need to change any code, only document the interface
properly, so we can extend it later without breaking the API.

> 
> > I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> > In the end, such a feature might make sense on the system level too.
> > Yes, there is the drop_caches sysctl, but it's too radical for many cases.
> >
> 
> I think in the RFC discussion there was consensus to add both a
> per-memcg knob, as well as per-node / per-system knobs (through sysfs
> or syscalls) later. Wei also points out that it's not common for a
> syscall to have a cgroup argument.

Actually there are examples (e.g. sys_bpf), but my only point is to make
the API extensible, so maybe syscall is not the best idea.

I'd add the root level interface from scratch: the code change is simple
and it makes sense as a feature. Then likely we don't really need another
system-level interface at all.

Thanks!

Wei Xu April 1, 2022, 8:14 p.m. UTC | #25

On Fri, Apr 1, 2022 at 8:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Mar 31, 2022 at 09:05:15PM -0700, Wei Xu wrote:
> > On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > From: Shakeel Butt <shakeelb@google.com>
> > >
> > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> > >
> > > Use case: Proactive Reclaim
> > > ---------------------------
> > >
> > > A userspace proactive reclaimer can continuously probe the memcg to
> > > reclaim a small amount of memory. This gives more accurate and
> > > up-to-date workingset estimation as the LRUs are continuously
> > > sorted and can potentially provide more deterministic memory
> > > overcommit behavior. The memory overcommit controller can provide
> > > more proactive response to the changing behavior of the running
> > > applications instead of being reactive.
> > >
> > > A userspace reclaimer's purpose in this case is not a complete replacement
> > > for kswapd or direct reclaim, it is to proactively identify memory savings
> > > opportunities and reclaim some amount of cold pages set by the policy
> > > to free up the memory for more demanding jobs or scheduling new jobs.
> > >
> > > A user space proactive reclaimer is used in Google data centers.
> > > Additionally, Meta's TMO paper recently referenced a very similar
> > > interface used for user space proactive reclaim:
> > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> > >
> > > Benefits of a user space reclaimer:
> > > -----------------------------------
> > >
> > > 1) More flexible on who should be charged for the cpu of the memory
> > > reclaim. For proactive reclaim, it makes more sense to be centralized.
> > >
> > > 2) More flexible on dedicating the resources (like cpu). The memory
> > > overcommit controller can balance the cost between the cpu usage and
> > > the memory reclaimed.
> > >
> > > 3) Provides a way for applications to keep their LRUs sorted, so that
> > > better reclaim candidates are selected under memory pressure. This also
> > > gives a more accurate and up-to-date notion of the working set for an
> > > application.
> > >
> > > Why memory.high is not enough?
> > > ------------------------------
> > >
> > > - memory.high can be used to trigger reclaim in a memcg and can
> > >   potentially be used for proactive reclaim.
> > >   However there is a big downside in using memory.high. It can potentially
> > >   introduce high reclaim stalls in the target application as the
> > >   allocations from the processes or the threads of the application can hit
> > >   the temporary memory.high limit.
> > >
> > > - Userspace proactive reclaimers usually use feedback loops to decide
> > >   how much memory to proactively reclaim from a workload. The metrics
> > >   used for this are usually either refaults or PSI, and these metrics
> > >   will become messy if the application gets throttled by hitting the
> > >   high limit.
> > >
> > > - memory.high is a stateful interface, if the userspace proactive
> > >   reclaimer crashes for any reason while triggering reclaim it can leave
> > >   the application in a bad state.
> > >
> > > - If a workload is rapidly expanding, setting memory.high to proactively
> > >   reclaim memory can result in actually reclaiming more memory than
> > >   intended.
> > >
> > > The benefits of such an interface and the shortcomings of the existing
> > > interfaces were further discussed in this RFC thread:
> > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> > >
> > > Interface:
> > > ----------
> > >
> > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > > trigger reclaim in the target memory cgroup.
> > >
> > >
> > > Possible Extensions:
> > > --------------------
> > >
> > > - This interface can be extended with an additional parameter or flags
> > >   to allow specifying one or more types of memory to reclaim from (e.g.
> > >   file, anon, ..).
> > >
> > > - The interface can also be extended with a node mask to reclaim from
> > >   specific nodes. This has use cases for reclaim-based demotion in memory
> > >   tiering systems.
> > >
> > > - A similar per-node interface can also be added to support proactive
> > >   reclaim and reclaim-based demotion in systems without memcg.
> > >
> > > For now, let's keep things simple by adding the basic functionality.
> > >
> > > [yosryahmed@google.com: refreshed to current master, updated commit
> > > message based on recent discussions and use cases]
> > > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > > ---
> > >  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
> > >  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
> > >  2 files changed, 46 insertions(+)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 69d7a6983f78..925aaabb2247 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
> > >         high limit is used and monitored properly, this limit's
> > >         utility is limited to providing the final safety net.
> > >
> > > +  memory.reclaim
> > > +       A write-only file which exists on non-root cgroups.
> > > +
> > > +       This is a simple interface to trigger memory reclaim in the
> > > +       target cgroup. Write the number of bytes to reclaim to this
> > > +       file and the kernel will try to reclaim that much memory.
> > > +       Please note that the kernel can over or under reclaim from
> > > +       the target cgroup.
> > > +
> > >    memory.oom.group
> > >         A read-write single value file which exists on non-root
> > >         cgroups.  The default value is "0".
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 725f76723220..994849fab7df 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> > >         return nbytes;
> > >  }
> > >
> > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > > +                             size_t nbytes, loff_t off)
> > > +{
> > > +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > > +       unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > > +       unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > +       int err;
> > > +
> > > +       buf = strstrip(buf);
> > > +       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > > +       if (err)
> > > +               return err;
> > > +
> > > +       while (nr_reclaimed < nr_to_reclaim) {
> > > +               unsigned long reclaimed;
> > > +
> > > +               if (signal_pending(current))
> > > +                       break;
> > > +
> > > +               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > +                                               nr_to_reclaim - nr_reclaimed,
> > > +                                               GFP_KERNEL, true);
> > > +
> > > +               if (!reclaimed && !nr_retries--)
> > > +                       break;
> > > +
> > > +               nr_reclaimed += reclaimed;
> > > +       }
> > > +
> > > +       return nbytes;
> >
> > It is better to return an error code (e.g. -EBUSY) when
> > memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory,
> > except if the cgroup memory usage is already 0.  We can also return
> > -EINVAL if nr_to_reclaim is too large (e.g. > limit).
>
> For -EBUSY, are you thinking of a specific usecase where that would
> come in handy? I'm not really opposed to it, but couldn't convince
> myself of the practical benefits of it, either.
>
> Keep in mind that MAX_RECLAIM_RETRIES failed reclaim attempts usually
> constitute an OOM situation: memory.max will issue kills and
> memory.high will begin crippling throttling. In what scenario would
> you want to keep reclaiming a workload that is considered OOM?
>
> Certainly, proactive reclaim that wants to purge only the cold tail of
> the workload wouldn't retry. Meta's version of this patch actually
> does return -EAGAIN on reclaim failure, but the userspace daemon
> doesn't do anything with it, so I didn't bring it up.

-EAGAIN sounds good, too.  Given that userspace requests to
reclaim a specified number of bytes, I think it is generally better to
tell userspace whether the request has been successfully
fulfilled. Ideally, it would be even better to return how many bytes
have been reclaimed, though that is not easy to do through the
cgroup interface. Userspace can choose to ignore the return value
or log a message/update some stats (which Google does) for
monitoring purposes.
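
As a rough illustration of the userspace side, here is a minimal C sketch
of a caller that distinguishes a fulfilled request from a short one. It
assumes the -EAGAIN behaviour discussed above, which the patch as posted
does not implement (it always returns nbytes). The names request_reclaim(),
RECLAIM_DONE and RECLAIM_PARTIAL are hypothetical, and the path to
memory.reclaim is supplied by the caller.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; negative values are -errno. */
enum { RECLAIM_DONE = 0, RECLAIM_PARTIAL = 1 };

/*
 * Write an amount such as "10M" to a memory.reclaim file.
 * Returns RECLAIM_DONE if the whole request was accepted,
 * RECLAIM_PARTIAL if the kernel reported EAGAIN (assumed semantics:
 * less than the requested amount was reclaimed), or -errno otherwise.
 */
int request_reclaim(const char *path, const char *amount)
{
	size_t len;
	ssize_t n;
	int fd, err;

	assert(path != NULL && amount != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	len = strlen(amount);
	n = write(fd, amount, len);
	err = (n < 0) ? errno : 0;   /* save errno before close() can change it */
	close(fd);

	if (n == (ssize_t)len)
		return RECLAIM_DONE;
	if (err == EAGAIN)
		return RECLAIM_PARTIAL;
	return err ? -err : -EIO;    /* short write without an error code */
}
```

A monitoring daemon could count RECLAIM_PARTIAL results in its stats and
otherwise carry on, which matches the "ignore or log" options above.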

> For -EINVAL, I tend to lean more toward disagreeing. We've been trying
> to avoid arbitrary dependencies between control knobs in cgroup2, just
> because it exposes us to race conditions and adds complications to the
> interface. For example, it *usually* doesn't make sense to set limits
> to 0, or set local limits and protections higher than the parent. But
> we allow it anyway, to avoid creating well-intended linting rules that
> could interfere with somebody's unforeseen, legitimate usecase.

OK, let's then not check against the limit.

Johannes Weiner April 1, 2022, 9:07 p.m. UTC | #26

On Fri, Apr 01, 2022 at 01:14:35PM -0700, Wei Xu wrote:
> On Fri, Apr 1, 2022 at 8:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, Mar 31, 2022 at 09:05:15PM -0700, Wei Xu wrote:
> > > It is better to return an error code (e.g. -EBUSY) when
> > > memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory,
> > > except if the cgroup memory usage is already 0.  We can also return
> > > -EINVAL if nr_to_reclaim is too large (e.g. > limit).
> >
> > For -EBUSY, are you thinking of a specific usecase where that would
> > come in handy? I'm not really opposed to it, but couldn't convince
> > myself of the practical benefits of it, either.
> >
> > Keep in mind that MAX_RECLAIM_RETRIES failed reclaim attempts usually
> > constitute an OOM situation: memory.max will issue kills and
> > memory.high will begin crippling throttling. In what scenario would
> > you want to keep reclaiming a workload that is considered OOM?
> >
> > Certainly, proactive reclaim that wants to purge only the cold tail of
> > the workload wouldn't retry. Meta's version of this patch actually
> > does return -EAGAIN on reclaim failure, but the userspace daemon
> > doesn't do anything with it, so I didn't bring it up.
> 
> -EAGAIN sounds good, too.  Given that userspace requests to
> reclaim a specified number of bytes, I think it is generally better to
> tell userspace whether the request has been successfully
> fulfilled. Ideally, it would be even better to return how many bytes
> have been reclaimed, though that is not easy to do through the
> cgroup interface. Userspace can choose to ignore the return value
> or log a message/update some stats (which Google does) for
> monitoring purposes.

Fair enough, thanks for your thoughts. No objection from me!

Johannes Weiner April 1, 2022, 9:13 p.m. UTC | #27

On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> The interface you're proposing is not really extensible, so we'll likely need to
> introduce a new interface like memory.reclaim_ext very soon. Why not create
> an extensible API from scratch?
> 
> I'm looking at cgroup v2 documentation which describes various interface files
> formats and it seems like given the number of potential optional arguments
> the best option is nested keyed (please, refer to the Interface Files section).
> 
> E.g. the format can be:
> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim

Yeah, that syntax looks perfect.

But why do you think it's not extensible from the current patch? We
can add those arguments one by one as we agree on them, and return
-EINVAL if somebody passes an unknown parameter.

It seems to me the current proposal is forward-compatible that way
(with the current set of keyword params being the empty set :-))
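
To make the forward-compatibility argument concrete, here is a small
userspace C sketch of a parser for the proposed syntax: a leading amount
followed by optional key=value pairs, with -EINVAL for anything unknown.
It is an illustration only, not the kernel parser. The struct and function
names are hypothetical, and the keys "type", "nodemask" and "timeout" are
taken from the example string in this thread; none of them is supported by
the patch as posted.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical userspace view of a memory.reclaim request. */
struct reclaim_req {
	char amount[32];   /* leading value, e.g. "1G" */
	char type[16];     /* "type=" value, empty if absent */
	char nodemask[32]; /* "nodemask=" value, empty if absent */
	char timeout[16];  /* "timeout=" value, empty if absent */
};

/* Copy val into dst, rejecting values that do not fit. */
static int copy_val(char *dst, size_t dstsz, const char *val)
{
	if (strlen(val) >= dstsz)
		return -EINVAL;
	strcpy(dst, val);
	return 0;
}

/*
 * Split "AMOUNT [key=value ...]" on whitespace. The first token is the
 * amount; every later token must be a known key=value pair. Unknown
 * keys are rejected with -EINVAL.
 */
int parse_reclaim_req(const char *spec, struct reclaim_req *req)
{
	const char *p;
	int seen_amount = 0;

	assert(spec != NULL && req != NULL);
	memset(req, 0, sizeof(*req));

	for (p = spec; ; ) {
		char tok[64];
		char *eq;
		size_t len;
		int err;

		p += strspn(p, " \t\n");        /* skip separators */
		len = strcspn(p, " \t\n");      /* length of next token */
		if (len == 0)
			break;
		if (len >= sizeof(tok))
			return -EINVAL;
		memcpy(tok, p, len);
		tok[len] = '\0';
		p += len;

		eq = strchr(tok, '=');
		if (!seen_amount) {
			if (eq)
				return -EINVAL; /* amount must come first */
			err = copy_val(req->amount, sizeof(req->amount), tok);
			seen_amount = 1;
		} else if (!eq || eq == tok || eq[1] == '\0') {
			return -EINVAL;         /* malformed key=value */
		} else {
			*eq = '\0';
			if (!strcmp(tok, "type"))
				err = copy_val(req->type, sizeof(req->type), eq + 1);
			else if (!strcmp(tok, "nodemask"))
				err = copy_val(req->nodemask, sizeof(req->nodemask), eq + 1);
			else if (!strcmp(tok, "timeout"))
				err = copy_val(req->timeout, sizeof(req->timeout), eq + 1);
			else
				err = -EINVAL;  /* unknown key */
		}
		if (err)
			return err;
	}
	return seen_amount ? 0 : -EINVAL;
}
```

With an empty set of known keys, the same structure accepts only the bare
amount and rejects everything else, which is the behaviour described above
for the current patch.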

Roman Gushchin April 1, 2022, 9:21 p.m. UTC | #28

> On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
>> The interface you're proposing is not really extensible, so we'll likely need to
>> introduce a new interface like memory.reclaim_ext very soon. Why not create
>> an extensible API from scratch?
>> 
>> I'm looking at cgroup v2 documentation which describes various interface files
>> formats and it seems like given the number of potential optional arguments
>> the best option is nested keyed (please, refer to the Interface Files section).
>> 
>> E.g. the format can be:
>> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> 
> Yeah, that syntax looks perfect.
> 
> But why do you think it's not extensible from the current patch? We
> can add those arguments one by one as we agree on them, and return
> -EINVAL if somebody passes an unknown parameter.
> 
> It seems to me the current proposal is forward-compatible that way
> (with the current set of keyword params being the empty set :-))

It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)

So yeah, let’s just describe it properly in the documentation, no code changes are needed.

Wei Xu April 1, 2022, 9:38 p.m. UTC | #29

On Fri, Apr 1, 2022 at 2:21 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> > On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> >> The interface you're proposing is not really extensible, so we'll likely need to
> >> introduce a new interface like memory.reclaim_ext very soon. Why not create
> >> an extensible API from scratch?
> >>
> >> I'm looking at cgroup v2 documentation which describes various interface files
> >> formats and it seems like given the number of potential optional arguments
> >> the best option is nested keyed (please, refer to the Interface Files section).
> >>
> >> E.g. the format can be:
> >> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> >
> > Yeah, that syntax looks perfect.
> >

I agree this is a better syntax than positional arguments. The latter
would require a default value be specified for each earlier argument
if we just want to provide a custom value for a later argument.

> > But why do you think it's not extensible from the current patch? We
> > can add those arguments one by one as we agree on them, and return
> > -EINVAL if somebody passes an unknown parameter.
> >
> > It seems to me the current proposal is forward-compatible that way
> > (with the current set of keyword params being the empty set :-))
>
> It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)
> So yeah, let’s just describe it properly in the documentation, no code changes are needed.

Johannes Weiner April 1, 2022, 9:51 p.m. UTC | #30

On Fri, Apr 01, 2022 at 02:21:52PM -0700, Roman Gushchin wrote:
> > On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> >> The interface you're proposing is not really extensible, so we'll likely need to
> >> introduce a new interface like memory.reclaim_ext very soon. Why not create
> >> an extensible API from scratch?
> >> 
> >> I'm looking at cgroup v2 documentation which describes various interface files
> >> formats and it seems like given the number of potential optional arguments
> >> the best option is nested keyed (please, refer to the Interface Files section).
> >> 
> >> E.g. the format can be:
> >> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> > 
> > Yeah, that syntax looks perfect.
> > 
> > But why do you think it's not extensible from the current patch? We
> > can add those arguments one by one as we agree on them, and return
> > -EINVAL if somebody passes an unknown parameter.
> > 
> > It seems to me the current proposal is forward-compatible that way
> > (with the current set of keyword params being the empty set :-))
> 
> It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)
> 
> So yeah, let’s just describe it properly in the documentation, no code changes are needed.

Sounds good to me!

Huang, Ying April 2, 2022, 8:13 a.m. UTC | #31

Wei Xu <weixugc@google.com> writes:

> On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
>> > From: Shakeel Butt <shakeelb@google.com>
>> >

[snip]

>> > Possible Extensions:
>> > --------------------
>> >
>> > - This interface can be extended with an additional parameter or flags
>> >   to allow specifying one or more types of memory to reclaim from (e.g.
>> >   file, anon, ..).
>> >
>> > - The interface can also be extended with a node mask to reclaim from
>> >   specific nodes. This has use cases for reclaim-based demotion in memory
>> >   tiering systems.
>> >
>> > - A similar per-node interface can also be added to support proactive
>> >   reclaim and reclaim-based demotion in systems without memcg.
>> >
>> > For now, let's keep things simple by adding the basic functionality.
>>
>> Yes, I am for the simplicity and this really looks like a bare minimum
>> interface. But it is not really clear how you want to add flags on
>> top of it.
>>
>> I am not really sure we really need a node aware interface for memcg.
>> The global reclaim interface will likely need a different node because
>> we do not want to make this CONFIG_MEMCG constrained.
>
> A nodemask argument for memory.reclaim can be useful for memory
> tiering between NUMA nodes with different performance.  Similar to
> proactive reclaim, it can allow a userspace daemon to drive
> memcg-based proactive demotion via the reclaim-based demotion
> mechanism in the kernel.

I am not sure whether nodemask is a good way for demoting pages between
different types of memory.  For example, for a system with DRAM and
PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
is the meaning of specifying PMEM node? reclaiming to disk?

I have no objection to the idea in general.  But we should
have a clear and consistent interface.  Per my understanding, the default
memcg interface is for memory, regardless of memory type.  Memory
reclaiming means reducing memory usage, regardless of memory type.
We need to either extend the semantics of memory reclaiming (to
include memory demotion too), or add another interface for memory
demotion.

Best Regards,
Huang, Ying

[snip]

Wei Xu April 3, 2022, 6:46 a.m. UTC | #32

On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >
>
> [snip]
>
> >> > Possible Extensions:
> >> > --------------------
> >> >
> >> > - This interface can be extended with an additional parameter or flags
> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >> >   file, anon, ..).
> >> >
> >> > - The interface can also be extended with a node mask to reclaim from
> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >> >   tiering systems.
> >> >
> >> > - A similar per-node interface can also be added to support proactive
> >> >   reclaim and reclaim-based demotion in systems without memcg.
> >> >
> >> > For now, let's keep things simple by adding the basic functionality.
> >>
> >> Yes, I am for the simplicity and this really looks like a bare minimum
> >> interface. But it is not really clear how you want to add flags on
> >> top of it.
> >>
> >> I am not really sure we really need a node aware interface for memcg.
> >> The global reclaim interface will likely need a different node because
> >> we do not want to make this CONFIG_MEMCG constrained.
> >
> > A nodemask argument for memory.reclaim can be useful for memory
> > tiering between NUMA nodes with different performance.  Similar to
> > proactive reclaim, it can allow a userspace daemon to drive
> > memcg-based proactive demotion via the reclaim-based demotion
> > mechanism in the kernel.
>
> I am not sure whether nodemask is a good way for demoting pages between
> different types of memory.  For example, for a system with DRAM and
> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
> is the meaning of specifying PMEM node? reclaiming to disk?
>
> I have no objection to the idea in general.  But we should
> have a clear and consistent interface.  Per my understanding, the default
> memcg interface is for memory, regardless of memory type.  Memory
> reclaiming means reducing memory usage, regardless of memory type.
> We need to either extend the semantics of memory reclaiming (to
> include memory demotion too), or add another interface for memory
> demotion.

Wei Xu April 3, 2022, 6:56 a.m. UTC | #33

On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >
>
> [snip]
>
> >> > Possible Extensions:
> >> > --------------------
> >> >
> >> > - This interface can be extended with an additional parameter or flags
> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >> >   file, anon, ..).
> >> >
> >> > - The interface can also be extended with a node mask to reclaim from
> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >> >   tiering systems.
> >> >
> >> > - A similar per-node interface can also be added to support proactive
> >> >   reclaim and reclaim-based demotion in systems without memcg.
> >> >
> >> > For now, let's keep things simple by adding the basic functionality.
> >>
> >> Yes, I am for the simplicity and this really looks like a bare minimum
> >> interface. But it is not really clear how you want to add flags on
> >> top of it.
> >>
> >> I am not really sure we really need a node aware interface for memcg.
> >> The global reclaim interface will likely need a different node because
> >> we do not want to make this CONFIG_MEMCG constrained.
> >
> > A nodemask argument for memory.reclaim can be useful for memory
> > tiering between NUMA nodes with different performance.  Similar to
> > proactive reclaim, it can allow a userspace daemon to drive
> > memcg-based proactive demotion via the reclaim-based demotion
> > mechanism in the kernel.
>
> I am not sure whether nodemask is a good way for demoting pages between
> different types of memory.  For example, for a system with DRAM and
> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
> is the meaning of specifying PMEM node? reclaiming to disk?
>
> I have no objection to the idea in general.  But we should
> have a clear and consistent interface.  Per my understanding, the default
> memcg interface is for memory, regardless of memory type.  Memory
> reclaiming means reducing memory usage, regardless of memory type.
> We need to either extend the semantics of memory reclaiming (to
> include memory demotion too), or add another interface for memory
> demotion.

Good point.  With the "demote pages during reclaim" patch series,
reclaim is already extended to demote pages as well.  For example,
can_reclaim_anon_pages() returns true if demotion is allowed and
shrink_page_list() can demote pages instead of reclaiming pages.

Currently, demotion is disabled for memcg reclaim.  I think this can be
relaxed, and relaxing it is also necessary for memcg-based proactive
demotion.  I'd
like to suggest that we extend the semantics of memory.reclaim to
cover memory demotion as well.  A flag can be used to enable/disable
the demotion behavior.

Vaibhav Jain April 4, 2022, 3:50 a.m. UTC | #34

Apologies for the delayed response,

Yosry Ahmed <yosryahmed@google.com> writes:

> On Fri, Apr 1, 2022 at 1:39 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>>
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>> > From: Shakeel Butt <shakeelb@google.com>
>> >
>> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
>> <snip>
>>
>> > +
>> > +     while (nr_reclaimed < nr_to_reclaim) {
>> > +             unsigned long reclaimed;
>> > +
>> > +             if (signal_pending(current))
>> > +                     break;
>> > +
>> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
>> > +                                             nr_to_reclaim - nr_reclaimed,
>> > +                                             GFP_KERNEL, true);
>> > +
>> > +             if (!reclaimed && !nr_retries--)
>> > +                     break;
>> > +
>> > +             nr_reclaimed += reclaimed;
>>
>> I think there should be a cond_resched() in this loop before
>> try_to_free_mem_cgroup_pages() to have better chances of reclaim
>> succeeding early.
>>
> Thanks for taking the time to look at this!
>
> I believe this loop is modeled after the loop in memory_high_write()
> for the memory.high interface. Is there a reason why it should be
> needed here but not there?
>

memory_high_write() calls drain_all_stock() at least once before calling
try_to_free_mem_cgroup_pages(). This drains all percpu stocks
for the given memcg and its descendants, giving
try_to_free_mem_cgroup_pages() a high chance of succeeding quickly. Such
functionality is missing from this patch.

Adding a cond_resched() would at least give other processes
within the memcg a chance to run and make forward progress, thereby making
more pages available for reclaim.

The suggestion is partly based on __perform_reclaim(), which issues a
cond_resched() because it may get called repeatedly in the direct reclaim
path.

>> <snip>
>>
>> --
>> Cheers
>> ~ Vaibhav
>

Michal Hocko April 4, 2022, 8:44 a.m. UTC | #35

On Fri 01-04-22 09:58:59, Roman Gushchin wrote:
> On Fri, Apr 01, 2022 at 03:49:19PM +0200, Michal Hocko wrote:
> > On Thu 31-03-22 10:25:23, Roman Gushchin wrote:
> > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > [...]
> > > > - A similar per-node interface can also be added to support proactive
> > > >   reclaim and reclaim-based demotion in systems without memcg.
> > > 
> > > Maybe an option to specify a timeout? That might simplify the userspace part.
> > 
> > What do you mean by timeout here? Isn't
> > timeout $N echo $RECLAIM > ....
> > 
> > enough?
> 
> It's nice and simple when it's a bash script, but when it's a complex
> application trying to do the same, it quickly becomes less simple and
> likely will require a dedicated thread to avoid blocking the main app
> for too long and a mechanism to unblock it by timer/when the need arises.
> 
> In my experience, using such semi-blocking interfaces correctly (semi- because
> it's not clearly defined how much time the syscall can take and whether it
> makes sense to wait longer) is tricky.

We have the same approach to setting other limits which need to perform
the reclaim. Have we ever hit that as a limitation that would make
userspace unnecessarily complex?

Shakeel Butt April 4, 2022, 5:08 p.m. UTC | #36

On Fri, Apr 1, 2022 at 1:14 PM Wei Xu <weixugc@google.com> wrote:
>
[...]
>
> -EAGAIN sounds good, too.  Given that userspace requests to
> reclaim a specified number of bytes, I think it is generally better to
> tell userspace whether the request has been successfully
> fulfilled. Ideally, it would be even better to return how many bytes
> have been reclaimed, though that is not easy to do through the
> cgroup interface.

What would be the challenge in returning the number of bytes reclaimed
through the cgroup interface?

Yosry Ahmed April 4, 2022, 5:09 p.m. UTC | #37

On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >
> > Use case: Proactive Reclaim
> > ---------------------------
> >
> > A userspace proactive reclaimer can continuously probe the memcg to
> > reclaim a small amount of memory. This gives more accurate and
> > up-to-date workingset estimation as the LRUs are continuously
> > sorted and can potentially provide more deterministic memory
> > overcommit behavior. The memory overcommit controller can provide
> > more proactive response to the changing behavior of the running
> > applications instead of being reactive.
> >
> > A userspace reclaimer's purpose in this case is not a complete replacement
> > for kswapd or direct reclaim, it is to proactively identify memory savings
> > opportunities and reclaim some amount of cold pages set by the policy
> > to free up the memory for more demanding jobs or scheduling new jobs.
> >
> > A user space proactive reclaimer is used in Google data centers.
> > Additionally, Meta's TMO paper recently referenced a very similar
> > interface used for user space proactive reclaim:
> > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> >
> > Benefits of a user space reclaimer:
> > -----------------------------------
> >
> > 1) More flexible on who should be charged for the cpu of the memory
> > reclaim. For proactive reclaim, it makes more sense to be centralized.
> >
> > 2) More flexible on dedicating the resources (like cpu). The memory
> > overcommit controller can balance the cost between the cpu usage and
> > the memory reclaimed.
> >
> > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > under memory pressure better reclaim candidates are selected. This also
> > gives more accurate and uptodate notion of working set for an
> > application.
> >
> > Why memory.high is not enough?
> > ------------------------------
> >
> > - memory.high can be used to trigger reclaim in a memcg and can
> >   potentially be used for proactive reclaim.
> >   However there is a big downside in using memory.high. It can potentially
> >   introduce high reclaim stalls in the target application as the
> >   allocations from the processes or the threads of the application can hit
> >   the temporary memory.high limit.
> >
> > - Userspace proactive reclaimers usually use feedback loops to decide
> >   how much memory to proactively reclaim from a workload. The metrics
> >   used for this are usually either refaults or PSI, and these metrics
> >   will become messy if the application gets throttled by hitting the
> >   high limit.
> >
> > - memory.high is a stateful interface, if the userspace proactive
> >   reclaimer crashes for any reason while triggering reclaim it can leave
> >   the application in a bad state.
> >
> > - If a workload is rapidly expanding, setting memory.high to proactively
> >   reclaim memory can result in actually reclaiming more memory than
> >   intended.
> >
> > The benefits of such interface and shortcomings of existing interface
> > were further discussed in this RFC thread:
> > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> >
> > Interface:
> > ----------
> >
> > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > trigger reclaim in the target memory cgroup.
> >
> >
> > Possible Extensions:
> > --------------------
> >
> > - This interface can be extended with an additional parameter or flags
> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >   file, anon, ..).
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> > >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
> >
> > For now, let's keep things simple by adding the basic functionality.
>
> Yes, I am for the simplicity and this really looks like a bare minimum
> interface. But it is not really clear who do you want to add flags on
> top of it?
>

Mostly I (or someone at Google) will follow up with patches to add
most of these features. We just wanted to get consensus on the bare
minimum interface first, and to avoid derailing this discussion with
whether or not we need each of those features and what the best way to
implement them is.

> I am not really sure we really need a node aware interface for memcg.
> The global reclaim interface will likely need a different node because
> we do not want to make this CONFIG_MEMCG constrained.
>

The main use case, as Wei mentioned, is memcg-based proactive demotion
via the reclaim-based demotion mechanism in the kernel. We can still
have a nodemask argument to the global reclaim interface as well.


> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> All that being said, I haven't been a great fan of explicit reclaim
> triggered from userspace, but I do recognize that the limitations of the
> existing interfaces are just too restrictive.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

Yosry Ahmed April 4, 2022, 5:13 p.m. UTC | #38

On Fri, Apr 1, 2022 at 11:39 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Fri, Apr 01, 2022 at 02:11:51AM -0700, Yosry Ahmed wrote:
> > On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > > > From: Shakeel Butt <shakeelb@google.com>
> > > >
> > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> > > >
> > > > Use case: Proactive Reclaim
> > > > ---------------------------
> > > >
> > > > A userspace proactive reclaimer can continuously probe the memcg to
> > > > reclaim a small amount of memory. This gives more accurate and
> > > > up-to-date workingset estimation as the LRUs are continuously
> > > > sorted and can potentially provide more deterministic memory
> > > > overcommit behavior. The memory overcommit controller can provide
> > > > more proactive response to the changing behavior of the running
> > > > applications instead of being reactive.
> > > >
> > > > A userspace reclaimer's purpose in this case is not a complete replacement
> > > > for kswapd or direct reclaim, it is to proactively identify memory savings
> > > > opportunities and reclaim some amount of cold pages set by the policy
> > > > to free up the memory for more demanding jobs or scheduling new jobs.
> > > >
> > > > A user space proactive reclaimer is used in Google data centers.
> > > > Additionally, Meta's TMO paper recently referenced a very similar
> > > > interface used for user space proactive reclaim:
> > > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> > > >
> > > > Benefits of a user space reclaimer:
> > > > -----------------------------------
> > > >
> > > > 1) More flexible on who should be charged for the cpu of the memory
> > > > reclaim. For proactive reclaim, it makes more sense to be centralized.
> > > >
> > > > 2) More flexible on dedicating the resources (like cpu). The memory
> > > > overcommit controller can balance the cost between the cpu usage and
> > > > the memory reclaimed.
> > > >
> > > > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > > > under memory pressure better reclaim candidates are selected. This also
> > > > gives more accurate and uptodate notion of working set for an
> > > > application.
> > > >
> > > > Why memory.high is not enough?
> > > > ------------------------------
> > > >
> > > > - memory.high can be used to trigger reclaim in a memcg and can
> > > >   potentially be used for proactive reclaim.
> > > >   However there is a big downside in using memory.high. It can potentially
> > > >   introduce high reclaim stalls in the target application as the
> > > >   allocations from the processes or the threads of the application can hit
> > > >   the temporary memory.high limit.
> > > >
> > > > - Userspace proactive reclaimers usually use feedback loops to decide
> > > >   how much memory to proactively reclaim from a workload. The metrics
> > > >   used for this are usually either refaults or PSI, and these metrics
> > > >   will become messy if the application gets throttled by hitting the
> > > >   high limit.
> > > >
> > > > - memory.high is a stateful interface, if the userspace proactive
> > > >   reclaimer crashes for any reason while triggering reclaim it can leave
> > > >   the application in a bad state.
> > > >
> > > > - If a workload is rapidly expanding, setting memory.high to proactively
> > > >   reclaim memory can result in actually reclaiming more memory than
> > > >   intended.
> > > >
> > > > The benefits of such interface and shortcomings of existing interface
> > > > were further discussed in this RFC thread:
> > > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> > >
> > > Hello!
> > >
> > > I'm totally up for the proposed feature! It makes total sense and is proved
> > > to be useful, let's add it.
> > >
> > > >
> > > > Interface:
> > > > ----------
> > > >
> > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > > > trigger reclaim in the target memory cgroup.
> > > >
> > > >
> > > > Possible Extensions:
> > > > --------------------
> > > >
> > > > - This interface can be extended with an additional parameter or flags
> > > >   to allow specifying one or more types of memory to reclaim from (e.g.
> > > >   file, anon, ..).
> > > >
> > > > - The interface can also be extended with a node mask to reclaim from
> > > >   specific nodes. This has use cases for reclaim-based demotion in memory
> > > >   tiering systems.
> > > >
> > > > - A similar per-node interface can also be added to support proactive
> > > >   reclaim and reclaim-based demotion in systems without memcg.
> > >
> > > Maybe an option to specify a timeout? That might simplify the userspace part.
> > > Also, please please add a test to selftests/cgroup/memcg tests.
> > > It will also provide an example on how the userspace can use the feature.
> > >
> >
> > Hi Roman, thanks for taking the time to review this!
> >
> > A timeout can be a good extension, I will add it to the commit message
> > in the next version in possible extensions.
> >
> > I will add a test in v2, thanks!
>
> Great, thank you!
>
> >
> > >
> > > >
> > > > For now, let's keep things simple by adding the basic functionality.
> > >
> > > What I'm worried about is how we gonna extend it? How do you see the interface
> > > with 2-3 extensions from the list above? All these extensions look very
> > > reasonable to me, so we'll likely have to implement them soon. So let's think
> > > about the extensibility now.
> > >
> >
> > My idea is to have these extensions as optional positional arguments
> > (like Wei suggested), so that the interface does not get too
> > complicated for users who don't care about tuning these options. If
> > this is the case then I think there is nothing to worry about.
> > Otherwise, if you think some of these options make sense to be a
> > required argument instead, we can rethink the initial interface.
>
> The interface you're proposing is not really extensible, so we'll likely need to
> introduce a new interface like memory.reclaim_ext very soon. Why not create
> an extensible API from scratch?
>
> I'm looking at cgroup v2 documentation which describes various interface files
> formats and it seems like given the number of potential optional arguments
> the best option is nested keyed (please, refer to the Interface Files section).
>
> E.g. the format can be:
> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
>
> We can say that now we don't support any keyed arguments, but they can be
> added in the future.
>
> Basically you don't even need to change any code, only document the interface
> properly, so we can extend it later without breaking the API.
>

Thanks a lot for suggesting this, it does look much cleaner.

I will make sure to document the interface properly as suggested in v2.

> >
> > > I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> > > In the end, such a feature might make sense on the system level too.
> > > Yes, there is the drop_caches sysctl, but it's too radical for many cases.
> > >
> >
> > I think in the RFC discussion there was consensus to add both a
> > per-memcg knob, as well as per-node / per-system knobs (through sysfs
> > or syscalls) later. Wei also points out that it's not common for a
> > syscall to have a cgroup argument.
>
> Actually there are examples (e.g. sys_bpf), but my only point is to make
> the API extensible, so maybe syscall is not the best idea.
>
> I'd add the root level interface from scratch: the code change is simple
> and it makes sense as a feature. Then likely we don't really need another
> system-level interface at all.
>

I think we would still need a system-level interface anyway for
systems with no memcg that wish to make use of proactive memory
reclaim. We can still make the memory.reclaim interface available for
root as well if you think this is desirable.

> Thanks!

Shakeel Butt April 4, 2022, 5:14 p.m. UTC | #39

On Fri, Apr 1, 2022 at 2:51 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 01, 2022 at 02:21:52PM -0700, Roman Gushchin wrote:
> > > On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> > >> The interface you're proposing is not really extensible, so we'll likely need to
> > >> introduce a new interface like memory.reclaim_ext very soon. Why not create
> > >> an extensible API from scratch?
> > >>
> > >> I'm looking at cgroup v2 documentation which describes various interface files
> > >> formats and it seems like given the number of potential optional arguments
> > >> the best option is nested keyed (please, refer to the Interface Files section).
> > >>
> > >> E.g. the format can be:
> > >> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> > >
> > > Yeah, that syntax looks perfect.
> > >
> > > But why do you think it's not extensible from the current patch? We
> > > can add those arguments one by one as we agree on them, and return
> > > -EINVAL if somebody passes an unknown parameter.
> > >
> > > It seems to me the current proposal is forward-compatible that way
> > > (with the current set of keyword params being the empty set :-))
> >
> > It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)
> >
> > So yeah, let’s just describe it properly in the documentation, no code changes are needed.
>
> Sounds good to me!

To summarize for next version:

1) Add selftests.
2) Add documentation for potential future extensions, so that whoever adds
those features in the future follows the key-value format Roman is
suggesting.

Yosry, once we have agreement on the return value, please send the
next version resolving these two points along with the return value.
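
The forward-compatibility argument above (accept the amount now, reject any
unknown key=value argument with -EINVAL) can be sketched in userspace C. This
is only an illustration under stated assumptions: check_reclaim_request() and
known_keys are hypothetical names, the amount token is not validated, and
nothing here is taken from the actual patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <string.h>

/* Hypothetical list of accepted keys; empty in the first version. */
static const char *const known_keys[] = { NULL };

static int key_is_known(const char *key, size_t len)
{
	for (size_t i = 0; known_keys[i]; i++)
		if (strlen(known_keys[i]) == len &&
		    !strncmp(known_keys[i], key, len))
			return 1;
	return 0;
}

/*
 * Check a request such as "1G" or "1G type=file".
 * The first token is the amount (not validated in this sketch); every
 * later token must be key=value with a known key.
 * Returns 0 if the request is acceptable, -EINVAL otherwise.
 */
int check_reclaim_request(const char *req)
{
	char buf[256];
	char *save = NULL;
	char *tok;

	if (strlen(req) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, req);

	tok = strtok_r(buf, " \n", &save);
	if (!tok)
		return -EINVAL;		/* empty request */
	if (strchr(tok, '='))
		return -EINVAL;		/* the amount must come first */

	while ((tok = strtok_r(NULL, " \n", &save))) {
		const char *eq = strchr(tok, '=');

		if (!eq || eq == tok ||
		    !key_is_known(tok, (size_t)(eq - tok)))
			return -EINVAL;
	}
	return 0;
}
```

With the empty key list, "1G" is accepted and "1G type=file" is rejected;
adding "type" to known_keys later would accept the second form without
changing how the first one is handled.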

Yosry Ahmed April 4, 2022, 5:18 p.m. UTC | #40

On Sun, Apr 3, 2022 at 8:50 PM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>
>
> Apologies for the delayed response,
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Fri, Apr 1, 2022 at 1:39 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
> >>
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >
> >> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >> <snip>
> >>
> >> > +
> >> > +     while (nr_reclaimed < nr_to_reclaim) {
> >> > +             unsigned long reclaimed;
> >> > +
> >> > +             if (signal_pending(current))
> >> > +                     break;
> >> > +
> >> > +             reclaimed = try_to_free_mem_cgroup_pages(memcg,
> >> > +                                             nr_to_reclaim - nr_reclaimed,
> >> > +                                             GFP_KERNEL, true);
> >> > +
> >> > +             if (!reclaimed && !nr_retries--)
> >> > +                     break;
> >> > +
> >> > +             nr_reclaimed += reclaimed;
> >>
> >> I think there should be a cond_resched() in this loop before
> >> try_to_free_mem_cgroup_pages() to have better chances of reclaim
> >> succeeding early.
> >>
> > Thanks for taking the time to look at this!
> >
> > I believe this loop is modeled after the loop in memory_high_write()
> > for the memory.high interface. Is there a reason why it should be
> > needed here but not there?
> >
>
> memory_high_write() calls drain_all_stock() at least once before calling
> try_to_free_mem_cgroup_pages(). This would drain all percpu stocks
> for the given memcg and its descendants, giving
> try_to_free_mem_cgroup_pages() a high chance to succeed quickly. Such
> functionality is missing from this patch.
>
> Adding a cond_resched() would at least give other processes
> within the memcg a chance to run and make forward progress, thereby making
> more pages available for reclaim.
>
> The suggestion is partly based on __perform_reclaim() issuing a
> cond_resched(), as it may get called repeatedly during the direct reclaim path.
>

As Michal pointed out, there is already a call to cond_resched() in
shrink_node_memcgs().

>
> >> <snip>
> >>
> >> --
> >> Cheers
> >> ~ Vaibhav
> >
>
> --
> Cheers
> ~ Vaibhav
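
The retry logic of the loop quoted above can be modelled in plain userspace C
to see when it terminates. This is a rough sketch, not the kernel code:
reclaim_fn and pool_reclaim() are hypothetical stand-ins for
try_to_free_mem_cgroup_pages(), and the signal_pending() check is left out.

```c
#include <stddef.h>

/* Stand-in for one reclaim pass: returns the number of pages it freed. */
typedef unsigned long (*reclaim_fn)(unsigned long nr_wanted, void *ctx);

unsigned long reclaim_loop(unsigned long nr_to_reclaim, int nr_retries,
			   reclaim_fn try_reclaim, void *ctx)
{
	unsigned long nr_reclaimed = 0;

	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		reclaimed = try_reclaim(nr_to_reclaim - nr_reclaimed, ctx);

		/* Give up once a pass frees nothing and the retries are spent. */
		if (!reclaimed && !nr_retries--)
			break;

		nr_reclaimed += reclaimed;
	}
	return nr_reclaimed;
}

/* Example backend: frees at most 32 pages per pass from a finite pool. */
unsigned long pool_reclaim(unsigned long nr_wanted, void *ctx)
{
	unsigned long *pool = ctx;
	unsigned long n = nr_wanted < 32 ? nr_wanted : 32;

	if (n > *pool)
		n = *pool;
	*pool -= n;
	return n;
}
```

Because of the short-circuit in the break condition, the retry counter is
only consumed by passes that free nothing; a pass that makes any progress
does not use up a retry.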

Roman Gushchin April 4, 2022, 5:55 p.m. UTC | #41

On Mon, Apr 04, 2022 at 10:13:03AM -0700, Yosry Ahmed wrote:
> On Fri, Apr 1, 2022 at 11:39 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Fri, Apr 01, 2022 at 02:11:51AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin
> > > <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > > > > From: Shakeel Butt <shakeelb@google.com>
> > > > >
> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> > > > >
> > > > > Use case: Proactive Reclaim
> > > > > ---------------------------
> > > > >
> > > > > A userspace proactive reclaimer can continuously probe the memcg to
> > > > > reclaim a small amount of memory. This gives more accurate and
> > > > > up-to-date workingset estimation as the LRUs are continuously
> > > > > sorted and can potentially provide more deterministic memory
> > > > > overcommit behavior. The memory overcommit controller can provide
> > > > > more proactive response to the changing behavior of the running
> > > > > applications instead of being reactive.
> > > > >
> > > > > A userspace reclaimer's purpose in this case is not a complete replacement
> > > > > for kswapd or direct reclaim, it is to proactively identify memory savings
> > > > > opportunities and reclaim some amount of cold pages set by the policy
> > > > > to free up the memory for more demanding jobs or scheduling new jobs.
> > > > >
> > > > > A user space proactive reclaimer is used in Google data centers.
> > > > > Additionally, Meta's TMO paper recently referenced a very similar
> > > > > interface used for user space proactive reclaim:
> > > > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
> > > > >
> > > > > Benefits of a user space reclaimer:
> > > > > -----------------------------------
> > > > >
> > > > > 1) More flexible on who should be charged for the cpu of the memory
> > > > > reclaim. For proactive reclaim, it makes more sense to be centralized.
> > > > >
> > > > > 2) More flexible on dedicating the resources (like cpu). The memory
> > > > > overcommit controller can balance the cost between the cpu usage and
> > > > > the memory reclaimed.
> > > > >
> > > > > 3) Provides a way to the applications to keep their LRUs sorted, so,
> > > > > under memory pressure better reclaim candidates are selected. This also
> > > > > gives more accurate and uptodate notion of working set for an
> > > > > application.
> > > > >
> > > > > Why memory.high is not enough?
> > > > > ------------------------------
> > > > >
> > > > > - memory.high can be used to trigger reclaim in a memcg and can
> > > > >   potentially be used for proactive reclaim.
> > > > >   However there is a big downside in using memory.high. It can potentially
> > > > >   introduce high reclaim stalls in the target application as the
> > > > >   allocations from the processes or the threads of the application can hit
> > > > >   the temporary memory.high limit.
> > > > >
> > > > > - Userspace proactive reclaimers usually use feedback loops to decide
> > > > >   how much memory to proactively reclaim from a workload. The metrics
> > > > >   used for this are usually either refaults or PSI, and these metrics
> > > > >   will become messy if the application gets throttled by hitting the
> > > > >   high limit.
> > > > >
> > > > > - memory.high is a stateful interface, if the userspace proactive
> > > > >   reclaimer crashes for any reason while triggering reclaim it can leave
> > > > >   the application in a bad state.
> > > > >
> > > > > - If a workload is rapidly expanding, setting memory.high to proactively
> > > > >   reclaim memory can result in actually reclaiming more memory than
> > > > >   intended.
> > > > >
> > > > > The benefits of such interface and shortcomings of existing interface
> > > > > were further discussed in this RFC thread:
> > > > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/
> > > >
> > > > Hello!
> > > >
> > > > I'm totally up for the proposed feature! It makes total sense and is proved
> > > > to be useful, let's add it.
> > > >
> > > > >
> > > > > Interface:
> > > > > ----------
> > > > >
> > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> > > > > trigger reclaim in the target memory cgroup.
> > > > >
> > > > >
> > > > > Possible Extensions:
> > > > > --------------------
> > > > >
> > > > > - This interface can be extended with an additional parameter or flags
> > > > >   to allow specifying one or more types of memory to reclaim from (e.g.
> > > > >   file, anon, ..).
> > > > >
> > > > > - The interface can also be extended with a node mask to reclaim from
> > > > >   specific nodes. This has use cases for reclaim-based demotion in memory
> > > > >   tiering systems.
> > > > >
> > > > > - A similar per-node interface can also be added to support proactive
> > > > >   reclaim and reclaim-based demotion in systems without memcg.
> > > >
> > > > Maybe an option to specify a timeout? That might simplify the userspace part.
> > > > Also, please please add a test to selftests/cgroup/memcg tests.
> > > > It will also provide an example on how the userspace can use the feature.
> > > >
> > >
> > > Hi Roman, thanks for taking the time to review this!
> > >
> > > A timeout can be a good extension, I will add it to the commit message
> > > in the next version in possible extensions.
> > >
> > > I will add a test in v2, thanks!
> >
> > Great, thank you!
> >
> > >
> > > >
> > > > >
> > > > > For now, let's keep things simple by adding the basic functionality.
> > > >
> > > > What I'm worried about is how we gonna extend it? How do you see the interface
> > > > with 2-3 extensions from the list above? All these extensions look very
> > > > reasonable to me, so we'll likely have to implement them soon. So let's think
> > > > about the extensibility now.
> > > >
> > >
> > > My idea is to have these extensions as optional positional arguments
> > > (like Wei suggested), so that the interface does not get too
> > > complicated for users who don't care about tuning these options. If
> > > this is the case then I think there is nothing to worry about.
> > > Otherwise, if you think some of these options make sense to be a
> > > required argument instead, we can rethink the initial interface.
> >
> > The interface you're proposing is not really extensible, so we'll likely need to
> > introduce a new interface like memory.reclaim_ext very soon. Why not create
> > an extensible API from scratch?
> >
> > I'm looking at cgroup v2 documentation which describes various interface files
> > formats and it seems like given the number of potential optional arguments
> > the best option is nested keyed (please, refer to the Interface Files section).
> >
> > E.g. the format can be:
> > echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> >
> > We can say that now we don't support any keyed arguments, but they can be
> > added in the future.
> >
> > Basically you don't even need to change any code, only document the interface
> > properly, so we can extend it later without breaking the API.
> >
> 
> Thanks a lot for suggesting this, it indeed looks very much cleaner.
> 
> I will make sure to document the interface properly as suggested in v2.
> 
> > >
> > > > I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> > > > In the end, such a feature might make sense on the system level too.
> > > > Yes, there is the drop_caches sysctl, but it's too radical for many cases.
> > > >
> > >
> > > I think in the RFC discussion there was consensus to add both a
> > > per-memcg knob, as well as per-node / per-system knobs (through sysfs
> > > or syscalls) later. Wei also points out that it's not common for a
> > > syscall to have a cgroup argument.
> >
> > Actually there are examples (e.g. sys_bpf), but my only point is to make
> > the API extensible, so maybe syscall is not the best idea.
> >
> > I'd add the root level interface from scratch: the code change is simple
> > and it makes sense as a feature. Then likely we don't really need another
> > system-level interface at all.
> >
> 
> I think we would still need a system-level interface anyway for
> systems with no memcg that wish to make use of proactive memory
> reclaim. We can still make the memory.reclaim interface available for
> root as well if you think this is desirable.

Yes, I think it's a good idea. !memcg systems is a different story, we can
handle them separately.

Thanks!

Roman Gushchin April 4, 2022, 6:25 p.m. UTC | #42

On Mon, Apr 04, 2022 at 10:44:04AM +0200, Michal Hocko wrote:
> On Fri 01-04-22 09:58:59, Roman Gushchin wrote:
> > On Fri, Apr 01, 2022 at 03:49:19PM +0200, Michal Hocko wrote:
> > > On Thu 31-03-22 10:25:23, Roman Gushchin wrote:
> > > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > > [...]
> > > > > - A similar per-node interface can also be added to support proactive
> > > > >   reclaim and reclaim-based demotion in systems without memcg.
> > > > 
> > > > Maybe an option to specify a timeout? That might simplify the userspace part.
> > > 
> > > What do you mean by timeout here? Isn't
> > > timeout $N echo $RECLAIM > ....
> > > 
> > > enough?
> > 
> > It's nice and simple when it's a bash script, but when it's a complex
> > application trying to do the same, it quickly becomes less simple and
> > likely will require a dedicated thread to avoid blocking the main app
> > for too long and a mechanism to unblock it by timer/when the need arises.
> > 
> > > In my experience, using such semi-blocking interfaces correctly (semi- because
> > > it's not clearly defined how much time the syscall can take and whether it
> > > makes sense to wait longer) is tricky.
> 
> We have the same approach to setting other limits which need to perform
> the reclaim. Have we ever hit that as a limitation that would make
> userspace unnecessarily too complex?

The difference here is that some limits are most likely set once and
never adjusted, e.g. memory.max or memory.low.
I do definitely remember some issues around memory.high, but as I recall,
we've fixed them on the kernel side. We've even had a private memory.high.tmp
interface with a value and a timeout, which later was replaced with
a memory.reclaim interface similar to what we discuss here.
But with memory.high we set the limit first, so if a user tries to reclaim
a lot of hot memory, it will soon put all processes in the cgroup into
sleep/direct reclaim. So it's not expected to block for too long.

In general, it all comes down to the question of how hard the kernel should
try to reclaim the memory before giving up. The userspace might have different
needs in different cases. But if the interface is defined very vaguely, e.g.
it tries for an undefined amount of time and then gives up, it's hard to
use it in a predictable manner.

Thanks!
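
For reference, the "timeout $N echo $RECLAIM > ..." approach discussed above
could look roughly like this from a C program. It is a minimal sketch under
assumptions: write_with_timeout() is a hypothetical helper, it relies on the
kernel side breaking out when a signal is pending (as the quoted loop does),
and what errno memory.reclaim would report in that case is not settled in
this thread. There is also a small window in which the alarm could fire
before write() is entered, which this sketch does not handle.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig)
{
	(void)sig;	/* the handler exists only to interrupt write() */
}

/*
 * Write len bytes from buf to fd, giving up after timeout_sec seconds.
 * Returns whatever write() returned. For a write that blocks without
 * accepting any byte, a timeout shows up as -1 with errno == EINTR;
 * if some bytes were accepted first, a short count is returned.
 */
ssize_t write_with_timeout(int fd, const void *buf, size_t len,
			   unsigned int timeout_sec)
{
	struct sigaction sa, old;
	ssize_t ret;
	int saved_errno;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;	/* no SA_RESTART: write() is not restarted */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, &old) < 0)
		return -1;

	alarm(timeout_sec);
	ret = write(fd, buf, len);
	saved_errno = errno;
	alarm(0);			/* cancel a timer that has not fired yet */
	sigaction(SIGALRM, &old, NULL);

	errno = saved_errno;
	return ret;
}
```

This keeps the caller single-threaded, but it takes over SIGALRM for the
duration of the call, which is part of the complexity argument made above
for applications that are more than a shell script.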

Wei Xu April 5, 2022, 2:30 a.m. UTC | #43

On Mon, Apr 4, 2022 at 10:08 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Apr 1, 2022 at 1:14 PM Wei Xu <weixugc@google.com> wrote:
> >
> [...]
> >
> > -EAGAIN sounds good, too.  Given that the userspace requests to
> > reclaim a specified number of bytes, I think it is generally better to
> > tell the userspace whether the request has been successfully
> > fulfilled. Ideally, it would be even better to return how many bytes
> > that have been reclaimed, though that is not easy to do through the
> > cgroup interface.
>
> What would be the challenge on returning the number of bytes reclaimed
> through cgroup interface?

The write() syscall is used to write the command into memory.reclaim,
which should return either the number of command bytes written or -1
(errno is set to indicate the actual error).  I think we should not
return the number of bytes reclaimed through write().  A new
sys_reclaim() is better in this regard because we can define its
return value, though it would need a cgroup argument, which is not
commonly defined for syscalls.
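
To make the point about write() concrete, here is a minimal sketch of what a
userspace caller sees. request_reclaim() is a hypothetical helper; the path
is whatever memory.reclaim file the caller chooses, and the sketch treats a
short write as an error purely for simplicity.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a request such as "10M" to the given control file.
 * Returns 0 if write() accepted the whole string, or a negative errno.
 * Nothing in this path can report how many bytes were reclaimed:
 * write() only tells us how many bytes of the command were consumed.
 */
int request_reclaim(const char *path, const char *request)
{
	size_t len = strlen(request);
	ssize_t written;
	int fd, err = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, request, len);
	if (written < 0)
		err = -errno;
	else if ((size_t)written != len)
		err = -EIO;	/* short write: treated as failure in this sketch */

	close(fd);
	return err;
}
```

The only channel back to the caller is success or an errno value, which is
why the discussion above is about picking an error code rather than about
returning a byte count.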

Michal Koutný April 5, 2022, 10:09 a.m. UTC | #44

On Mon, Apr 04, 2022 at 10:08:43AM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> What would be the challenge on returning the number of bytes reclaimed
> through cgroup interface?

You'd need an object that represents the write size:

> bfd = open("/sys/kernel/mm/reclaim/balloon", O_RDWR);
> dprintf(bfd, "type=file nodemask=1-2 timeout=30\n");
> 
> fd = open("/sys/kernel/fs/cgroup/foo/memory.reclaim", O_WRONLY);
> reclaimed = splice(bfd, NULL, fd, NULL, reclaim_size);

(I'm joking with this API, but the concept is similar.)

Michal

Huang, Ying April 6, 2022, 12:48 a.m. UTC | #45

Wei Xu <weixugc@google.com> writes:

> On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Wei Xu <weixugc@google.com> writes:
>>
>> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>> >>
>> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
>> >> > From: Shakeel Butt <shakeelb@google.com>
>> >> >
>>
>> [snip]
>>
>> >> > Possible Extensions:
>> >> > --------------------
>> >> >
>> >> > - This interface can be extended with an additional parameter or flags
>> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
>> >> >   file, anon, ..).
>> >> >
>> >> > - The interface can also be extended with a node mask to reclaim from
>> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
>> >> >   tiering systems.
>> >> >
>> >> > - A similar per-node interface can also be added to support proactive
>> >> >   reclaim and reclaim-based demotion in systems without memcg.
>> >> >
>> >> > For now, let's keep things simple by adding the basic functionality.
>> >>
>> >> Yes, I am for the simplicity and this really looks like a bare minimum
>> >> interface. But it is not really clear who do you want to add flags on
>> >> top of it?
>> >>
>> >> I am not really sure we really need a node aware interface for memcg.
>> >> The global reclaim interface will likely need a different node because
>> >> we do not want to make this CONFIG_MEMCG constrained.
>> >
>> > A nodemask argument for memory.reclaim can be useful for memory
>> > tiering between NUMA nodes with different performance.  Similar to
>> > proactive reclaim, it can allow a userspace daemon to drive
>> > memcg-based proactive demotion via the reclaim-based demotion
>> > mechanism in the kernel.
>>
>> I am not sure whether nodemask is a good way for demoting pages between
>> different types of memory.  For example, for a system with DRAM and
>> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
>> is the meaning of specifying PMEM node? reclaiming to disk?
>>
>> In general, I have no objection to the idea.  But we should
>> have a clear and consistent interface.  Per my understanding the default
>> memcg interface is for memory, regardless of memory types.  Memory
>> reclaiming means reducing the memory usage, regardless of memory types.
>> We need to either extend the semantics of memory reclaiming (to
>> include memory demoting too), or add another interface for memory
>> demoting.
>
> Good point.  With the "demote pages during reclaim" patch series,
> reclaim is already extended to demote pages as well.  For example,
> can_reclaim_anon_pages() returns true if demotion is allowed and
> shrink_page_list() can demote pages instead of reclaiming pages.

These are in-kernel implementation details, not the ABI.  So we still have
the opportunity to define the ABI now.

> Currently, demotion is disabled for memcg reclaim, which I think can
> be relaxed and also necessary for memcg-based proactive demotion.  I'd
> like to suggest that we extend the semantics of memory.reclaim to
> cover memory demotion as well.  A flag can be used to enable/disable
> the demotion behavior.

If so,

# echo A > memory.reclaim

means

a) "A" bytes of memory are freed from the memcg, regardless of whether
   demotion is used.

or

b) "A" bytes of memory are reclaimed from the memcg; some of them may be
   freed, and some may merely be demoted from DRAM to PMEM.  The total
   is "A".

For me, a) looks more reasonable.
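
To make the difference between a) and b) concrete, here is a small
userspace model, not kernel code.  The struct and function names are
hypothetical and exist only for this illustration; the sketch assumes
that each reclaim pass reports how many bytes it freed and how many it
demoted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one reclaim pass over a memcg (illustration only). */
struct pass_result {
	size_t freed;   /* bytes that left memory: dropped or swapped out */
	size_t demoted; /* bytes moved from DRAM to PMEM, still in memory */
};

/*
 * Return how many passes it takes to satisfy a request of `request` bytes,
 * or 0 if the passes are exhausted first.  With count_demoted == false this
 * models semantics a); with count_demoted == true it models semantics b).
 */
static size_t passes_needed(const struct pass_result *passes, size_t n,
			    size_t request, bool count_demoted)
{
	size_t done = 0;

	assert(passes != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		done += passes[i].freed;
		if (count_demoted)
			done += passes[i].demoted;
		if (done >= request)
			return i + 1;
	}
	return 0;
}
```

Under a), a pass that mostly demotes pages makes little progress toward
"A", so the same request can take more passes than it would under b).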

Best Regards,
Huang, Ying

Wei Xu April 6, 2022, 1:07 a.m. UTC | #46

On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Wei Xu <weixugc@google.com> writes:
> >>
> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >>
> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> >> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >> >
> >>
> >> [snip]
> >>
> >> >> > Possible Extensions:
> >> >> > --------------------
> >> >> >
> >> >> > - This interface can be extended with an additional parameter or flags
> >> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >> >> >   file, anon, ..).
> >> >> >
> >> >> > - The interface can also be extended with a node mask to reclaim from
> >> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >> >> >   tiering systens.
> >> >> >
> >> >> > - A similar per-node interface can also be added to support proactive
> >> >> >   reclaim and reclaim-based demotion in systems without memcg.
> >> >> >
> >> >> > For now, let's keep things simple by adding the basic functionality.
> >> >>
> >> >> Yes, I am for the simplicity and this really looks like a bare minumum
> >> >> interface. But it is not really clear who do you want to add flags on
> >> >> top of it?
> >> >>
> >> >> I am not really sure we really need a node aware interface for memcg.
> >> >> The global reclaim interface will likely need a different node because
> >> >> we do not want to make this CONFIG_MEMCG constrained.
> >> >
> >> > A nodemask argument for memory.reclaim can be useful for memory
> >> > tiering between NUMA nodes with different performance.  Similar to
> >> > proactive reclaim, it can allow a userspace daemon to drive
> >> > memcg-based proactive demotion via the reclaim-based demotion
> >> > mechanism in the kernel.
> >>
> >> I am not sure whether nodemask is a good way for demoting pages between
> >> different types of memory.  For example, for a system with DRAM and
> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
> >> is the meaning of specifying PMEM node? reclaiming to disk?
> >>
> >> In general, I have no objection to the idea in general.  But we should
> >> have a clear and consistent interface.  Per my understanding the default
> >> memcg interface is for memory, regardless of memory types.  The memory
> >> reclaiming means reduce the memory usage, regardless of memory types.
> >> We need to either extending the semantics of memory reclaiming (to
> >> include memory demoting too), or add another interface for memory
> >> demoting.
> >
> > Good point.  With the "demote pages during reclaim" patch series,
> > reclaim is already extended to demote pages as well.  For example,
> > can_reclaim_anon_pages() returns true if demotion is allowed and
> > shrink_page_list() can demote pages instead of reclaiming pages.
>
> These are in-kernel implementation, not the ABI.  So we still have
> the opportunity to define the ABI now.
>
> > Currently, demotion is disabled for memcg reclaim, which I think can
> > be relaxed and also necessary for memcg-based proactive demotion.  I'd
> > like to suggest that we extend the semantics of memory.reclaim to
> > cover memory demotion as well.  A flag can be used to enable/disable
> > the demotion behavior.
>
> If so,
>
> # echo A > memory.reclaim
>
> means
>
> a) "A" bytes memory are freed from the memcg, regardless demoting is
>    used or not.
>
> or
>
> b) "A" bytes memory are reclaimed from the memcg, some of them may be
>    freed, some of them may be just demoted from DRAM to PMEM.  The total
>    number is "A".
>
> For me, a) looks more reasonable.
>

We can use a DEMOTE flag to control the demotion behavior for
memory.reclaim.  If the flag is not set (the default), then the
no_demotion field of scan_control can be set to 1, similar to
reclaim_pages().
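
As a rough userspace sketch of the flag idea (not the kernel
implementation): the input syntax "<bytes> [demote]", the struct, and
the function name below are all assumptions made up for illustration.
Nothing in this thread has settled on a syntax.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of a write to memory.reclaim. */
struct reclaim_request {
	unsigned long long nr_bytes;
	bool no_demotion; /* mirrors the intent of scan_control's no_demotion */
};

/* Parse "<bytes>" or "<bytes> demote"; return 0 on success, -EINVAL otherwise. */
static int parse_reclaim_request(const char *buf, struct reclaim_request *req)
{
	char *end;

	assert(buf != NULL && req != NULL);
	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;

	errno = 0;
	req->nr_bytes = strtoull(buf, &end, 10);
	if (errno == ERANGE)
		return -EINVAL;

	req->no_demotion = true; /* default: flag not set, no demotion */
	while (*end == ' ')
		end++;
	if (*end == '\0' || *end == '\n')
		return 0;

	if (strncmp(end, "demote", 6) == 0 &&
	    (end[6] == '\0' || end[6] == '\n')) {
		req->no_demotion = false;
		return 0;
	}
	return -EINVAL;
}
```

With this sketch, a plain byte count keeps demotion off, and only an
explicit "demote" keyword turns it on.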

The question is then whether we want to rename memory.reclaim to
something more general.  I think this name is fine if reclaim-based
demotion is an accepted concept.

Huang, Ying April 6, 2022, 2:49 a.m. UTC | #47

Wei Xu <weixugc@google.com> writes:

> On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Wei Xu <weixugc@google.com> writes:
>>
>> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Wei Xu <weixugc@google.com> writes:
>> >>
>> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> >>
>> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
>> >> >> > From: Shakeel Butt <shakeelb@google.com>
>> >> >> >
>> >>
>> >> [snip]
>> >>
>> >> >> > Possible Extensions:
>> >> >> > --------------------
>> >> >> >
>> >> >> > - This interface can be extended with an additional parameter or flags
>> >> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
>> >> >> >   file, anon, ..).
>> >> >> >
>> >> >> > - The interface can also be extended with a node mask to reclaim from
>> >> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
>> >> >> >   tiering systens.
>> >> >> >
>> >> >> > - A similar per-node interface can also be added to support proactive
>> >> >> >   reclaim and reclaim-based demotion in systems without memcg.
>> >> >> >
>> >> >> > For now, let's keep things simple by adding the basic functionality.
>> >> >>
>> >> >> Yes, I am for the simplicity and this really looks like a bare minumum
>> >> >> interface. But it is not really clear who do you want to add flags on
>> >> >> top of it?
>> >> >>
>> >> >> I am not really sure we really need a node aware interface for memcg.
>> >> >> The global reclaim interface will likely need a different node because
>> >> >> we do not want to make this CONFIG_MEMCG constrained.
>> >> >
>> >> > A nodemask argument for memory.reclaim can be useful for memory
>> >> > tiering between NUMA nodes with different performance.  Similar to
>> >> > proactive reclaim, it can allow a userspace daemon to drive
>> >> > memcg-based proactive demotion via the reclaim-based demotion
>> >> > mechanism in the kernel.
>> >>
>> >> I am not sure whether nodemask is a good way for demoting pages between
>> >> different types of memory.  For example, for a system with DRAM and
>> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
>> >> is the meaning of specifying PMEM node? reclaiming to disk?
>> >>
>> >> In general, I have no objection to the idea in general.  But we should
>> >> have a clear and consistent interface.  Per my understanding the default
>> >> memcg interface is for memory, regardless of memory types.  The memory
>> >> reclaiming means reduce the memory usage, regardless of memory types.
>> >> We need to either extending the semantics of memory reclaiming (to
>> >> include memory demoting too), or add another interface for memory
>> >> demoting.
>> >
>> > Good point.  With the "demote pages during reclaim" patch series,
>> > reclaim is already extended to demote pages as well.  For example,
>> > can_reclaim_anon_pages() returns true if demotion is allowed and
>> > shrink_page_list() can demote pages instead of reclaiming pages.
>>
>> These are in-kernel implementation, not the ABI.  So we still have
>> the opportunity to define the ABI now.
>>
>> > Currently, demotion is disabled for memcg reclaim, which I think can
>> > be relaxed and also necessary for memcg-based proactive demotion.  I'd
>> > like to suggest that we extend the semantics of memory.reclaim to
>> > cover memory demotion as well.  A flag can be used to enable/disable
>> > the demotion behavior.
>>
>> If so,
>>
>> # echo A > memory.reclaim
>>
>> means
>>
>> a) "A" bytes memory are freed from the memcg, regardless demoting is
>>    used or not.
>>
>> or
>>
>> b) "A" bytes memory are reclaimed from the memcg, some of them may be
>>    freed, some of them may be just demoted from DRAM to PMEM.  The total
>>    number is "A".
>>
>> For me, a) looks more reasonable.
>>
>
> We can use a DEMOTE flag to control the demotion behavior for
> memory.reclaim.  If the flag is not set (the default), then
> no_demotion of scan_control can be set to 1, similar to
> reclaim_pages().

If we have to use a flag to control the behavior, I think it's better to
have a separate interface (e.g. memory.demote).  But do we really need b)?

> The question is then whether we want to rename memory.reclaim to
> something more general.  I think this name is fine if reclaim-based
> demotion is an accepted concept.

Best Regards,
Huang, Ying

Wei Xu April 6, 2022, 5:02 a.m. UTC | #48

On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Wei Xu <weixugc@google.com> writes:
> >>
> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Wei Xu <weixugc@google.com> writes:
> >> >>
> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >> >>
> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> >> >> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >> >> >
> >> >>
> >> >> [snip]
> >> >>
> >> >> >> > Possible Extensions:
> >> >> >> > --------------------
> >> >> >> >
> >> >> >> > - This interface can be extended with an additional parameter or flags
> >> >> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >> >> >> >   file, anon, ..).
> >> >> >> >
> >> >> >> > - The interface can also be extended with a node mask to reclaim from
> >> >> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >> >> >> >   tiering systens.
> >> >> >> >
> >> >> >> > - A similar per-node interface can also be added to support proactive
> >> >> >> >   reclaim and reclaim-based demotion in systems without memcg.
> >> >> >> >
> >> >> >> > For now, let's keep things simple by adding the basic functionality.
> >> >> >>
> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum
> >> >> >> interface. But it is not really clear who do you want to add flags on
> >> >> >> top of it?
> >> >> >>
> >> >> >> I am not really sure we really need a node aware interface for memcg.
> >> >> >> The global reclaim interface will likely need a different node because
> >> >> >> we do not want to make this CONFIG_MEMCG constrained.
> >> >> >
> >> >> > A nodemask argument for memory.reclaim can be useful for memory
> >> >> > tiering between NUMA nodes with different performance.  Similar to
> >> >> > proactive reclaim, it can allow a userspace daemon to drive
> >> >> > memcg-based proactive demotion via the reclaim-based demotion
> >> >> > mechanism in the kernel.
> >> >>
> >> >> I am not sure whether nodemask is a good way for demoting pages between
> >> >> different types of memory.  For example, for a system with DRAM and
> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
> >> >> is the meaning of specifying PMEM node? reclaiming to disk?
> >> >>
> >> >> In general, I have no objection to the idea in general.  But we should
> >> >> have a clear and consistent interface.  Per my understanding the default
> >> >> memcg interface is for memory, regardless of memory types.  The memory
> >> >> reclaiming means reduce the memory usage, regardless of memory types.
> >> >> We need to either extending the semantics of memory reclaiming (to
> >> >> include memory demoting too), or add another interface for memory
> >> >> demoting.
> >> >
> >> > Good point.  With the "demote pages during reclaim" patch series,
> >> > reclaim is already extended to demote pages as well.  For example,
> >> > can_reclaim_anon_pages() returns true if demotion is allowed and
> >> > shrink_page_list() can demote pages instead of reclaiming pages.
> >>
> >> These are in-kernel implementation, not the ABI.  So we still have
> >> the opportunity to define the ABI now.
> >>
> >> > Currently, demotion is disabled for memcg reclaim, which I think can
> >> > be relaxed and also necessary for memcg-based proactive demotion.  I'd
> >> > like to suggest that we extend the semantics of memory.reclaim to
> >> > cover memory demotion as well.  A flag can be used to enable/disable
> >> > the demotion behavior.
> >>
> >> If so,
> >>
> >> # echo A > memory.reclaim
> >>
> >> means
> >>
> >> a) "A" bytes memory are freed from the memcg, regardless demoting is
> >>    used or not.
> >>
> >> or
> >>
> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be
> >>    freed, some of them may be just demoted from DRAM to PMEM.  The total
> >>    number is "A".
> >>
> >> For me, a) looks more reasonable.
> >>
> >
> > We can use a DEMOTE flag to control the demotion behavior for
> > memory.reclaim.  If the flag is not set (the default), then
> > no_demotion of scan_control can be set to 1, similar to
> > reclaim_pages().
>
> If we have to use a flag to control the behavior, I think it's better to
> have a separate interface (e.g. memory.demote).  But do we really need b)?
>

I am fine with either approach: a separate interface similar to
memory.reclaim, but dedicated to demotion, or multiplexing
memory.reclaim for demotion with a flag.

My understanding is that with the "demote pages during reclaim"
support, b) is the expected behavior, or more precisely, pages that
cannot be demoted may be freed or swapped out.  This is reasonable.
Demotion-only can also be supported via some arguments to the
interface and changes to demotion code in the kernel.  After all, this
interface is being designed to be extensible based on the discussions
so far.

> > The question is then whether we want to rename memory.reclaim to
> > something more general.  I think this name is fine if reclaim-based
> > demotion is an accepted concept.
>
> Best Regards,
> Huang, Ying

Huang, Ying April 6, 2022, 6:32 a.m. UTC | #49

Wei Xu <weixugc@google.com> writes:

> On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Wei Xu <weixugc@google.com> writes:
>>
>> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Wei Xu <weixugc@google.com> writes:
>> >>
>> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Wei Xu <weixugc@google.com> writes:
>> >> >>
>> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> >> >>
>> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
>> >> >> >> > From: Shakeel Butt <shakeelb@google.com>
>> >> >> >> >
>> >> >>
>> >> >> [snip]
>> >> >>
>> >> >> >> > Possible Extensions:
>> >> >> >> > --------------------
>> >> >> >> >
>> >> >> >> > - This interface can be extended with an additional parameter or flags
>> >> >> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
>> >> >> >> >   file, anon, ..).
>> >> >> >> >
>> >> >> >> > - The interface can also be extended with a node mask to reclaim from
>> >> >> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
>> >> >> >> >   tiering systens.
>> >> >> >> >
>> >> >> >> > - A similar per-node interface can also be added to support proactive
>> >> >> >> >   reclaim and reclaim-based demotion in systems without memcg.
>> >> >> >> >
>> >> >> >> > For now, let's keep things simple by adding the basic functionality.
>> >> >> >>
>> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum
>> >> >> >> interface. But it is not really clear who do you want to add flags on
>> >> >> >> top of it?
>> >> >> >>
>> >> >> >> I am not really sure we really need a node aware interface for memcg.
>> >> >> >> The global reclaim interface will likely need a different node because
>> >> >> >> we do not want to make this CONFIG_MEMCG constrained.
>> >> >> >
>> >> >> > A nodemask argument for memory.reclaim can be useful for memory
>> >> >> > tiering between NUMA nodes with different performance.  Similar to
>> >> >> > proactive reclaim, it can allow a userspace daemon to drive
>> >> >> > memcg-based proactive demotion via the reclaim-based demotion
>> >> >> > mechanism in the kernel.
>> >> >>
>> >> >> I am not sure whether nodemask is a good way for demoting pages between
>> >> >> different types of memory.  For example, for a system with DRAM and
>> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
>> >> >> is the meaning of specifying PMEM node? reclaiming to disk?
>> >> >>
>> >> >> In general, I have no objection to the idea in general.  But we should
>> >> >> have a clear and consistent interface.  Per my understanding the default
>> >> >> memcg interface is for memory, regardless of memory types.  The memory
>> >> >> reclaiming means reduce the memory usage, regardless of memory types.
>> >> >> We need to either extending the semantics of memory reclaiming (to
>> >> >> include memory demoting too), or add another interface for memory
>> >> >> demoting.
>> >> >
>> >> > Good point.  With the "demote pages during reclaim" patch series,
>> >> > reclaim is already extended to demote pages as well.  For example,
>> >> > can_reclaim_anon_pages() returns true if demotion is allowed and
>> >> > shrink_page_list() can demote pages instead of reclaiming pages.
>> >>
>> >> These are in-kernel implementation, not the ABI.  So we still have
>> >> the opportunity to define the ABI now.
>> >>
>> >> > Currently, demotion is disabled for memcg reclaim, which I think can
>> >> > be relaxed and also necessary for memcg-based proactive demotion.  I'd
>> >> > like to suggest that we extend the semantics of memory.reclaim to
>> >> > cover memory demotion as well.  A flag can be used to enable/disable
>> >> > the demotion behavior.
>> >>
>> >> If so,
>> >>
>> >> # echo A > memory.reclaim
>> >>
>> >> means
>> >>
>> >> a) "A" bytes memory are freed from the memcg, regardless demoting is
>> >>    used or not.
>> >>
>> >> or
>> >>
>> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be
>> >>    freed, some of them may be just demoted from DRAM to PMEM.  The total
>> >>    number is "A".
>> >>
>> >> For me, a) looks more reasonable.
>> >>
>> >
>> > We can use a DEMOTE flag to control the demotion behavior for
>> > memory.reclaim.  If the flag is not set (the default), then
>> > no_demotion of scan_control can be set to 1, similar to
>> > reclaim_pages().
>>
>> If we have to use a flag to control the behavior, I think it's better to
>> have a separate interface (e.g. memory.demote).  But do we really need b)?
>>
>
> I am fine with either approach: a separate interface similar to
> memory.reclaim, but dedicated to demotion, or multiplexing
> memory.reclaim for demotion with a flag.
>
> My understanding is that with the "demote pages during reclaim"
> support, b) is the expected behavior, or more precisely, pages that
> cannot be demoted may be freed or swapped out.  This is reasonable.
> Demotion-only can also be supported via some arguments to the
> interface and changes to demotion code in the kernel.  After all, this
> interface is being designed to be extensible based on the discussions
> so far.

I think we should define the interface not from the current
implementation point of view, but from the requirement point of view.
For proactive reclaim, per my understanding, the requirement is,

  We found that there are some cold pages in some workloads, so we can
  use proactive reclaim to reclaim some of them so that other workloads
  can use the freed memory.

For proactive demotion, per my understanding, the requirement could be,

  We found that there are some cold pages in fast memory (e.g. DRAM) in
  some workloads, so we can use proactive demotion to demote some of
  them so that other workloads can use the freed fast memory.  This is
  relevant given the DRAM partition support Tim (Cc'ed) is working on.

Why do we need something in the middle?

Best Regards,
Huang, Ying

>> > The question is then whether we want to rename memory.reclaim to
>> > something more general.  I think this name is fine if reclaim-based
>> > demotion is an accepted concept.
>>
>> Best Regards,
>> Huang, Ying

Wei Xu April 6, 2022, 7:05 a.m. UTC | #50

On Tue, Apr 5, 2022 at 11:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Wei Xu <weixugc@google.com> writes:
> >>
> >> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Wei Xu <weixugc@google.com> writes:
> >> >>
> >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Wei Xu <weixugc@google.com> writes:
> >> >> >>
> >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >> >> >>
> >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> >> >> >> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >> >> >> >
> >> >> >>
> >> >> >> [snip]
> >> >> >>
> >> >> >> >> > Possible Extensions:
> >> >> >> >> > --------------------
> >> >> >> >> >
> >> >> >> >> > - This interface can be extended with an additional parameter or flags
> >> >> >> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >> >> >> >> >   file, anon, ..).
> >> >> >> >> >
> >> >> >> >> > - The interface can also be extended with a node mask to reclaim from
> >> >> >> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >> >> >> >> >   tiering systens.
> >> >> >> >> >
> >> >> >> >> > - A similar per-node interface can also be added to support proactive
> >> >> >> >> >   reclaim and reclaim-based demotion in systems without memcg.
> >> >> >> >> >
> >> >> >> >> > For now, let's keep things simple by adding the basic functionality.
> >> >> >> >>
> >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum
> >> >> >> >> interface. But it is not really clear who do you want to add flags on
> >> >> >> >> top of it?
> >> >> >> >>
> >> >> >> >> I am not really sure we really need a node aware interface for memcg.
> >> >> >> >> The global reclaim interface will likely need a different node because
> >> >> >> >> we do not want to make this CONFIG_MEMCG constrained.
> >> >> >> >
> >> >> >> > A nodemask argument for memory.reclaim can be useful for memory
> >> >> >> > tiering between NUMA nodes with different performance.  Similar to
> >> >> >> > proactive reclaim, it can allow a userspace daemon to drive
> >> >> >> > memcg-based proactive demotion via the reclaim-based demotion
> >> >> >> > mechanism in the kernel.
> >> >> >>
> >> >> >> I am not sure whether nodemask is a good way for demoting pages between
> >> >> >> different types of memory.  For example, for a system with DRAM and
> >> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
> >> >> >> is the meaning of specifying PMEM node? reclaiming to disk?
> >> >> >>
> >> >> >> In general, I have no objection to the idea in general.  But we should
> >> >> >> have a clear and consistent interface.  Per my understanding the default
> >> >> >> memcg interface is for memory, regardless of memory types.  The memory
> >> >> >> reclaiming means reduce the memory usage, regardless of memory types.
> >> >> >> We need to either extending the semantics of memory reclaiming (to
> >> >> >> include memory demoting too), or add another interface for memory
> >> >> >> demoting.
> >> >> >
> >> >> > Good point.  With the "demote pages during reclaim" patch series,
> >> >> > reclaim is already extended to demote pages as well.  For example,
> >> >> > can_reclaim_anon_pages() returns true if demotion is allowed and
> >> >> > shrink_page_list() can demote pages instead of reclaiming pages.
> >> >>
> >> >> These are in-kernel implementation, not the ABI.  So we still have
> >> >> the opportunity to define the ABI now.
> >> >>
> >> >> > Currently, demotion is disabled for memcg reclaim, which I think can
> >> >> > be relaxed and also necessary for memcg-based proactive demotion.  I'd
> >> >> > like to suggest that we extend the semantics of memory.reclaim to
> >> >> > cover memory demotion as well.  A flag can be used to enable/disable
> >> >> > the demotion behavior.
> >> >>
> >> >> If so,
> >> >>
> >> >> # echo A > memory.reclaim
> >> >>
> >> >> means
> >> >>
> >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is
> >> >>    used or not.
> >> >>
> >> >> or
> >> >>
> >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be
> >> >>    freed, some of them may be just demoted from DRAM to PMEM.  The total
> >> >>    number is "A".
> >> >>
> >> >> For me, a) looks more reasonable.
> >> >>
> >> >
> >> > We can use a DEMOTE flag to control the demotion behavior for
> >> > memory.reclaim.  If the flag is not set (the default), then
> >> > no_demotion of scan_control can be set to 1, similar to
> >> > reclaim_pages().
> >>
> >> If we have to use a flag to control the behavior, I think it's better to
> >> have a separate interface (e.g. memory.demote).  But do we really need b)?
> >>
> >
> > I am fine with either approach: a separate interface similar to
> > memory.reclaim, but dedicated to demotion, or multiplexing
> > memory.reclaim for demotion with a flag.
> >
> > My understanding is that with the "demote pages during reclaim"
> > support, b) is the expected behavior, or more precisely, pages that
> > cannot be demoted may be freed or swapped out.  This is reasonable.
> > Demotion-only can also be supported via some arguments to the
> > interface and changes to demotion code in the kernel.  After all, this
> > interface is being designed to be extensible based on the discussions
> > so far.
>
> I think we should define the interface not from the current
> implementation point of view, but from the requirement point of view.
> For proactive reclaim, per my understanding, the requirement is,
>
>   we found that there's some cold pages in some workloads, so we can
>   take advantage of the proactive reclaim to reclaim some pages so that
>   other workload can use the freed memory.
>
> For proactive demotion, per my understanding, the requirement could be,
>
>   We found that there's some cold pages in fast memory (e.g. DRAM) in
>   some workloads, so we can take advantage of the proactive demotion to
>   demote some pages so that other workload can use the freed fast
>   memory.  Given the DRAM partition support Tim (Cced) is working on.
>
> Why do we need something in the middle?

Maybe there is some misunderstanding.  As you said, demotion is to
free up fast memory.  If pages in fast memory cannot be demoted but
can still be reclaimed, reclaiming them is useful too, since it also
frees fast memory.  Certainly, we can also add support for a policy
that only demotes, and never reclaims, from fast memory in such cases.

In any case, we will not reclaim from slow memory for demotion, if
that is the middle thing you refer to.  This is why nodemask is
proposed for memory.reclaim to support the demotion use case.  With a
separate memory.demote interface and memory tiering topology among
NUMA nodes being well defined by the kernel and shared with the
userspace, we can omit the nodemask argument.
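
For illustration only, here is how a userspace daemon might derive such
a nodemask from a tiering topology.  The demote_to[] table and the
function name are hypothetical; how the kernel would actually export
the topology is not defined in this thread.  The sketch assumes at most
64 nodes so that the mask fits in one 64-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_DEMOTION_TARGET (-1)

/*
 * demote_to[i] is the node that pages on node i would be demoted to, or
 * NO_DEMOTION_TARGET if node i is already in the slowest tier.
 * Returns a bitmask with one bit set per node that can be a demotion source.
 */
static uint64_t demotion_source_mask(const int *demote_to, size_t nr_nodes)
{
	uint64_t mask = 0;

	assert(demote_to != NULL && nr_nodes <= 64);
	for (size_t node = 0; node < nr_nodes; node++) {
		if (demote_to[node] != NO_DEMOTION_TARGET)
			mask |= UINT64_C(1) << node;
	}
	return mask;
}
```

Slow-tier nodes never appear in the mask, which matches the point that
slow memory is not reclaimed for the purpose of demotion.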

> Best Regards,
> Huang, Ying
>
> >> > The question is then whether we want to rename memory.reclaim to
> >> > something more general.  I think this name is fine if reclaim-based
> >> > demotion is an accepted concept.
> >>
> >> Best Regards,
> >> Huang, Ying

Huang, Ying April 6, 2022, 8:49 a.m. UTC | #51

Wei Xu <weixugc@google.com> writes:

> On Tue, Apr 5, 2022 at 11:32 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Wei Xu <weixugc@google.com> writes:
>>
>> > On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Wei Xu <weixugc@google.com> writes:
>> >>
>> >> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >>
>> >> >> Wei Xu <weixugc@google.com> writes:
>> >> >>
>> >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> >>
>> >> >> >> Wei Xu <weixugc@google.com> writes:
>> >> >> >>
>> >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>> >> >> >> >>
>> >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
>> >> >> >> >> > From: Shakeel Butt <shakeelb@google.com>
>> >> >> >> >> >
>> >> >> >>
>> >> >> >> [snip]
>> >> >> >>
>> >> >> >> >> > Possible Extensions:
>> >> >> >> >> > --------------------
>> >> >> >> >> >
>> >> >> >> >> > - This interface can be extended with an additional parameter or flags
>> >> >> >> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
>> >> >> >> >> >   file, anon, ..).
>> >> >> >> >> >
>> >> >> >> >> > - The interface can also be extended with a node mask to reclaim from
>> >> >> >> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
>> >> >> >> >> >   tiering systens.
>> >> >> >> >> >
>> >> >> >> >> > - A similar per-node interface can also be added to support proactive
>> >> >> >> >> >   reclaim and reclaim-based demotion in systems without memcg.
>> >> >> >> >> >
>> >> >> >> >> > For now, let's keep things simple by adding the basic functionality.
>> >> >> >> >>
>> >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum
>> >> >> >> >> interface. But it is not really clear who do you want to add flags on
>> >> >> >> >> top of it?
>> >> >> >> >>
>> >> >> >> >> I am not really sure we really need a node aware interface for memcg.
>> >> >> >> >> The global reclaim interface will likely need a different node because
>> >> >> >> >> we do not want to make this CONFIG_MEMCG constrained.
>> >> >> >> >
>> >> >> >> > A nodemask argument for memory.reclaim can be useful for memory
>> >> >> >> > tiering between NUMA nodes with different performance.  Similar to
>> >> >> >> > proactive reclaim, it can allow a userspace daemon to drive
>> >> >> >> > memcg-based proactive demotion via the reclaim-based demotion
>> >> >> >> > mechanism in the kernel.
>> >> >> >>
>> >> >> >> I am not sure whether nodemask is a good way for demoting pages between
>> >> >> >> different types of memory.  For example, for a system with DRAM and
>> >> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
>> >> >> >> is the meaning of specifying PMEM node? reclaiming to disk?
>> >> >> >>
>> >> >> >> In general, I have no objection to the idea in general.  But we should
>> >> >> >> have a clear and consistent interface.  Per my understanding the default
>> >> >> >> memcg interface is for memory, regardless of memory types.  The memory
>> >> >> >> reclaiming means reduce the memory usage, regardless of memory types.
>> >> >> >> We need to either extending the semantics of memory reclaiming (to
>> >> >> >> include memory demoting too), or add another interface for memory
>> >> >> >> demoting.
>> >> >> >
>> >> >> > Good point.  With the "demote pages during reclaim" patch series,
>> >> >> > reclaim is already extended to demote pages as well.  For example,
>> >> >> > can_reclaim_anon_pages() returns true if demotion is allowed and
>> >> >> > shrink_page_list() can demote pages instead of reclaiming pages.
>> >> >>
>> >> >> These are in-kernel implementation, not the ABI.  So we still have
>> >> >> the opportunity to define the ABI now.
>> >> >>
>> >> >> > Currently, demotion is disabled for memcg reclaim, which I think can
>> >> >> > be relaxed and also necessary for memcg-based proactive demotion.  I'd
>> >> >> > like to suggest that we extend the semantics of memory.reclaim to
>> >> >> > cover memory demotion as well.  A flag can be used to enable/disable
>> >> >> > the demotion behavior.
>> >> >>
>> >> >> If so,
>> >> >>
>> >> >> # echo A > memory.reclaim
>> >> >>
>> >> >> means
>> >> >>
>> >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is
>> >> >>    used or not.
>> >> >>
>> >> >> or
>> >> >>
>> >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be
>> >> >>    freed, some of them may be just demoted from DRAM to PMEM.  The total
>> >> >>    number is "A".
>> >> >>
>> >> >> For me, a) looks more reasonable.
>> >> >>
>> >> >
>> >> > We can use a DEMOTE flag to control the demotion behavior for
>> >> > memory.reclaim.  If the flag is not set (the default), then
>> >> > no_demotion of scan_control can be set to 1, similar to
>> >> > reclaim_pages().
>> >>
>> >> If we have to use a flag to control the behavior, I think it's better to
>> >> have a separate interface (e.g. memory.demote).  But do we really need b)?
>> >>
>> >
>> > I am fine with either approach: a separate interface similar to
>> > memory.reclaim, but dedicated to demotion, or multiplexing
>> > memory.reclaim for demotion with a flag.
>> >
>> > My understanding is that with the "demote pages during reclaim"
>> > support, b) is the expected behavior, or more precisely, pages that
>> > cannot be demoted may be freed or swapped out.  This is reasonable.
>> > Demotion-only can also be supported via some arguments to the
>> > interface and changes to demotion code in the kernel.  After all, this
>> > interface is being designed to be extensible based on the discussions
>> > so far.
>>
>> I think we should define the interface not from the current
>> implementation point of view, but from the requirement point of view.
>> For proactive reclaim, per my understanding, the requirement is,
>>
>>   we found that there's some cold pages in some workloads, so we can
>>   take advantage of the proactive reclaim to reclaim some pages so that
>>   other workload can use the freed memory.
>>
>> For proactive demotion, per my understanding, the requirement could be,
>>
>>   We found that there's some cold pages in fast memory (e.g. DRAM) in
>>   some workloads, so we can take advantage of the proactive demotion to
>>   demote some pages so that other workload can use the freed fast
>>   memory.  Given the DRAM partition support Tim (Cced) is working on.
>>
>> Why do we need something in the middle?
>
> Maybe there is some misunderstanding.  As you said, demotion is to
> free up fast memory.  If pages on fast memory cannot be demoted, but
> can still be reclaimed to free some fast memory, it is useful, too.
> Certainly, we can also add the support and configure the policy to
> only demote, not reclaim, from fast memory in such cases.

Yes.  I think it may be better to demote from fast memory nodes only in
such cases.  We just free some fast memory proactively.  But we may
reclaim from the slow memory nodes (which are demotion targets) if
necessary.

> In any case, we will not reclaim from slow memory for demotion,

If there are no free pages in the slow memory to accommodate the demoted
pages, why not just reclaim some pages in the slow memory?  What are the
disadvantages of doing that?

> if that is the middle thing you refer to.

No.  I mean,

If we reclaim "A" pages proactively, we will free "A" pages, possibly from
slow memory first.  The target is the total memory size of a memcg.

If we demote "A" pages proactively, we will demote "A" pages from fast
memory to slow memory.  The target is the fast memory size of a memcg.
In the process, some slow memory may be reclaimed to accommodate the
demoted pages.

For me, the middle thing is,

If we demote some pages from fast memory to slow memory proactively and
free some pages from slow memory at the same time, the total number
(demote + free) is "A".  There's no clear target.  I think this is
confusing.  Per my understanding, you don't suggest this either.

> This is why nodemask is
> proposed for memory.reclaim to support the demotion use case.  With a
> separate memory.demote interface and memory tiering topology among
> NUMA nodes being well defined by the kernel and shared with the
> userspace, we can omit the nodemask argument.

Yes.  Both seem to work.

Best Regards,
Huang, Ying

>>
>> >> > The question is then whether we want to rename memory.reclaim to
>> >> > something more general.  I think this name is fine if reclaim-based
>> >> > demotion is an accepted concept.
>> >>
>> >> Best Regards,
>> >> Huang, Ying

Wei Xu April 6, 2022, 8:16 p.m. UTC | #52

On Wed, Apr 6, 2022 at 1:49 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Tue, Apr 5, 2022 at 11:32 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Wei Xu <weixugc@google.com> writes:
> >>
> >> > On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Wei Xu <weixugc@google.com> writes:
> >> >>
> >> >> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >>
> >> >> >> Wei Xu <weixugc@google.com> writes:
> >> >> >>
> >> >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> >> >>
> >> >> >> >> Wei Xu <weixugc@google.com> writes:
> >> >> >> >>
> >> >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
> >> >> >> >> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> [snip]
> >> >> >> >>
> >> >> >> >> >> > Possible Extensions:
> >> >> >> >> >> > --------------------
> >> >> >> >> >> >
> >> >> >> >> >> > - This interface can be extended with an additional parameter or flags
> >> >> >> >> >> >   to allow specifying one or more types of memory to reclaim from (e.g.
> >> >> >> >> >> >   file, anon, ..).
> >> >> >> >> >> >
> >> >> >> >> >> > - The interface can also be extended with a node mask to reclaim from
> >> >> >> >> >> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >> >> >> >> >> >   tiering systems.
> >> >> >> >> >> >
> >> >> >> >> >> > - A similar per-node interface can also be added to support proactive
> >> >> >> >> >> >   reclaim and reclaim-based demotion in systems without memcg.
> >> >> >> >> >> >
> >> >> >> >> >> > For now, let's keep things simple by adding the basic functionality.
> >> >> >> >> >>
> >> >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minimum
> >> >> >> >> >> interface. But it is not really clear how do you want to add flags on
> >> >> >> >> >> top of it?
> >> >> >> >> >>
> >> >> >> >> >> I am not really sure we really need a node aware interface for memcg.
> >> >> >> >> >> The global reclaim interface will likely need a different node because
> >> >> >> >> >> we do not want to make this CONFIG_MEMCG constrained.
> >> >> >> >> >
> >> >> >> >> > A nodemask argument for memory.reclaim can be useful for memory
> >> >> >> >> > tiering between NUMA nodes with different performance.  Similar to
> >> >> >> >> > proactive reclaim, it can allow a userspace daemon to drive
> >> >> >> >> > memcg-based proactive demotion via the reclaim-based demotion
> >> >> >> >> > mechanism in the kernel.
> >> >> >> >>
> >> >> >> >> I am not sure whether nodemask is a good way for demoting pages between
> >> >> >> >> different types of memory.  For example, for a system with DRAM and
> >> >> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what
> >> >> >> >> is the meaning of specifying PMEM node? reclaiming to disk?
> >> >> >> >>
> >> >> >> >> I have no objection to the idea in general.  But we should
> >> >> >> >> have a clear and consistent interface.  Per my understanding the default
> >> >> >> >> memcg interface is for memory, regardless of memory types.  Memory
> >> >> >> >> reclaiming means reducing the memory usage, regardless of memory types.
> >> >> >> >> We need to either extend the semantics of memory reclaiming (to
> >> >> >> >> include memory demoting too), or add another interface for memory
> >> >> >> >> demoting.
> >> >> >> >
> >> >> >> > Good point.  With the "demote pages during reclaim" patch series,
> >> >> >> > reclaim is already extended to demote pages as well.  For example,
> >> >> >> > can_reclaim_anon_pages() returns true if demotion is allowed and
> >> >> >> > shrink_page_list() can demote pages instead of reclaiming pages.
> >> >> >>
> >> >> >> These are in-kernel implementation, not the ABI.  So we still have
> >> >> >> the opportunity to define the ABI now.
> >> >> >>
> >> >> >> > Currently, demotion is disabled for memcg reclaim, which I think can
> >> >> >> > be relaxed and also necessary for memcg-based proactive demotion.  I'd
> >> >> >> > like to suggest that we extend the semantics of memory.reclaim to
> >> >> >> > cover memory demotion as well.  A flag can be used to enable/disable
> >> >> >> > the demotion behavior.
> >> >> >>
> >> >> >> If so,
> >> >> >>
> >> >> >> # echo A > memory.reclaim
> >> >> >>
> >> >> >> means
> >> >> >>
> >> >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is
> >> >> >>    used or not.
> >> >> >>
> >> >> >> or
> >> >> >>
> >> >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be
> >> >> >>    freed, some of them may be just demoted from DRAM to PMEM.  The total
> >> >> >>    number is "A".
> >> >> >>
> >> >> >> For me, a) looks more reasonable.
> >> >> >>
> >> >> >
> >> >> > We can use a DEMOTE flag to control the demotion behavior for
> >> >> > memory.reclaim.  If the flag is not set (the default), then
> >> >> > no_demotion of scan_control can be set to 1, similar to
> >> >> > reclaim_pages().
> >> >>
> >> >> If we have to use a flag to control the behavior, I think it's better to
> >> >> have a separate interface (e.g. memory.demote).  But do we really need b)?
> >> >>
> >> >
> >> > I am fine with either approach: a separate interface similar to
> >> > memory.reclaim, but dedicated to demotion, or multiplexing
> >> > memory.reclaim for demotion with a flag.
> >> >
> >> > My understanding is that with the "demote pages during reclaim"
> >> > support, b) is the expected behavior, or more precisely, pages that
> >> > cannot be demoted may be freed or swapped out.  This is reasonable.
> >> > Demotion-only can also be supported via some arguments to the
> >> > interface and changes to demotion code in the kernel.  After all, this
> >> > interface is being designed to be extensible based on the discussions
> >> > so far.
> >>
> >> I think we should define the interface not from the current
> >> implementation point of view, but from the requirement point of view.
> >> For proactive reclaim, per my understanding, the requirement is,
> >>
> >>   we found that there's some cold pages in some workloads, so we can
> >>   take advantage of the proactive reclaim to reclaim some pages so that
> >>   other workload can use the freed memory.
> >>
> >> For proactive demotion, per my understanding, the requirement could be,
> >>
> >>   We found that there's some cold pages in fast memory (e.g. DRAM) in
> >>   some workloads, so we can take advantage of the proactive demotion to
> >>   demote some pages so that other workload can use the freed fast
> >>   memory.  Given the DRAM partition support Tim (Cced) is working on.
> >>
> >> Why do we need something in the middle?
> >
> > Maybe there is some misunderstanding.  As you said, demotion is to
> > free up fast memory.  If pages on fast memory cannot be demoted, but
> > can still be reclaimed to free some fast memory, it is useful, too.
> > Certainly, we can also add the support and configure the policy to
> > only demote, not reclaim, from fast memory in such cases.
>
> Yes.  I think it may be better to demote from fast memory nodes only in
> such cases.  We just free some fast memory proactively.  But we may
> reclaim from the slow memory nodes (which are demotion targets) if
> necessary.

It can be a policy choice whether to reclaim from slow memory nodes
for demotion, or reclaim the pages directly from fast memory nodes, or
do nothing, if there isn't enough free space on slow memory nodes for
a proactive demotion request.  For example, if the file pages on fast
memory are clean and cold enough, they can be discarded, which should
be cheaper than reclaiming from slow memory nodes and then demoting
these pages.  A policy for such behaviors can be specified as an
argument to the proactive demotion interface when it is desired.

> > In any case, we will not reclaim from slow memory for demotion,
>
> If there are no free pages in the slow memory to accommodate the demoted
> pages, why not just reclaim some pages in the slow memory?  What are the
> disadvantages of doing that?

We can certainly do what you have described through a policy argument.
What I meant is that we will not ask directly via the proactive
demotion interface to reclaim from slow memory nodes and count the
reclaimed bytes as part of the requested bytes.

> > if that is the middle thing you refer to.
>
> No.  I mean,
>
> If we reclaim "A" pages proactively, we will free "A" pages, possibly from
> slow memory first.  The target is the total memory size of a memcg.
>
> If we demote "A" pages proactively, we will demote "A" pages from fast
> memory to slow memory.  The target is the fast memory size of a memcg.
> In the process, some slow memory may be reclaimed to accommodate the
> demoted pages.
>
> For me, the middle thing is,
>
> If we demote some pages from fast memory to slow memory proactively and
> free some pages from slow memory at the same time, the total number
> (demote + free) is "A".  There's no clear target.  I think this is
> confusing.  Per my understanding, you don't suggest this either.

I agree and don't suggest this middle thing, either.

> > This is why nodemask is
> > proposed for memory.reclaim to support the demotion use case.  With a
> > separate memory.demote interface and memory tiering topology among
> > NUMA nodes being well defined by the kernel and shared with the
> > userspace, we can omit the nodemask argument.
>
> Yes.  Both seem to work.
>
> Best Regards,
> Huang, Ying
>
> >>
> >> >> > The question is then whether we want to rename memory.reclaim to
> >> >> > something more general.  I think this name is fine if reclaim-based
> >> >> > demotion is an accepted concept.
> >> >>
> >> >> Best Regards,
> >> >> Huang, Ying

Michal Hocko April 7, 2022, 7:35 a.m. UTC | #53

On Wed 06-04-22 14:32:24, Huang, Ying wrote:
[...]
> I think we should define the interface not from the current
> implementation point of view, but from the requirement point of view.

Agreed!

> For proactive reclaim, per my understanding, the requirement is,
> 
>   we found that there's some cold pages in some workloads, so we can
>   take advantage of the proactive reclaim to reclaim some pages so that
>   other workload can use the freed memory.

We are talking about memcg here, so this is less a matter of freeing
memory than of decreasing the amount of charged memory. Demotion
cannot achieve that.

> For proactive demotion, per my understanding, the requirement could be,
> 
>   We found that there's some cold pages in fast memory (e.g. DRAM) in
>   some workloads, so we can take advantage of the proactive demotion to
>   demote some pages so that other workload can use the freed fast
>   memory.  Given the DRAM partition support Tim (Cced) is working on.

Yes, this is essentially kernel-assisted memory migration. Userspace
can migrate memory, but the issue is that it doesn't have any information
on page aging, so it has a hard time finding suitable memory to
migrate. If we really need this functionality then it would deserve a
separate interface IMHO.
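
The two targets discussed above can be told apart with a little arithmetic.
The following is only an illustrative sketch, not code from the patch: the
function names are hypothetical, and the byte counts are assumed to come from
whatever accounting a userspace tool has available.  It shows why demoted
pages do not help a request whose target is the memcg charge, while they do
help a request whose target is fast-memory usage.

```shell
# charge_drop FREED_BYTES DEMOTED_BYTES
# Demoted pages stay charged to the memcg, so only freed bytes lower the
# charge.  The second argument is accepted for symmetry and is not counted.
charge_drop() {
    echo "$1"
}

# fast_tier_drop FREED_FROM_FAST_BYTES DEMOTED_BYTES
# Freeing a page from fast memory and demoting it both take the page out of
# the fast tier, so both count toward a fast-memory target.
fast_tier_drop() {
    echo $(( $1 + $2 ))
}
```

For example, `charge_drop 4096 8192` prints 4096, while
`fast_tier_drop 4096 8192` prints 12288.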

Tim Chen April 7, 2022, 9:26 p.m. UTC | #54

On Wed, 2022-04-06 at 10:49 +0800, Huang, Ying wrote:
> 
> > > If so,
> > > 
> > > # echo A > memory.reclaim
> > > 
> > > means
> > > 
> > > a) "A" bytes memory are freed from the memcg, regardless demoting is
> > >    used or not.
> > > 
> > > or
> > > 
> > > b) "A" bytes memory are reclaimed from the memcg, some of them may be
> > >    freed, some of them may be just demoted from DRAM to PMEM.  The total
> > >    number is "A".
> > > 
> > > For me, a) looks more reasonable.
> > > 
> > 
> > We can use a DEMOTE flag to control the demotion behavior for
> > memory.reclaim.  If the flag is not set (the default), then
> > no_demotion of scan_control can be set to 1, similar to
> > reclaim_pages().
> 
> If we have to use a flag to control the behavior, I think it's better to
> have a separate interface (e.g. memory.demote).  But do we really need b)?
> 
> > The question is then whether we want to rename memory.reclaim to
> > something more general.  I think this name is fine if reclaim-based
> > demotion is an accepted concept.
> 

memory.demote will work for 2 levels of memory tiers.  But when we have 3 levels
of memory (e.g. high bandwidth memory, DRAM and PMEM),
it gets ambiguous again whether we should demote from high bandwidth memory
or DRAM.

Will something like this be more general?

echo X > memory_[dram,pmem,hbm].reclaim

So echo X > memory_dram.reclaim
means that we want to free up X bytes from DRAM for the mem cgroup.

echo demote > memory_dram.reclaim_policy

This means that we prefer demotion for reclaim instead
of swapping to disk.


Tim

Wei Xu April 7, 2022, 10:07 p.m. UTC | #55

On Thu, Apr 7, 2022 at 2:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:

> On Wed, 2022-04-06 at 10:49 +0800, Huang, Ying wrote:
> >
> > > > If so,
> > > >
> > > > # echo A > memory.reclaim
> > > >
> > > > means
> > > >
> > > > a) "A" bytes memory are freed from the memcg, regardless demoting is
> > > >    used or not.
> > > >
> > > > or
> > > >
> > > > b) "A" bytes memory are reclaimed from the memcg, some of them may be
> > > >    freed, some of them may be just demoted from DRAM to PMEM.  The total
> > > >    number is "A".
> > > >
> > > > For me, a) looks more reasonable.
> > > >
> > >
> > > We can use a DEMOTE flag to control the demotion behavior for
> > > memory.reclaim.  If the flag is not set (the default), then
> > > no_demotion of scan_control can be set to 1, similar to
> > > reclaim_pages().
> >
> > If we have to use a flag to control the behavior, I think it's better to
> > have a separate interface (e.g. memory.demote).  But do we really need b)?
> >
> > > The question is then whether we want to rename memory.reclaim to
> > > something more general.  I think this name is fine if reclaim-based
> > > demotion is an accepted concept.
> >
>
> memory.demote will work for 2 levels of memory tiers.  But when we have 3 levels
> of memory (e.g. high bandwidth memory, DRAM and PMEM),
> it gets ambiguous again whether we should demote from high bandwidth memory
> or DRAM.
>
> Will something like this be more general?
>
> echo X > memory_[dram,pmem,hbm].reclaim
>
> So echo X > memory_dram.reclaim
> means that we want to free up X bytes from DRAM for the mem cgroup.
>
> echo demote > memory_dram.reclaim_policy
>
> This means that we prefer demotion for reclaim instead
> of swapping to disk.
>
>
memory.demote can work with any level of memory tiers if a nodemask
argument (or a tier argument if there is a more-explicitly defined,
userspace visible tiering representation) is provided.  The semantics can
be to demote X bytes from these nodes to their next tier.

memory_dram/memory_pmem assumes the hardware for a particular memory tier,
which is undesirable.  For example, it is entirely possible that a slow
memory tier is implemented by a lower-cost/lower-performance DDR device
connected via CXL.mem, not by PMEM.  It is better for this interface to
speak in either the NUMA node abstraction or a new tier abstraction.

It is also desirable to make this interface stateless, i.e. not to require
the setting of memory_dram.reclaim_policy.  Any policy can be specified as
arguments to the request itself and should only affect that particular
request.

Wei

Wei Xu April 7, 2022, 10:12 p.m. UTC | #56

On Thu, Apr 7, 2022 at 2:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Wed, 2022-04-06 at 10:49 +0800, Huang, Ying wrote:
> >
> > > > If so,
> > > >
> > > > # echo A > memory.reclaim
> > > >
> > > > means
> > > >
> > > > a) "A" bytes memory are freed from the memcg, regardless demoting is
> > > >    used or not.
> > > >
> > > > or
> > > >
> > > > b) "A" bytes memory are reclaimed from the memcg, some of them may be
> > > >    freed, some of them may be just demoted from DRAM to PMEM.  The total
> > > >    number is "A".
> > > >
> > > > For me, a) looks more reasonable.
> > > >
> > >
> > > We can use a DEMOTE flag to control the demotion behavior for
> > > memory.reclaim.  If the flag is not set (the default), then
> > > no_demotion of scan_control can be set to 1, similar to
> > > reclaim_pages().
> >
> > If we have to use a flag to control the behavior, I think it's better to
> > have a separate interface (e.g. memory.demote).  But do we really need b)?
> >
> > > The question is then whether we want to rename memory.reclaim to
> > > something more general.  I think this name is fine if reclaim-based
> > > demotion is an accepted concept.
> >
>
> memory.demote will work for 2 levels of memory tiers.  But when we have 3 levels
> of memory (e.g. high bandwidth memory, DRAM and PMEM),
> it gets ambiguous again whether we should demote from high bandwidth memory
> or DRAM.
>
> Will something like this be more general?
>
> echo X > memory_[dram,pmem,hbm].reclaim
>
> So echo X > memory_dram.reclaim
> means that we want to free up X bytes from DRAM for the mem cgroup.
>
> echo demote > memory_dram.reclaim_policy
>
> This means that we prefer demotion for reclaim instead
> of swapping to disk.
>

(resending in plain-text, sorry).

memory.demote can work with any level of memory tiers if a nodemask
argument (or a tier argument if there is a more-explicitly defined,
userspace visible tiering representation) is provided.  The semantics
can be to demote X bytes from these nodes to their next tier.

memory_dram/memory_pmem assumes the hardware for a particular memory
tier, which is undesirable.  For example, it is entirely possible that
a slow memory tier is implemented by a lower-cost/lower-performance
DDR device connected via CXL.mem, not by PMEM.  It is better for this
interface to speak in either the NUMA node abstraction or a new tier
abstraction.

It is also desirable to make this interface stateless, i.e. not to
require the setting of memory_dram.reclaim_policy.  Any policy can be
specified as arguments to the request itself and should only affect
that particular request.

Wei

Tim Chen April 7, 2022, 11:11 p.m. UTC | #57

On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:

> 
> (resending in plain-text, sorry).
> 
> memory.demote can work with any level of memory tiers if a nodemask
> argument (or a tier argument if there is a more-explicitly defined,
> userspace visible tiering representation) is provided.  The semantics
> can be to demote X bytes from these nodes to their next tier.
> 

We do need some kind of userspace visible tiering representation.
It would be nice if I could tell the memory type and nodemask of the nodes in tier Y with

cat memory.tier_Y


> memory_dram/memory_pmem assumes the hardware for a particular memory
> tier, which is undesirable.  For example, it is entirely possible that
> a slow memory tier is implemented by a lower-cost/lower-performance
> DDR device connected via CXL.mem, not by PMEM.  It is better for this
> interface to speak in either the NUMA node abstraction or a new tier
> abstraction.

Just from the perspective of memory.reclaim and memory.demote, I think
they could work with nodemask.  For ease of management,
some kind of abstraction of tier information like nodemask, memory type
and expected performance should be readily accessible by user space.  

Tim

> 
> It is also desirable to make this interface stateless, i.e. not to
> require the setting of memory_dram.reclaim_policy.  Any policy can be
> specified as arguments to the request itself and should only affect
> that particular request.
> 
> Wei

Wei Xu April 8, 2022, 2:10 a.m. UTC | #58

On Thu, Apr 7, 2022 at 4:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:
>
> >
> > (resending in plain-text, sorry).
> >
> > memory.demote can work with any level of memory tiers if a nodemask
> > argument (or a tier argument if there is a more-explicitly defined,
> > userspace visible tiering representation) is provided.  The semantics
> > can be to demote X bytes from these nodes to their next tier.
> >
>
> We do need some kind of userspace visible tiering representation.
> It would be nice if I could tell the memory type and nodemask of the nodes in tier Y with
>
> cat memory.tier_Y
>
>
> > memory_dram/memory_pmem assumes the hardware for a particular memory
> > tier, which is undesirable.  For example, it is entirely possible that
> > a slow memory tier is implemented by a lower-cost/lower-performance
> > DDR device connected via CXL.mem, not by PMEM.  It is better for this
> > interface to speak in either the NUMA node abstraction or a new tier
> > abstraction.
>
> Just from the perspective of memory.reclaim and memory.demote, I think
> they could work with nodemask.  For ease of management,
> some kind of abstraction of tier information like nodemask, memory type
> and expected performance should be readily accessible by user space.
>

I agree.  The tier information should be provided at the system level.
One suggestion is to have a new directory "/sys/devices/system/tier/"
for tiers, e.g.:

/sys/devices/system/tier/tier0/memlist: all memory nodes in tier 0.
/sys/devices/system/tier/tier1/memlist: all memory nodes in tier 1.

We can discuss this tier representation in a new thread.

> Tim
>
> >
> > It is also desirable to make this interface stateless, i.e. not to
> > require the setting of memory_dram.reclaim_policy.  Any policy can be
> > specified as arguments to the request itself and should only affect
> > that particular request.
> >
> > Wei
>

Huang, Ying April 8, 2022, 3:08 a.m. UTC | #59

Wei Xu <weixugc@google.com> writes:

> On Thu, Apr 7, 2022 at 4:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>> On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:
>>
>> >
>> > (resending in plain-text, sorry).
>> >
>> > memory.demote can work with any level of memory tiers if a nodemask
>> > argument (or a tier argument if there is a more-explicitly defined,
>> > userspace visible tiering representation) is provided.  The semantics
>> > can be to demote X bytes from these nodes to their next tier.
>> >
>>
>> We do need some kind of userspace visible tiering representation.
>> It would be nice if I could tell the memory type and nodemask of the nodes in tier Y with
>>
>> cat memory.tier_Y
>>
>>
>> > memory_dram/memory_pmem assumes the hardware for a particular memory
>> > tier, which is undesirable.  For example, it is entirely possible that
>> > a slow memory tier is implemented by a lower-cost/lower-performance
>> > DDR device connected via CXL.mem, not by PMEM.  It is better for this
>> > interface to speak in either the NUMA node abstraction or a new tier
>> > abstraction.
>>
>> Just from the perspective of memory.reclaim and memory.demote, I think
>> they could work with nodemask.  For ease of management,
>> some kind of abstraction of tier information like nodemask, memory type
>> and expected performance should be readily accessible by user space.
>>
>
> I agree.  The tier information should be provided at the system level.
> One suggestion is to have a new directory "/sys/devices/system/tier/"
> for tiers, e.g.:
>
> /sys/devices/system/tier/tier0/memlist: all memory nodes in tier 0.
> /sys/devices/system/tier/tier1/memlist: all memory nodes in tier 1.

I think that it may be sufficient to make tier an attribute of "node".
Something like,

/sys/devices/system/node/nodeX/memory_tier

Best Regards,
Huang, Ying

> We can discuss this tier representation in a new thread.
>
>> Tim
>>
>> >
>> > It is also desirable to make this interface stateless, i.e. not to
>> > require the setting of memory_dram.reclaim_policy.  Any policy can be
>> > specified as arguments to the request itself and should only affect
>> > that particular request.
>> >
>> > Wei
>>

Wei Xu April 8, 2022, 4:10 a.m. UTC | #60

On Thu, Apr 7, 2022 at 8:08 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Thu, Apr 7, 2022 at 4:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >>
> >> On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:
> >>
> >> >
> >> > (resending in plain-text, sorry).
> >> >
> >> > memory.demote can work with any level of memory tiers if a nodemask
> >> > argument (or a tier argument if there is a more-explicitly defined,
> >> > userspace visible tiering representation) is provided.  The semantics
> >> > can be to demote X bytes from these nodes to their next tier.
> >> >
> >>
> >> We do need some kind of userspace visible tiering representation.
> >> It would be nice if I could tell the memory type and nodemask of the nodes in tier Y with
> >>
> >> cat memory.tier_Y
> >>
> >>
> >> > memory_dram/memory_pmem assumes the hardware for a particular memory
> >> > tier, which is undesirable.  For example, it is entirely possible that
> >> > a slow memory tier is implemented by a lower-cost/lower-performance
> >> > DDR device connected via CXL.mem, not by PMEM.  It is better for this
> >> > interface to speak in either the NUMA node abstraction or a new tier
> >> > abstraction.
> >>
> >> Just from the perspective of memory.reclaim and memory.demote, I think
> >> they could work with nodemask.  For ease of management,
> >> some kind of abstraction of tier information like nodemask, memory type
> >> and expected performance should be readily accessible by user space.
> >>
> >
> > I agree.  The tier information should be provided at the system level.
> > One suggestion is to have a new directory "/sys/devices/system/tier/"
> > for tiers, e.g.:
> >
> > /sys/devices/system/tier/tier0/memlist: all memory nodes in tier 0.
> > /sys/devices/system/tier/tier1/memlist: all memory nodes in tier 1.
>
> I think that it may be sufficient to make tier an attribute of "node".
> Something like,
>
> /sys/devices/system/node/nodeX/memory_tier
>

This works. If we want additional information about each tier, we can
then add a tier-specific subtree.

In addition, it would be good to also expose the demotion target nodes
(node_demotion[]) via sysfs, e.g.:

/sys/devices/system/node/nodeX/demotion_path

which returns node_demotion[X].

> Best Regards,
> Huang, Ying
>
> > We can discuss this tier representation in a new thread.
> >
> >> Tim
> >>
> >> >
> >> > It is also desirable to make this interface stateless, i.e. not to
> >> > require the setting of memory_dram.reclaim_policy.  Any policy can be
> >> > specified as arguments to the request itself and should only affect
> >> > that particular request.
> >> >
> >> > Wei
> >>
>
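
If tier membership were exposed as a per-node attribute, as suggested above,
a userspace tool could rebuild the node list of a tier by scanning the node
directories.  The sketch below assumes a memory_tier file under each nodeX
directory that holds a single tier number.  That file is only a proposal in
this thread, so both its name and its content format are assumptions, and
nodes_in_tier is a hypothetical helper.

```shell
# nodes_in_tier NODE_SYSFS_DIR TIER
# Prints, one per line, the IDs of the nodes whose (hypothetical)
# memory_tier attribute equals TIER.
nodes_in_tier() {
    node_root=$1
    want=$2
    for f in "$node_root"/node*/memory_tier; do
        [ -r "$f" ] || continue      # glob did not match, or file unreadable
        tier=$(cat "$f")
        if [ "$tier" = "$want" ]; then
            node=${f%/memory_tier}   # drop the attribute name
            node=${node##*/node}     # keep only the numeric node ID
            echo "$node"
        fi
    done
}
```

On a system that provided such an attribute, the call would look like
`nodes_in_tier /sys/devices/system/node 1`.  A per-tier memlist file, the
other layout proposed above, would make this scan unnecessary at the cost of
a separate directory tree.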

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69d7a6983f78..925aaabb2247 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1208,6 +1208,15 @@  PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.reclaim
+	A write-only file which exists on non-root cgroups.
+
+	This is a simple interface to trigger memory reclaim in the
+	target cgroup. Write the number of bytes to reclaim to this
+	file and the kernel will try to reclaim that much memory.
+	Please note that the kernel can over or under reclaim from
+	the target cgroup.
+
   memory.oom.group
 	A read-write single value file which exists on non-root
 	cgroups.  The default value is "0".
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 725f76723220..994849fab7df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6355,6 +6355,38 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+
+		reclaimed = try_to_free_mem_cgroup_pages(memcg,
+						nr_to_reclaim - nr_reclaimed,
+						GFP_KERNEL, true);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6413,6 +6445,11 @@  static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "reclaim",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.write = memory_reclaim,
+	},
 	{ }	/* terminate */
 };
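
From userspace, the interface added by this patch is driven by a single
write.  The following is a minimal sketch of one step of a proactive
reclaimer, assuming a cgroup v2 directory that contains both memory.current
and memory.reclaim.  The helper name memcg_reclaim is hypothetical.  Because
the write handler above returns nbytes whether or not the full amount was
reclaimed, the sketch compares memory.current before and after the write
instead of relying on the write's result.

```shell
# memcg_reclaim CGROUP_DIR BYTES
# Asks the kernel to reclaim BYTES from the cgroup and prints the observed
# drop in memory.current (negative if usage grew in the meantime).
memcg_reclaim() {
    cg_dir=$1
    bytes=$2

    before=$(cat "$cg_dir/memory.current") || return 1

    # The write is the whole request.  A successful write does not
    # guarantee that the requested amount was reclaimed.
    printf '%s\n' "$bytes" > "$cg_dir/memory.reclaim" || return 1

    after=$(cat "$cg_dir/memory.current") || return 1
    echo $((before - after))
}
```

A call might look like `memcg_reclaim /sys/fs/cgroup/workload 10M`, where
the cgroup path is only an example.  The value is parsed by
page_counter_memparse(), so a size suffix such as M should be accepted.  The
file does not exist on the root cgroup.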