Message ID | 20220331084151.2600229-1-yosryahmed@google.com (mailing list archive) |
---|---|
State | New |
Series | [resend] memcg: introduce per-memcg reclaim interface |
On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > From: Shakeel Butt <shakeelb@google.com> > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use case: Proactive Reclaim > --------------------------- > > A userspace proactive reclaimer can continuously probe the memcg to > reclaim a small amount of memory. This gives more accurate and > up-to-date workingset estimation as the LRUs are continuously > sorted and can potentially provide more deterministic memory > overcommit behavior. The memory overcommit controller can provide > more proactive response to the changing behavior of the running > applications instead of being reactive. > > A userspace reclaimer's purpose in this case is not a complete replacement > for kswapd or direct reclaim, it is to proactively identify memory savings > opportunities and reclaim some amount of cold pages set by the policy > to free up the memory for more demanding jobs or scheduling new jobs. > > A user space proactive reclaimer is used in Google data centers. > Additionally, Meta's TMO paper recently referenced a very similar > interface used for user space proactive reclaim: > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > Benefits of a user space reclaimer: > ----------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to be centralized. > > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. > > Why memory.high is not enough? 
> ------------------------------
>
> - memory.high can be used to trigger reclaim in a memcg and can
>   potentially be used for proactive reclaim.
>   However there is a big downside in using memory.high. It can potentially
>   introduce high reclaim stalls in the target application as the
>   allocations from the processes or the threads of the application can hit
>   the temporary memory.high limit.
>
> - Userspace proactive reclaimers usually use feedback loops to decide
>   how much memory to proactively reclaim from a workload. The metrics
>   used for this are usually either refaults or PSI, and these metrics
>   will become messy if the application gets throttled by hitting the
>   high limit.
>
> - memory.high is a stateful interface, if the userspace proactive
>   reclaimer crashes for any reason while triggering reclaim it can leave
>   the application in a bad state.
>
> - If a workload is rapidly expanding, setting memory.high to proactively
>   reclaim memory can result in actually reclaiming more memory than
>   intended.
>
> The benefits of such interface and shortcomings of existing interface
> were further discussed in this RFC thread:
> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/

Hello!

I'm totally up for the proposed feature! It makes total sense and is proved
to be useful, let's add it.

>
> Interface:
> ----------
>
> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
> trigger reclaim in the target memory cgroup.
>
>
> Possible Extensions:
> --------------------
>
> - This interface can be extended with an additional parameter or flags
>   to allow specifying one or more types of memory to reclaim from (e.g.
>   file, anon, ..).
>
> - The interface can also be extended with a node mask to reclaim from
>   specific nodes. This has use cases for reclaim-based demotion in memory
>   tiering systems.
>
> - A similar per-node interface can also be added to support proactive
>   reclaim and reclaim-based demotion in systems without memcg.

Maybe an option to specify a timeout? That might simplify the userspace part.

Also, please please add a test to selftests/cgroup/memcg tests.
It will also provide an example of how userspace can use the feature.

>
> For now, let's keep things simple by adding the basic functionality.

What I'm worried about is how we are going to extend it. How do you see the
interface with 2-3 extensions from the list above? All these extensions look
very reasonable to me, so we'll likely have to implement them soon. So let's
think about the extensibility now.

I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
In the end, such a feature might make sense on the system level too.
Yes, there is the drop_caches sysctl, but it's too radical for many cases.

>
> [yosryahmed@google.com: refreshed to current master, updated commit
>  message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
>  mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69d7a6983f78..925aaabb2247 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
>  	high limit is used and monitored properly, this limit's
>  	utility is limited to providing the final safety net.
>
> +  memory.reclaim
> +	A write-only file which exists on non-root cgroups.
> +
> +	This is a simple interface to trigger memory reclaim in the
> +	target cgroup. Write the number of bytes to reclaim to this
> +	file and the kernel will try to reclaim that much memory.
> +	Please note that the kernel can over or under reclaim from
> +	the target cgroup.
> +
>    memory.oom.group
>  	A read-write single value file which exists on non-root
>  	cgroups. The default value is "0".
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 725f76723220..994849fab7df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}
> +
> +	return nbytes;
> +}
> +
>  static struct cftype memory_files[] = {
>  	{
>  		.name = "current",
> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
>  		.seq_show = memory_oom_group_show,
>  		.write = memory_oom_group_write,
>  	},
> +	{
> +		.name = "reclaim",
> +		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> +		.write = memory_reclaim,

Btw, why not on root?
On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > From: Shakeel Butt <shakeelb@google.com> > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use case: Proactive Reclaim > --------------------------- > > A userspace proactive reclaimer can continuously probe the memcg to > reclaim a small amount of memory. This gives more accurate and > up-to-date workingset estimation as the LRUs are continuously > sorted and can potentially provide more deterministic memory > overcommit behavior. The memory overcommit controller can provide > more proactive response to the changing behavior of the running > applications instead of being reactive. > > A userspace reclaimer's purpose in this case is not a complete replacement > for kswapd or direct reclaim, it is to proactively identify memory savings > opportunities and reclaim some amount of cold pages set by the policy > to free up the memory for more demanding jobs or scheduling new jobs. > > A user space proactive reclaimer is used in Google data centers. > Additionally, Meta's TMO paper recently referenced a very similar > interface used for user space proactive reclaim: > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > Benefits of a user space reclaimer: > ----------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to be centralized. > > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. > > Why memory.high is not enough? 
> ------------------------------ > > - memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim. > However there is a big downside in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. > > - Userspace proactive reclaimers usually use feedback loops to decide > how much memory to proactively reclaim from a workload. The metrics > used for this are usually either refaults or PSI, and these metrics > will become messy if the application gets throttled by hitting the > high limit. > > - memory.high is a stateful interface, if the userspace proactive > reclaimer crashes for any reason while triggering reclaim it can leave > the application in a bad state. > > - If a workload is rapidly expanding, setting memory.high to proactively > reclaim memory can result in actually reclaiming more memory than > intended. > > The benefits of such interface and shortcomings of existing interface > were further discussed in this RFC thread: > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > Interface: > ---------- > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. > > > Possible Extensions: > -------------------- > > - This interface can be extended with an additional parameter or flags > to allow specifying one or more types of memory to reclaim from (e.g. > file, anon, ..). > > - The interface can also be extended with a node mask to reclaim from > specific nodes. This has use cases for reclaim-based demotion in memory > tiering systens. > > - A similar per-node interface can also be added to support proactive > reclaim and reclaim-based demotion in systems without memcg. > > For now, let's keep things simple by adding the basic functionality. 
>
> [yosryahmed@google.com: refreshed to current master, updated commit
>  message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks for compiling all the history and arguments around this change!
On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:

> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}

Is there any way in which this can be provoked into triggering the
softlockup detector?

Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel?
Would additional flexibility be gained by letting userspace handle
retrying?
On 2022/3/31 16:41, Yosry Ahmed wrote:
> From: Shakeel Butt <shakeelb@google.com>
>
> Introduce a memcg interface to trigger memory reclaim on a memory cgroup.
>
> Use case: Proactive Reclaim
> ---------------------------
>
> A userspace proactive reclaimer can continuously probe the memcg to
> reclaim a small amount of memory. This gives more accurate and
> up-to-date workingset estimation as the LRUs are continuously
> sorted and can potentially provide more deterministic memory
> overcommit behavior. The memory overcommit controller can provide
> more proactive response to the changing behavior of the running
> applications instead of being reactive.
>
> A userspace reclaimer's purpose in this case is not a complete replacement
> for kswapd or direct reclaim, it is to proactively identify memory savings
> opportunities and reclaim some amount of cold pages set by the policy
> to free up the memory for more demanding jobs or scheduling new jobs.
>
> A user space proactive reclaimer is used in Google data centers.
> Additionally, Meta's TMO paper recently referenced a very similar
> interface used for user space proactive reclaim:
> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731
>
> Benefits of a user space reclaimer:
> -----------------------------------
>
> 1) More flexible on who should be charged for the cpu of the memory
>    reclaim. For proactive reclaim, it makes more sense to be centralized.
>
> 2) More flexible on dedicating the resources (like cpu). The memory
>    overcommit controller can balance the cost between the cpu usage and
>    the memory reclaimed.
>
> 3) Provides a way to the applications to keep their LRUs sorted, so,
>    under memory pressure better reclaim candidates are selected. This also
>    gives more accurate and uptodate notion of working set for an
>    application.
>
> Why memory.high is not enough?
> ------------------------------
>
> - memory.high can be used to trigger reclaim in a memcg and can
>   potentially be used for proactive reclaim.
> However there is a big downside in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. > > - Userspace proactive reclaimers usually use feedback loops to decide > how much memory to proactively reclaim from a workload. The metrics > used for this are usually either refaults or PSI, and these metrics > will become messy if the application gets throttled by hitting the > high limit. > > - memory.high is a stateful interface, if the userspace proactive > reclaimer crashes for any reason while triggering reclaim it can leave > the application in a bad state. > > - If a workload is rapidly expanding, setting memory.high to proactively > reclaim memory can result in actually reclaiming more memory than > intended. > > The benefits of such interface and shortcomings of existing interface > were further discussed in this RFC thread: > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > Interface: > ---------- > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. > > > Possible Extensions: > -------------------- > > - This interface can be extended with an additional parameter or flags > to allow specifying one or more types of memory to reclaim from (e.g. > file, anon, ..). > > - The interface can also be extended with a node mask to reclaim from > specific nodes. This has use cases for reclaim-based demotion in memory > tiering systens. > > - A similar per-node interface can also be added to support proactive > reclaim and reclaim-based demotion in systems without memcg. > > For now, let's keep things simple by adding the basic functionality. 
> > [yosryahmed@google.com: refreshed to current master, updated commit > message based on recent discussions and use cases] > Signed-off-by: Shakeel Butt <shakeelb@google.com> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 69d7a6983f78..925aaabb2247 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > high limit is used and monitored properly, this limit's > utility is limited to providing the final safety net. > > + memory.reclaim > + A write-only file which exists on non-root cgroups. > + > + This is a simple interface to trigger memory reclaim in the > + target cgroup. Write the number of bytes to reclaim to this > + file and the kernel will try to reclaim that much memory. > + Please note that the kernel can over or under reclaim from > + the target cgroup. > + > memory.oom.group > A read-write single value file which exists on non-root > cgroups. The default value is "0". 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 725f76723220..994849fab7df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);

In some scenarios there is a lot of page cache and we only want to reclaim
page cache. How about adding a may_swap option?

> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}
> +
> +	return nbytes;
> +}
> +
>  static struct cftype memory_files[] = {
>  	{
>  		.name = "current",
> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
>  		.seq_show = memory_oom_group_show,
>  		.write = memory_oom_group_write,
>  	},
> +	{
> +		.name = "reclaim",
> +		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> +		.write = memory_reclaim,
> +	},
>  	{ }	/* terminate */
>  };
>
On Thu, Mar 31, 2022 at 5:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> >  	return nbytes;
> >  }
> >
> > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > +			      size_t nbytes, loff_t off)
> > +{
> > +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +	int err;
> > +
> > +	buf = strstrip(buf);
> > +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > +	if (err)
> > +		return err;
> > +
> > +	while (nr_reclaimed < nr_to_reclaim) {
> > +		unsigned long reclaimed;
> > +
> > +		if (signal_pending(current))
> > +			break;
> > +
> > +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +						nr_to_reclaim - nr_reclaimed,
> > +						GFP_KERNEL, true);
> > +
> > +		if (!reclaimed && !nr_retries--)
> > +			break;
> > +
> > +		nr_reclaimed += reclaimed;
> > +	}
>
> Is there any way in which this can be provoked into triggering the
> softlockup detector?

memory.reclaim is similar to memory.high w.r.t. reclaiming memory,
except that memory.reclaim is stateless, while the kernel remembers
the state set by memory.high. So memory.reclaim should not bring in
any new risks of triggering soft lockup, if any.

> Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel?
> Would additional flexibility be gained by letting userspace handle
> retrying?

I agree it is better to retry from the userspace.
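
Editorial sketch, not part of the thread: the idea of retrying from userspace, raised above, might look roughly like the loop below. It mirrors the shape of the in-kernel loop from the patch, but measures progress itself, because with the patch as posted a write reports success regardless of how much was reclaimed. Everything here is an assumption for illustration: the callback types usage_fn and reclaim_fn, the function reclaim_with_retries(), and the fake_memcg stand-in, which replaces real reads of memory.current and writes to memory.reclaim so that the logic can be exercised without a cgroup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callbacks: a real tool might read memory.current in
 * usage_fn and write the byte count to memory.reclaim in reclaim_fn. */
typedef unsigned long long (*usage_fn)(void *ctx);
typedef int (*reclaim_fn)(void *ctx, unsigned long long bytes);

/*
 * Keep asking for the remaining bytes until the target is met, the
 * request fails, or max_retries consecutive-or-not attempts made no
 * progress. Returns the number of bytes observed to be reclaimed.
 */
static unsigned long long reclaim_with_retries(usage_fn usage,
					       reclaim_fn reclaim, void *ctx,
					       unsigned long long target,
					       unsigned int max_retries)
{
	unsigned long long reclaimed = 0;
	unsigned int retries = max_retries;

	assert(usage && reclaim);

	while (reclaimed < target) {
		unsigned long long before = usage(ctx);
		unsigned long long after, progress;

		if (reclaim(ctx, target - reclaimed) < 0)
			break;

		/* Usage may also grow meanwhile; count that as no progress. */
		after = usage(ctx);
		progress = before > after ? before - after : 0;

		if (!progress && !retries--)
			break;

		reclaimed += progress;
	}
	return reclaimed;
}

/* Stand-in for a cgroup: frees at most per_call bytes per request. */
struct fake_memcg {
	unsigned long long usage;
	unsigned long long per_call;
};

static unsigned long long fake_usage(void *ctx)
{
	return ((struct fake_memcg *)ctx)->usage;
}

static int fake_reclaim(void *ctx, unsigned long long bytes)
{
	struct fake_memcg *m = ctx;
	unsigned long long n = bytes;

	if (n > m->per_call)
		n = m->per_call;
	if (n > m->usage)
		n = m->usage;
	m->usage -= n;
	return 0;
}
```

Keeping the policy in userspace lets the caller choose the retry budget, add sleeps or a timeout between attempts, or stop early based on PSI or refault metrics, none of which the fixed in-kernel loop exposes.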
On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > From: Shakeel Butt <shakeelb@google.com> > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use case: Proactive Reclaim > --------------------------- > > A userspace proactive reclaimer can continuously probe the memcg to > reclaim a small amount of memory. This gives more accurate and > up-to-date workingset estimation as the LRUs are continuously > sorted and can potentially provide more deterministic memory > overcommit behavior. The memory overcommit controller can provide > more proactive response to the changing behavior of the running > applications instead of being reactive. > > A userspace reclaimer's purpose in this case is not a complete replacement > for kswapd or direct reclaim, it is to proactively identify memory savings > opportunities and reclaim some amount of cold pages set by the policy > to free up the memory for more demanding jobs or scheduling new jobs. > > A user space proactive reclaimer is used in Google data centers. > Additionally, Meta's TMO paper recently referenced a very similar > interface used for user space proactive reclaim: > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > Benefits of a user space reclaimer: > ----------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to be centralized. > > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. > > Why memory.high is not enough? 
> ------------------------------ > > - memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim. > However there is a big downside in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. > > - Userspace proactive reclaimers usually use feedback loops to decide > how much memory to proactively reclaim from a workload. The metrics > used for this are usually either refaults or PSI, and these metrics > will become messy if the application gets throttled by hitting the > high limit. > > - memory.high is a stateful interface, if the userspace proactive > reclaimer crashes for any reason while triggering reclaim it can leave > the application in a bad state. > > - If a workload is rapidly expanding, setting memory.high to proactively > reclaim memory can result in actually reclaiming more memory than > intended. > > The benefits of such interface and shortcomings of existing interface > were further discussed in this RFC thread: > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > Interface: > ---------- > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. > > > Possible Extensions: > -------------------- > > - This interface can be extended with an additional parameter or flags > to allow specifying one or more types of memory to reclaim from (e.g. > file, anon, ..). > > - The interface can also be extended with a node mask to reclaim from > specific nodes. This has use cases for reclaim-based demotion in memory > tiering systens. > > - A similar per-node interface can also be added to support proactive > reclaim and reclaim-based demotion in systems without memcg. > > For now, let's keep things simple by adding the basic functionality. 
> > [yosryahmed@google.com: refreshed to current master, updated commit > message based on recent discussions and use cases] > Signed-off-by: Shakeel Butt <shakeelb@google.com> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 69d7a6983f78..925aaabb2247 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > high limit is used and monitored properly, this limit's > utility is limited to providing the final safety net. > > + memory.reclaim > + A write-only file which exists on non-root cgroups. > + > + This is a simple interface to trigger memory reclaim in the > + target cgroup. Write the number of bytes to reclaim to this > + file and the kernel will try to reclaim that much memory. > + Please note that the kernel can over or under reclaim from > + the target cgroup. > + > memory.oom.group > A read-write single value file which exists on non-root > cgroups. The default value is "0". 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 725f76723220..994849fab7df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>  	return nbytes;
>  }
>
> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> +			      size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	int err;
> +
> +	buf = strstrip(buf);
> +	err = page_counter_memparse(buf, "", &nr_to_reclaim);
> +	if (err)
> +		return err;
> +
> +	while (nr_reclaimed < nr_to_reclaim) {
> +		unsigned long reclaimed;
> +
> +		if (signal_pending(current))
> +			break;
> +
> +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +						nr_to_reclaim - nr_reclaimed,
> +						GFP_KERNEL, true);
> +
> +		if (!reclaimed && !nr_retries--)
> +			break;
> +
> +		nr_reclaimed += reclaimed;
> +	}
> +
> +	return nbytes;

It is better to return an error code (e.g. -EBUSY) when
memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory,
except if the cgroup memory usage is already 0. We can also return
-EINVAL if nr_to_reclaim is too large (e.g. > limit).

> +}
> +
>  static struct cftype memory_files[] = {
>  	{
>  		.name = "current",
> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
>  		.seq_show = memory_oom_group_show,
>  		.write = memory_oom_group_write,
>  	},
> +	{
> +		.name = "reclaim",
> +		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> +		.write = memory_reclaim,
> +	},
>  	{ }	/* terminate */
>  };
>
> --
> 2.35.1.1021.g381101b075-goog
>
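
Editorial sketch, not part of the thread: the return-value suggestion in the review comment above can be written down as a small decision function. This is a plain userspace illustration of the proposed rule, not kernel code and not what the posted patch does (the patch always returns nbytes). The function name reclaim_write_result() and its parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical decision rule for the write handler's return value:
 *  - -EINVAL if the request exceeds the cgroup's limit,
 *  - -EBUSY if less than requested was reclaimed and the cgroup still
 *    has memory charged to it,
 *  - otherwise the number of bytes consumed from the write buffer.
 */
static ssize_t reclaim_write_result(unsigned long nr_reclaimed,
				    unsigned long nr_to_reclaim,
				    unsigned long usage_after,
				    unsigned long limit,
				    size_t nbytes)
{
	assert(nbytes > 0);

	if (nr_to_reclaim > limit)
		return -EINVAL;

	if (nr_reclaimed < nr_to_reclaim && usage_after != 0)
		return -EBUSY;

	return (ssize_t)nbytes;
}
```

An error return of this kind would give a userspace retry loop a direct signal that the request fell short, instead of requiring it to compare memory usage before and after each write.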
On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > > From: Shakeel Butt <shakeelb@google.com> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use case: Proactive Reclaim > > --------------------------- > > > > A userspace proactive reclaimer can continuously probe the memcg to > > reclaim a small amount of memory. This gives more accurate and > > up-to-date workingset estimation as the LRUs are continuously > > sorted and can potentially provide more deterministic memory > > overcommit behavior. The memory overcommit controller can provide > > more proactive response to the changing behavior of the running > > applications instead of being reactive. > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > for kswapd or direct reclaim, it is to proactively identify memory savings > > opportunities and reclaim some amount of cold pages set by the policy > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > A user space proactive reclaimer is used in Google data centers. > > Additionally, Meta's TMO paper recently referenced a very similar > > interface used for user space proactive reclaim: > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > Benefits of a user space reclaimer: > > ----------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. 
This also > > gives more accurate and uptodate notion of working set for an > > application. > > > > Why memory.high is not enough? > > ------------------------------ > > > > - memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim. > > However there is a big downside in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > how much memory to proactively reclaim from a workload. The metrics > > used for this are usually either refaults or PSI, and these metrics > > will become messy if the application gets throttled by hitting the > > high limit. > > > > - memory.high is a stateful interface, if the userspace proactive > > reclaimer crashes for any reason while triggering reclaim it can leave > > the application in a bad state. > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > reclaim memory can result in actually reclaiming more memory than > > intended. > > > > The benefits of such interface and shortcomings of existing interface > > were further discussed in this RFC thread: > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > Hello! > > I'm totally up for the proposed feature! It makes total sense and is proved > to be useful, let's add it. > > > > > Interface: > > ---------- > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > > > Possible Extensions: > > -------------------- > > > > - This interface can be extended with an additional parameter or flags > > to allow specifying one or more types of memory to reclaim from (e.g. > > file, anon, ..). 
> >
> > - The interface can also be extended with a node mask to reclaim from
> >   specific nodes. This has use cases for reclaim-based demotion in memory
> >   tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
>
> Maybe an option to specify a timeout? That might simplify the userspace part.

A timeout is a good idea. I think it can be added as an extension,
similar to other extensions.

> Also, please please add a test to selftests/cgroup/memcg tests.
> It will also provide an example of how userspace can use the feature.

+1

> >
> > For now, let's keep things simple by adding the basic functionality.
>
> What I'm worried about is how we are going to extend it. How do you see the
> interface with 2-3 extensions from the list above? All these extensions look
> very reasonable to me, so we'll likely have to implement them soon. So let's
> think about the extensibility now.

For the first two extensions (flags and nodemask), they can be
implemented as additional positional arguments of memory.reclaim. The
non-memcg use cases will need a different interface, which can be
either a sysfs file or a syscall.

> I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> In the end, such a feature might make sense on the system level too.
> Yes, there is the drop_caches sysctl, but it's too radical for many cases.

sys_reclaim() syscall is a good proposal for non-memcg use cases. But
for memcg-based proactive reclaim, memory.reclaim should be more
natural. It is not common to have cgroup as a syscall argument.
> > > > [yosryahmed@google.com: refreshed to current master, updated commit > > message based on recent discussions and use cases] > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > 2 files changed, 46 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 69d7a6983f78..925aaabb2247 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.reclaim > > + A write-only file which exists on non-root cgroups. > > + > > + This is a simple interface to trigger memory reclaim in the > > + target cgroup. Write the number of bytes to reclaim to this > > + file and the kernel will try to reclaim that much memory. > > + Please note that the kernel can over or under reclaim from > > + the target cgroup. > > + > > memory.oom.group > > A read-write single value file which exists on non-root > > cgroups. The default value is "0". 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 725f76723220..994849fab7df 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int err; > > + > > + buf = strstrip(buf); > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > + if (err) > > + return err; > > + > > + while (nr_reclaimed < nr_to_reclaim) { > > + unsigned long reclaimed; > > + > > + if (signal_pending(current)) > > + break; > > + > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > + nr_to_reclaim - nr_reclaimed, > > + GFP_KERNEL, true); > > + > > + if (!reclaimed && !nr_retries--) > > + break; > > + > > + nr_reclaimed += reclaimed; > > + } > > + > > + return nbytes; > > +} > > + > > static struct cftype memory_files[] = { > > { > > .name = "current", > > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = { > > .seq_show = memory_oom_group_show, > > .write = memory_oom_group_write, > > }, > > + { > > + .name = "reclaim", > > + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, > > + .write = memory_reclaim, > > Btw, why not on root?
Yosry Ahmed <yosryahmed@google.com> writes:

> From: Shakeel Butt <shakeelb@google.com>
>
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.

<snip>

> +
> +        while (nr_reclaimed < nr_to_reclaim) {
> +                unsigned long reclaimed;
> +
> +                if (signal_pending(current))
> +                        break;
> +
> +                reclaimed = try_to_free_mem_cgroup_pages(memcg,
> +                                                nr_to_reclaim - nr_reclaimed,
> +                                                GFP_KERNEL, true);
> +
> +                if (!reclaimed && !nr_retries--)
> +                        break;
> +
> +                nr_reclaimed += reclaimed;

I think there should be a cond_resched() in this loop before
try_to_free_mem_cgroup_pages() to have better chances of reclaim
succeeding early.

<snip>
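To make the loop's behaviour easier to reason about, here is a rough user-space model of it, not kernel code. Everything in it is an assumption for illustration: `fake_reclaim()` and `struct fake_memcg` are hypothetical stand-ins for try_to_free_mem_cgroup_pages() and the memcg, `sched_yield()` only approximates where a cond_resched() would sit, the `signal_pending()` check is left out, and the value 16 for MAX_RECLAIM_RETRIES is assumed rather than taken from the patch.

```c
#include <sched.h>

/* Assumed value for illustration; the patch only uses the kernel's macro. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical stand-in for try_to_free_mem_cgroup_pages(). */
typedef unsigned long (*reclaim_fn)(unsigned long nr_pages, void *ctx);

/* Hypothetical model of a memcg: a pool of pages freed in fixed batches. */
struct fake_memcg {
        unsigned long reclaimable; /* pages that can still be freed */
        unsigned long batch;       /* pages freed per successful call */
        unsigned int calls;        /* number of reclaim attempts made */
};

static unsigned long fake_reclaim(unsigned long nr_pages, void *ctx)
{
        struct fake_memcg *m = ctx;
        unsigned long got = m->batch < m->reclaimable ? m->batch : m->reclaimable;

        (void)nr_pages; /* the fake ignores the request size, so it can over-reclaim */
        m->calls++;
        m->reclaimable -= got;
        return got;
}

/* Same control flow as the loop in the patch, minus signal_pending(). */
static unsigned long reclaim_loop(unsigned long nr_to_reclaim,
                                  reclaim_fn reclaim, void *ctx)
{
        unsigned int nr_retries = MAX_RECLAIM_RETRIES;
        unsigned long nr_reclaimed = 0;

        while (nr_reclaimed < nr_to_reclaim) {
                unsigned long reclaimed;

                sched_yield(); /* user-space stand-in for cond_resched() */

                reclaimed = reclaim(nr_to_reclaim - nr_reclaimed, ctx);

                /* Give up only after a zero result with no retries left. */
                if (!reclaimed && !nr_retries--)
                        break;

                nr_reclaimed += reclaimed;
        }

        return nr_reclaimed;
}
```

In this model a request can end above the target (batched progress) or below it (the pool runs dry and the retries are used up), which matches the "over or under reclaim" wording in the proposed documentation. The retry budget is not reset when an attempt makes progress.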
On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > > From: Shakeel Butt <shakeelb@google.com> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use case: Proactive Reclaim > > --------------------------- > > > > A userspace proactive reclaimer can continuously probe the memcg to > > reclaim a small amount of memory. This gives more accurate and > > up-to-date workingset estimation as the LRUs are continuously > > sorted and can potentially provide more deterministic memory > > overcommit behavior. The memory overcommit controller can provide > > more proactive response to the changing behavior of the running > > applications instead of being reactive. > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > for kswapd or direct reclaim, it is to proactively identify memory savings > > opportunities and reclaim some amount of cold pages set by the policy > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > A user space proactive reclaimer is used in Google data centers. > > Additionally, Meta's TMO paper recently referenced a very similar > > interface used for user space proactive reclaim: > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > Benefits of a user space reclaimer: > > ----------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. 
This also > > gives more accurate and uptodate notion of working set for an > > application. > > > > Why memory.high is not enough? > > ------------------------------ > > > > - memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim. > > However there is a big downside in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > how much memory to proactively reclaim from a workload. The metrics > > used for this are usually either refaults or PSI, and these metrics > > will become messy if the application gets throttled by hitting the > > high limit. > > > > - memory.high is a stateful interface, if the userspace proactive > > reclaimer crashes for any reason while triggering reclaim it can leave > > the application in a bad state. > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > reclaim memory can result in actually reclaiming more memory than > > intended. > > > > The benefits of such interface and shortcomings of existing interface > > were further discussed in this RFC thread: > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > Hello! > > I'm totally up for the proposed feature! It makes total sense and is proved > to be useful, let's add it. > > > > > Interface: > > ---------- > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > > > Possible Extensions: > > -------------------- > > > > - This interface can be extended with an additional parameter or flags > > to allow specifying one or more types of memory to reclaim from (e.g. > > file, anon, ..). 
> >
> > - The interface can also be extended with a node mask to reclaim from
> > specific nodes. This has use cases for reclaim-based demotion in memory
> > tiering systems.
> >
> > - A similar per-node interface can also be added to support proactive
> > reclaim and reclaim-based demotion in systems without memcg.
>
> Maybe an option to specify a timeout? That might simplify the userspace part.
> Also, please please add a test to selftests/cgroup/memcg tests.
> It will also provide an example on how the userspace can use the feature.
>

Hi Roman, thanks for taking the time to review this!

A timeout would be a good extension; I will add it to the list of
possible extensions in the commit message of the next version.

I will add a test in v2, thanks!

> >
> > For now, let's keep things simple by adding the basic functionality.
>
> What I'm worried about is how we gonna extend it? How do you see the interface
> with 2-3 extensions from the list above? All these extensions look very
> reasonable to me, so we'll likely have to implement them soon. So let's think
> about the extensibility now.
>

My idea is to have these extensions as optional positional arguments
(like Wei suggested), so that the interface does not get too
complicated for users who don't care about tuning these options. If
this is the case, then I think there is nothing to worry about.
Otherwise, if you think some of these options should be required
arguments instead, we can rethink the initial interface.

> I wonder if it makes more sense to introduce a sys_reclaim() syscall instead?
> In the end, such a feature might make sense on the system level too.
> Yes, there is the drop_caches sysctl, but it's too radical for many cases.
>

I think in the RFC discussion there was consensus to add both a
per-memcg knob and, later, per-node / per-system knobs (through
sysfs or syscalls). Wei also points out that it's not common for a
syscall to have a cgroup argument.
> > > > [yosryahmed@google.com: refreshed to current master, updated commit > > message based on recent discussions and use cases] > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > 2 files changed, 46 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 69d7a6983f78..925aaabb2247 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.reclaim > > + A write-only file which exists on non-root cgroups. > > + > > + This is a simple interface to trigger memory reclaim in the > > + target cgroup. Write the number of bytes to reclaim to this > > + file and the kernel will try to reclaim that much memory. > > + Please note that the kernel can over or under reclaim from > > + the target cgroup. > > + > > memory.oom.group > > A read-write single value file which exists on non-root > > cgroups. The default value is "0". 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 725f76723220..994849fab7df 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int err; > > + > > + buf = strstrip(buf); > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > + if (err) > > + return err; > > + > > + while (nr_reclaimed < nr_to_reclaim) { > > + unsigned long reclaimed; > > + > > + if (signal_pending(current)) > > + break; > > + > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > + nr_to_reclaim - nr_reclaimed, > > + GFP_KERNEL, true); > > + > > + if (!reclaimed && !nr_retries--) > > + break; > > + > > + nr_reclaimed += reclaimed; > > + } > > + > > + return nbytes; > > +} > > + > > static struct cftype memory_files[] = { > > { > > .name = "current", > > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = { > > .seq_show = memory_oom_group_show, > > .write = memory_oom_group_write, > > }, > > + { > > + .name = "reclaim", > > + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, > > + .write = memory_reclaim, > > Btw, why not on root?
On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:

<snip>

> > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
> >                  .seq_show = memory_oom_group_show,
> >                  .write = memory_oom_group_write,
> >          },
> > +        {
> > +                .name = "reclaim",
> > +                .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > +                .write = memory_reclaim,
>
> Btw, why not on root?
I missed the root question in my first reply. I think this was originally modeled after the memory.high interface, but I don't know if there are other reasons. Shakeel would know better. AFAIK this should work naturally on root as well, but I think it would then make more sense to use a global interface (hopefully introduced soon)? I don't have an opinion here; let me know what you prefer for v2.
On Thu, Mar 31, 2022 at 8:38 PM Wei Xu <weixugc@google.com> wrote:
>
> On Thu, Mar 31, 2022 at 5:33 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
> > >          return nbytes;
> > >  }
> > >
> > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > > +                              size_t nbytes, loff_t off)
> > > +{
> > > +        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > > +        unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > > +        unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > +        int err;
> > > +
> > > +        buf = strstrip(buf);
> > > +        err = page_counter_memparse(buf, "", &nr_to_reclaim);
> > > +        if (err)
> > > +                return err;
> > > +
> > > +        while (nr_reclaimed < nr_to_reclaim) {
> > > +                unsigned long reclaimed;
> > > +
> > > +                if (signal_pending(current))
> > > +                        break;
> > > +
> > > +                reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > +                                                nr_to_reclaim - nr_reclaimed,
> > > +                                                GFP_KERNEL, true);
> > > +
> > > +                if (!reclaimed && !nr_retries--)
> > > +                        break;
> > > +
> > > +                nr_reclaimed += reclaimed;
> > > +        }
> >
> > Is there any way in which this can be provoked into triggering the
> > softlockup detector?
>
> memory.reclaim is similar to memory.high w.r.t. reclaiming memory,
> except that memory.reclaim is stateless, while the kernel remembers
> the state set by memory.high. So memory.reclaim should not bring in
> any new risks of triggering soft lockup, if any.
>
> > Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel?
> > Would additional flexibility be gained by letting userspace handle
> > retrying?
>
> I agree it is better to retry from the userspace.

Thanks Andrew and Wei for looking at this. IIUC the MAX_RECLAIM_RETRIES
loop was modeled after the loop in memory.high as well.
Is there a reason why it should be different here?
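For comparison, retrying from userspace could look roughly like the sketch below. It is only an illustration: the helper names `request_reclaim()` and `reclaim_with_retries()` are hypothetical, as is any cgroup path passed in. With the patch as posted a write returns nbytes even when little was reclaimed, so the retry condition used here (an error return of EINTR, EAGAIN or EBUSY) assumes a later revision that reports failure through the write's return value.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Write one reclaim request (e.g. "10M") to a memory.reclaim-style file.
 * Returns 0 on success or a negative errno value.
 */
static int request_reclaim(const char *path, const char *amount)
{
        size_t len = strlen(amount);
        ssize_t ret;
        int fd, err = 0;

        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -errno;

        ret = write(fd, amount, len);
        if (ret < 0)
                err = -errno;
        else if ((size_t)ret != len)
                err = -EIO; /* short write: treat as a failed request */

        close(fd);
        return err;
}

/*
 * User-space retry policy (hypothetical): try up to max_tries times,
 * sleeping delay_ms between attempts, and retry only on errors that
 * suggest a transient failure. Returns 0 or the last negative errno.
 */
static int reclaim_with_retries(const char *path, const char *amount,
                                int max_tries, unsigned int delay_ms)
{
        struct timespec ts = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };
        int i, err = -EINVAL;

        for (i = 0; i < max_tries; i++) {
                err = request_reclaim(path, amount);
                if (!err)
                        return 0;
                if (err != -EINTR && err != -EAGAIN && err != -EBUSY)
                        break; /* not transient: do not retry */
                nanosleep(&ts, NULL);
        }

        return err;
}
```

The point of the sketch is that the number of attempts and the delay between them become policy owned by the reclaimer rather than a constant compiled into the kernel.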
On Thu, Mar 31, 2022 at 8:05 PM Chen Wandun <chenwandun@huawei.com> wrote: > > > > 在 2022/3/31 16:41, Yosry Ahmed 写道: > > From: Shakeel Butt <shakeelb@google.com> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use case: Proactive Reclaim > > --------------------------- > > > > A userspace proactive reclaimer can continuously probe the memcg to > > reclaim a small amount of memory. This gives more accurate and > > up-to-date workingset estimation as the LRUs are continuously > > sorted and can potentially provide more deterministic memory > > overcommit behavior. The memory overcommit controller can provide > > more proactive response to the changing behavior of the running > > applications instead of being reactive. > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > for kswapd or direct reclaim, it is to proactively identify memory savings > > opportunities and reclaim some amount of cold pages set by the policy > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > A user space proactive reclaimer is used in Google data centers. > > Additionally, Meta's TMO paper recently referenced a very similar > > interface used for user space proactive reclaim: > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > Benefits of a user space reclaimer: > > ----------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. 
> > > > Why memory.high is not enough? > > ------------------------------ > > > > - memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim. > > However there is a big downside in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > how much memory to proactively reclaim from a workload. The metrics > > used for this are usually either refaults or PSI, and these metrics > > will become messy if the application gets throttled by hitting the > > high limit. > > > > - memory.high is a stateful interface, if the userspace proactive > > reclaimer crashes for any reason while triggering reclaim it can leave > > the application in a bad state. > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > reclaim memory can result in actually reclaiming more memory than > > intended. > > > > The benefits of such interface and shortcomings of existing interface > > were further discussed in this RFC thread: > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > Interface: > > ---------- > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > > > Possible Extensions: > > -------------------- > > > > - This interface can be extended with an additional parameter or flags > > to allow specifying one or more types of memory to reclaim from (e.g. > > file, anon, ..). > > > > - The interface can also be extended with a node mask to reclaim from > > specific nodes. This has use cases for reclaim-based demotion in memory > > tiering systens. 
> > > > - A similar per-node interface can also be added to support proactive > > reclaim and reclaim-based demotion in systems without memcg. > > > > For now, let's keep things simple by adding the basic functionality. > > > > [yosryahmed@google.com: refreshed to current master, updated commit > > message based on recent discussions and use cases] > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > 2 files changed, 46 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 69d7a6983f78..925aaabb2247 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.reclaim > > + A write-only file which exists on non-root cgroups. > > + > > + This is a simple interface to trigger memory reclaim in the > > + target cgroup. Write the number of bytes to reclaim to this > > + file and the kernel will try to reclaim that much memory. > > + Please note that the kernel can over or under reclaim from > > + the target cgroup. > > + > > memory.oom.group > > A read-write single value file which exists on non-root > > cgroups. The default value is "0". 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 725f76723220..994849fab7df 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int err; > > + > > + buf = strstrip(buf); > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > + if (err) > > + return err; > > + > > + while (nr_reclaimed < nr_to_reclaim) { > > + unsigned long reclaimed; > > + > > + if (signal_pending(current)) > > + break; > > + > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > + nr_to_reclaim - nr_reclaimed, > > + GFP_KERNEL, true); > In some scenario there are lots of page cache, and we only want to > reclaim page cache, > how about add may_swap option? Thanks for taking a look at this! The first listed extension is an argument/flags to specify the type of memory that we want to reclaim, I think this covers this use case, or am I missing something? > > + > > + if (!reclaimed && !nr_retries--) > > + break; > > + > > + nr_reclaimed += reclaimed; > > + } > > + > > + return nbytes; > > +} > > + > > static struct cftype memory_files[] = { > > { > > .name = "current", > > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = { > > .seq_show = memory_oom_group_show, > > .write = memory_oom_group_write, > > }, > > + { > > + .name = "reclaim", > > + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, > > + .write = memory_reclaim, > > + }, > > { } /* terminate */ > > }; > > >
On Thu, Mar 31, 2022 at 9:05 PM Wei Xu <weixugc@google.com> wrote: > > On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > From: Shakeel Butt <shakeelb@google.com> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use case: Proactive Reclaim > > --------------------------- > > > > A userspace proactive reclaimer can continuously probe the memcg to > > reclaim a small amount of memory. This gives more accurate and > > up-to-date workingset estimation as the LRUs are continuously > > sorted and can potentially provide more deterministic memory > > overcommit behavior. The memory overcommit controller can provide > > more proactive response to the changing behavior of the running > > applications instead of being reactive. > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > for kswapd or direct reclaim, it is to proactively identify memory savings > > opportunities and reclaim some amount of cold pages set by the policy > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > A user space proactive reclaimer is used in Google data centers. > > Additionally, Meta's TMO paper recently referenced a very similar > > interface used for user space proactive reclaim: > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > Benefits of a user space reclaimer: > > ----------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. 
This also > > gives more accurate and uptodate notion of working set for an > > application. > > > > Why memory.high is not enough? > > ------------------------------ > > > > - memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim. > > However there is a big downside in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > how much memory to proactively reclaim from a workload. The metrics > > used for this are usually either refaults or PSI, and these metrics > > will become messy if the application gets throttled by hitting the > > high limit. > > > > - memory.high is a stateful interface, if the userspace proactive > > reclaimer crashes for any reason while triggering reclaim it can leave > > the application in a bad state. > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > reclaim memory can result in actually reclaiming more memory than > > intended. > > > > The benefits of such interface and shortcomings of existing interface > > were further discussed in this RFC thread: > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > Interface: > > ---------- > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > > > Possible Extensions: > > -------------------- > > > > - This interface can be extended with an additional parameter or flags > > to allow specifying one or more types of memory to reclaim from (e.g. > > file, anon, ..). > > > > - The interface can also be extended with a node mask to reclaim from > > specific nodes. This has use cases for reclaim-based demotion in memory > > tiering systens. 
> > > > - A similar per-node interface can also be added to support proactive > > reclaim and reclaim-based demotion in systems without memcg. > > > > For now, let's keep things simple by adding the basic functionality. > > > > [yosryahmed@google.com: refreshed to current master, updated commit > > message based on recent discussions and use cases] > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > 2 files changed, 46 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 69d7a6983f78..925aaabb2247 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.reclaim > > + A write-only file which exists on non-root cgroups. > > + > > + This is a simple interface to trigger memory reclaim in the > > + target cgroup. Write the number of bytes to reclaim to this > > + file and the kernel will try to reclaim that much memory. > > + Please note that the kernel can over or under reclaim from > > + the target cgroup. > > + > > memory.oom.group > > A read-write single value file which exists on non-root > > cgroups. The default value is "0". 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 725f76723220..994849fab7df 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int err; > > + > > + buf = strstrip(buf); > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > + if (err) > > + return err; > > + > > + while (nr_reclaimed < nr_to_reclaim) { > > + unsigned long reclaimed; > > + > > + if (signal_pending(current)) > > + break; > > + > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > + nr_to_reclaim - nr_reclaimed, > > + GFP_KERNEL, true); > > + > > + if (!reclaimed && !nr_retries--) > > + break; > > + > > + nr_reclaimed += reclaimed; > > + } > > + > > + return nbytes; > > It is better to return an error code (e.g. -EBUSY) when > memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory, > except if the cgroup memory usage is already 0. We can also return > -EINVAL if nr_to_reclaim is too large (e.g. > limit). IIUC this interface is modeled after memory.high, which returns nbytes as well. If you think it's better here to do this instead of maintaining consistency with memory.high we can certainly do this for v2. > > > +} > > + > > static struct cftype memory_files[] = { > > { > > .name = "current", > > @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = { > > .seq_show = memory_oom_group_show, > > .write = memory_oom_group_write, > > }, > > + { > > + .name = "reclaim", > > + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, > > + .write = memory_reclaim, > > + }, > > { } /* terminate */ > > }; > > > > -- > > 2.35.1.1021.g381101b075-goog > >
On Fri, Apr 1, 2022 at 1:39 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>
>
> Yosry Ahmed <yosryahmed@google.com> writes:
> > From: Shakeel Butt <shakeelb@google.com>
> >
> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> <snip>
>
> > +
> > +	while (nr_reclaimed < nr_to_reclaim) {
> > +		unsigned long reclaimed;
> > +
> > +		if (signal_pending(current))
> > +			break;
> > +
> > +		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > +						nr_to_reclaim - nr_reclaimed,
> > +						GFP_KERNEL, true);
> > +
> > +		if (!reclaimed && !nr_retries--)
> > +			break;
> > +
> > +		nr_reclaimed += reclaimed;
>
> I think there should be a cond_resched() in this loop before
> try_to_free_mem_cgroup_pages() to have better chances of reclaim
> succeeding early.
>

Thanks for taking the time to look at this!

I believe this loop is modeled after the loop in memory_high_write()
for the memory.high interface. Is there a reason why it should be
needed here but not there?

> <snip>
>
> --
> Cheers
> ~ Vaibhav
On 2022/4/1 17:20, Yosry Ahmed wrote: > On Thu, Mar 31, 2022 at 8:05 PM Chen Wandun <chenwandun@huawei.com> wrote: >> >> >> On 2022/3/31 16:41, Yosry Ahmed wrote: >>> From: Shakeel Butt <shakeelb@google.com> >>> >>> Introduce an memcg interface to trigger memory reclaim on a memory cgroup. >>> >>> Use case: Proactive Reclaim >>> --------------------------- >>> >>> A userspace proactive reclaimer can continuously probe the memcg to >>> reclaim a small amount of memory. This gives more accurate and >>> up-to-date workingset estimation as the LRUs are continuously >>> sorted and can potentially provide more deterministic memory >>> overcommit behavior. The memory overcommit controller can provide >>> more proactive response to the changing behavior of the running >>> applications instead of being reactive. >>> >>> A userspace reclaimer's purpose in this case is not a complete replacement >>> for kswapd or direct reclaim, it is to proactively identify memory savings >>> opportunities and reclaim some amount of cold pages set by the policy >>> to free up the memory for more demanding jobs or scheduling new jobs. >>> >>> A user space proactive reclaimer is used in Google data centers. >>> Additionally, Meta's TMO paper recently referenced a very similar >>> interface used for user space proactive reclaim: >>> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 >>> >>> Benefits of a user space reclaimer: >>> ----------------------------------- >>> >>> 1) More flexible on who should be charged for the cpu of the memory >>> reclaim. For proactive reclaim, it makes more sense to be centralized. >>> >>> 2) More flexible on dedicating the resources (like cpu). The memory >>> overcommit controller can balance the cost between the cpu usage and >>> the memory reclaimed. >>> >>> 3) Provides a way to the applications to keep their LRUs sorted, so, >>> under memory pressure better reclaim candidates are selected. 
This also >>> gives more accurate and uptodate notion of working set for an >>> application. >>> >>> Why memory.high is not enough? >>> ------------------------------ >>> >>> - memory.high can be used to trigger reclaim in a memcg and can >>> potentially be used for proactive reclaim. >>> However there is a big downside in using memory.high. It can potentially >>> introduce high reclaim stalls in the target application as the >>> allocations from the processes or the threads of the application can hit >>> the temporary memory.high limit. >>> >>> - Userspace proactive reclaimers usually use feedback loops to decide >>> how much memory to proactively reclaim from a workload. The metrics >>> used for this are usually either refaults or PSI, and these metrics >>> will become messy if the application gets throttled by hitting the >>> high limit. >>> >>> - memory.high is a stateful interface, if the userspace proactive >>> reclaimer crashes for any reason while triggering reclaim it can leave >>> the application in a bad state. >>> >>> - If a workload is rapidly expanding, setting memory.high to proactively >>> reclaim memory can result in actually reclaiming more memory than >>> intended. >>> >>> The benefits of such interface and shortcomings of existing interface >>> were further discussed in this RFC thread: >>> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ >>> >>> Interface: >>> ---------- >>> >>> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to >>> trigger reclaim in the target memory cgroup. >>> >>> >>> Possible Extensions: >>> -------------------- >>> >>> - This interface can be extended with an additional parameter or flags >>> to allow specifying one or more types of memory to reclaim from (e.g. >>> file, anon, ..). >>> >>> - The interface can also be extended with a node mask to reclaim from >>> specific nodes. This has use cases for reclaim-based demotion in memory >>> tiering systens. 
>>> >>> - A similar per-node interface can also be added to support proactive >>> reclaim and reclaim-based demotion in systems without memcg. >>> >>> For now, let's keep things simple by adding the basic functionality. >>> >>> [yosryahmed@google.com: refreshed to current master, updated commit >>> message based on recent discussions and use cases] >>> Signed-off-by: Shakeel Butt <shakeelb@google.com> >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >>> --- >>> Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ >>> mm/memcontrol.c | 37 +++++++++++++++++++++++++ >>> 2 files changed, 46 insertions(+) >>> >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst >>> index 69d7a6983f78..925aaabb2247 100644 >>> --- a/Documentation/admin-guide/cgroup-v2.rst >>> +++ b/Documentation/admin-guide/cgroup-v2.rst >>> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. >>> high limit is used and monitored properly, this limit's >>> utility is limited to providing the final safety net. >>> >>> + memory.reclaim >>> + A write-only file which exists on non-root cgroups. >>> + >>> + This is a simple interface to trigger memory reclaim in the >>> + target cgroup. Write the number of bytes to reclaim to this >>> + file and the kernel will try to reclaim that much memory. >>> + Please note that the kernel can over or under reclaim from >>> + the target cgroup. >>> + >>> memory.oom.group >>> A read-write single value file which exists on non-root >>> cgroups. The default value is "0". 
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>> index 725f76723220..994849fab7df 100644 >>> --- a/mm/memcontrol.c >>> +++ b/mm/memcontrol.c >>> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, >>> return nbytes; >>> } >>> >>> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, >>> + size_t nbytes, loff_t off) >>> +{ >>> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); >>> + unsigned int nr_retries = MAX_RECLAIM_RETRIES; >>> + unsigned long nr_to_reclaim, nr_reclaimed = 0; >>> + int err; >>> + >>> + buf = strstrip(buf); >>> + err = page_counter_memparse(buf, "", &nr_to_reclaim); >>> + if (err) >>> + return err; >>> + >>> + while (nr_reclaimed < nr_to_reclaim) { >>> + unsigned long reclaimed; >>> + >>> + if (signal_pending(current)) >>> + break; >>> + >>> + reclaimed = try_to_free_mem_cgroup_pages(memcg, >>> + nr_to_reclaim - nr_reclaimed, >>> + GFP_KERNEL, true); >> In some scenario there are lots of page cache, and we only want to >> reclaim page cache, >> how about add may_swap option? > Thanks for taking a look at this! > > The first listed extension is an argument/flags to specify the type of Do you mean nbytes in memory_reclaim()? It decides the amount of memory to reclaim. One more argument, such as may_swap, can be added to memory_reclaim() and passed to try_to_free_mem_cgroup_pages() to replace the default "true". Thanks. > memory that we want to reclaim, I think this covers this use case, or > am I missing something? 
> >>> + >>> + if (!reclaimed && !nr_retries--) >>> + break; >>> + >>> + nr_reclaimed += reclaimed; >>> + } >>> + >>> + return nbytes; >>> +} >>> + >>> static struct cftype memory_files[] = { >>> { >>> .name = "current", >>> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = { >>> .seq_show = memory_oom_group_show, >>> .write = memory_oom_group_write, >>> }, >>> + { >>> + .name = "reclaim", >>> + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, >>> + .write = memory_reclaim, >>> + }, >>> { } /* terminate */ >>> }; >>> > .
On Fri, Apr 1, 2022 at 2:49 AM Chen Wandun <chenwandun@huawei.com> wrote: > > > > On 2022/4/1 17:20, Yosry Ahmed wrote: > > On Thu, Mar 31, 2022 at 8:05 PM Chen Wandun <chenwandun@huawei.com> wrote: > >> > >> > >> On 2022/3/31 16:41, Yosry Ahmed wrote: > >>> From: Shakeel Butt <shakeelb@google.com> > >>> > >>> Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > >>> > >>> Use case: Proactive Reclaim > >>> --------------------------- > >>> > >>> A userspace proactive reclaimer can continuously probe the memcg to > >>> reclaim a small amount of memory. This gives more accurate and > >>> up-to-date workingset estimation as the LRUs are continuously > >>> sorted and can potentially provide more deterministic memory > >>> overcommit behavior. The memory overcommit controller can provide > >>> more proactive response to the changing behavior of the running > >>> applications instead of being reactive. > >>> > >>> A userspace reclaimer's purpose in this case is not a complete replacement > >>> for kswapd or direct reclaim, it is to proactively identify memory savings > >>> opportunities and reclaim some amount of cold pages set by the policy > >>> to free up the memory for more demanding jobs or scheduling new jobs. > >>> > >>> A user space proactive reclaimer is used in Google data centers. > >>> Additionally, Meta's TMO paper recently referenced a very similar > >>> interface used for user space proactive reclaim: > >>> https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > >>> > >>> Benefits of a user space reclaimer: > >>> ----------------------------------- > >>> > >>> 1) More flexible on who should be charged for the cpu of the memory > >>> reclaim. For proactive reclaim, it makes more sense to be centralized. > >>> > >>> 2) More flexible on dedicating the resources (like cpu). The memory > >>> overcommit controller can balance the cost between the cpu usage and > >>> the memory reclaimed. 
> >>> > >>> 3) Provides a way to the applications to keep their LRUs sorted, so, > >>> under memory pressure better reclaim candidates are selected. This also > >>> gives more accurate and uptodate notion of working set for an > >>> application. > >>> > >>> Why memory.high is not enough? > >>> ------------------------------ > >>> > >>> - memory.high can be used to trigger reclaim in a memcg and can > >>> potentially be used for proactive reclaim. > >>> However there is a big downside in using memory.high. It can potentially > >>> introduce high reclaim stalls in the target application as the > >>> allocations from the processes or the threads of the application can hit > >>> the temporary memory.high limit. > >>> > >>> - Userspace proactive reclaimers usually use feedback loops to decide > >>> how much memory to proactively reclaim from a workload. The metrics > >>> used for this are usually either refaults or PSI, and these metrics > >>> will become messy if the application gets throttled by hitting the > >>> high limit. > >>> > >>> - memory.high is a stateful interface, if the userspace proactive > >>> reclaimer crashes for any reason while triggering reclaim it can leave > >>> the application in a bad state. > >>> > >>> - If a workload is rapidly expanding, setting memory.high to proactively > >>> reclaim memory can result in actually reclaiming more memory than > >>> intended. > >>> > >>> The benefits of such interface and shortcomings of existing interface > >>> were further discussed in this RFC thread: > >>> https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > >>> > >>> Interface: > >>> ---------- > >>> > >>> Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > >>> trigger reclaim in the target memory cgroup. 
> >>> > >>> > >>> Possible Extensions: > >>> -------------------- > >>> > >>> - This interface can be extended with an additional parameter or flags > >>> to allow specifying one or more types of memory to reclaim from (e.g. > >>> file, anon, ..). > >>> > >>> - The interface can also be extended with a node mask to reclaim from > >>> specific nodes. This has use cases for reclaim-based demotion in memory > >>> tiering systens. > >>> > >>> - A similar per-node interface can also be added to support proactive > >>> reclaim and reclaim-based demotion in systems without memcg. > >>> > >>> For now, let's keep things simple by adding the basic functionality. > >>> > >>> [yosryahmed@google.com: refreshed to current master, updated commit > >>> message based on recent discussions and use cases] > >>> Signed-off-by: Shakeel Butt <shakeelb@google.com> > >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > >>> --- > >>> Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > >>> mm/memcontrol.c | 37 +++++++++++++++++++++++++ > >>> 2 files changed, 46 insertions(+) > >>> > >>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > >>> index 69d7a6983f78..925aaabb2247 100644 > >>> --- a/Documentation/admin-guide/cgroup-v2.rst > >>> +++ b/Documentation/admin-guide/cgroup-v2.rst > >>> @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > >>> high limit is used and monitored properly, this limit's > >>> utility is limited to providing the final safety net. > >>> > >>> + memory.reclaim > >>> + A write-only file which exists on non-root cgroups. > >>> + > >>> + This is a simple interface to trigger memory reclaim in the > >>> + target cgroup. Write the number of bytes to reclaim to this > >>> + file and the kernel will try to reclaim that much memory. > >>> + Please note that the kernel can over or under reclaim from > >>> + the target cgroup. 
> >>> + > >>> memory.oom.group > >>> A read-write single value file which exists on non-root > >>> cgroups. The default value is "0". > >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c > >>> index 725f76723220..994849fab7df 100644 > >>> --- a/mm/memcontrol.c > >>> +++ b/mm/memcontrol.c > >>> @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > >>> return nbytes; > >>> } > >>> > >>> +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > >>> + size_t nbytes, loff_t off) > >>> +{ > >>> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > >>> + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > >>> + unsigned long nr_to_reclaim, nr_reclaimed = 0; > >>> + int err; > >>> + > >>> + buf = strstrip(buf); > >>> + err = page_counter_memparse(buf, "", &nr_to_reclaim); > >>> + if (err) > >>> + return err; > >>> + > >>> + while (nr_reclaimed < nr_to_reclaim) { > >>> + unsigned long reclaimed; > >>> + > >>> + if (signal_pending(current)) > >>> + break; > >>> + > >>> + reclaimed = try_to_free_mem_cgroup_pages(memcg, > >>> + nr_to_reclaim - nr_reclaimed, > >>> + GFP_KERNEL, true); > >> In some scenario there are lots of page cache, and we only want to > >> reclaim page cache, > >> how about add may_swap option? > > Thanks for taking a look at this! > > > > The first listed extension is an argument/flags to specify the type of > do you mean nbytes in memory_reclaim? it decide the amount of memory > to reclaim. > > one more argument such as may_swap can be add into memory_reclaim, and > pass this argument to try_to_free_mem_cgroup_pages in order to replace the > default "true" > > Thanks. I agree about the need for a may_swap or similar argument. In the commit message I list some possible extensions to this interface, and the first one is to add an argument to specify the type of memory we want to reclaim using the interface (anon, file, ..), which I think covers this use case. 
I just think we should add this in a separate patch as an extension. > > > memory that we want to reclaim, I think this covers this use case, or > > am I missing something? > > > >>> + > >>> + if (!reclaimed && !nr_retries--) > >>> + break; > >>> + > >>> + nr_reclaimed += reclaimed; > >>> + } > >>> + > >>> + return nbytes; > >>> +} > >>> + > >>> static struct cftype memory_files[] = { > >>> { > >>> .name = "current", > >>> @@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = { > >>> .seq_show = memory_oom_group_show, > >>> .write = memory_oom_group_write, > >>> }, > >>> + { > >>> + .name = "reclaim", > >>> + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, > >>> + .write = memory_reclaim, > >>> + }, > >>> { } /* terminate */ > >>> }; > >>> > > . >
On Fri 01-04-22 02:17:28, Yosry Ahmed wrote: > On Thu, Mar 31, 2022 at 8:38 PM Wei Xu <weixugc@google.com> wrote: > > > > On Thu, Mar 31, 2022 at 5:33 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > > On Thu, 31 Mar 2022 08:41:51 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > > --- a/mm/memcontrol.c > > > > +++ b/mm/memcontrol.c > > > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > > > return nbytes; > > > > } > > > > > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > > > + size_t nbytes, loff_t off) > > > > +{ > > > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > > > + int err; > > > > + > > > > + buf = strstrip(buf); > > > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > > > + if (err) > > > > + return err; > > > > + > > > > + while (nr_reclaimed < nr_to_reclaim) { > > > > + unsigned long reclaimed; > > > > + > > > > + if (signal_pending(current)) > > > > + break; > > > > + > > > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > > > + nr_to_reclaim - nr_reclaimed, > > > > + GFP_KERNEL, true); > > > > + > > > > + if (!reclaimed && !nr_retries--) > > > > + break; > > > > + > > > > + nr_reclaimed += reclaimed; > > > > + } > > > > > > Is there any way in which this can be provoked into triggering the > > > softlockup detector? > > > > memory.reclaim is similar to memory.high w.r.t. reclaiming memory, > > except that memory.reclaim is stateless, while the kernel remembers > > the state set by memory.high. So memory.reclaim should not bring in > > any new risks of triggering soft lockup, if any. Memory reclaim already has cond_resched even if there is nothing reclaimable. See shrink_node_memcgs > > > Is it optimal to do the MAX_RECLAIM_RETRIES loop in the kernel? 
> > > Would additional flexibility be gained by letting userspace handle
> > > retrying?
> >
> > I agree it is better to retry from the userspace.
>
> Thanks Andrew and Wei for looking at this. IIUC the
> MAX_RECLAIM_RETRIES loop was modeled after the loop in memory.high as
> well. Is there a reason why it should be different here?

No, I would go with the same approach other interfaces use. I am not a
great fan of MAX_RECLAIM_RETRIES - especially when we have a bail out on
signals - but if we are to change this then let's do it consistently.
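For illustration only, and not part of the patch: if retrying were left to userspace, as Andrew and Wei suggest, a reclaimer could drive memory.reclaim in small chunks and judge progress from memory.current. The sketch below assumes a cgroup v2 directory that contains both files. The helper names (read_u64, write_u64, reclaim_in_chunks) and the chunk/failure policy are hypothetical, and the before/after delta is only an approximation because the workload can allocate or free memory concurrently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one unsigned decimal value (such as memory.current) from a file.
 * Returns 0 on success or a negative errno value. */
static int read_u64(const char *path, unsigned long long *val)
{
	char buf[32], *end;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;
	n = read(fd, buf, sizeof(buf) - 1);
	if (n < 0) {
		int err = errno;

		close(fd);
		return -err;
	}
	close(fd);
	buf[n] = '\0';
	errno = 0;
	*val = strtoull(buf, &end, 10);
	return (errno || end == buf) ? -EINVAL : 0;
}

/* Write a decimal byte count to a file with a single write(2) call. */
static int write_u64(const char *path, unsigned long long val)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%llu", val);
	int fd = open(path, O_WRONLY);
	int err = 0;

	if (fd < 0)
		return -errno;
	if (write(fd, buf, (size_t)len) < 0)
		err = -errno;
	close(fd);
	return err;
}

/* Hypothetical userspace policy: request `chunk` bytes at a time until
 * roughly `target` bytes are gone, giving up after more than
 * `max_failures` attempts that show no drop in memory.current. */
static int reclaim_in_chunks(const char *cgroup_dir, unsigned long long target,
			     unsigned long long chunk, unsigned int max_failures,
			     unsigned long long *reclaimed)
{
	char cur_path[4096], rec_path[4096];
	unsigned long long before, after;
	unsigned int failures = 0;
	int n, err;

	assert(cgroup_dir && reclaimed && chunk > 0);
	n = snprintf(cur_path, sizeof(cur_path), "%s/memory.current", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(cur_path))
		return -ENAMETOOLONG;
	n = snprintf(rec_path, sizeof(rec_path), "%s/memory.reclaim", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(rec_path))
		return -ENAMETOOLONG;

	*reclaimed = 0;
	while (*reclaimed < target) {
		err = read_u64(cur_path, &before);
		if (!err)
			err = write_u64(rec_path, chunk);
		if (!err)
			err = read_u64(cur_path, &after);
		if (err)
			return err;
		if (before > after)
			*reclaimed += before - after;
		else if (++failures > max_failures)
			break;	/* no visible progress, stop probing */
	}
	return 0;
}
```

A caller might invoke reclaim_in_chunks(dir, 10 << 20, 1 << 20, 3, &got) for a cgroup directory of its choosing; whether a policy like this beats the in-kernel MAX_RECLAIM_RETRIES loop is exactly the open question in this subthread.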
On Thu 31-03-22 10:25:23, Roman Gushchin wrote:
> On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
[...]
> > - A similar per-node interface can also be added to support proactive
> >   reclaim and reclaim-based demotion in systems without memcg.
>
> Maybe an option to specify a timeout? That might simplify the userspace part.

What do you mean by timeout here? Isn't
timeout $N echo $RECLAIM > ....

enough?
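The shell one-liner above relies on the reclaim loop in the patch bailing out on signal_pending(current). A rough C equivalent, offered as a sketch rather than a recommended implementation, arms alarm(2) around the write. The names reclaim_with_timeout and on_alarm are hypothetical. Note that, with the patch as posted, a write interrupted this way is still expected to return nbytes, so the caller cannot tell a timeout from a completed request by the return value alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to make SIGALRM non-fatal so that the
 * signal becomes pending and the in-kernel reclaim loop can bail out. */
static void on_alarm(int sig)
{
	(void)sig;
}

/* Write `amount` (e.g. "10M") to `path` and deliver SIGALRM to this
 * process after `seconds` (0 disables the timeout).
 * Returns 0 if write(2) succeeded, or a negative errno value. */
static int reclaim_with_timeout(const char *path, const char *amount,
				unsigned int seconds)
{
	struct sigaction sa, old;
	int fd, err = 0;

	assert(path && amount);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;	/* no SA_RESTART on purpose */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, &old) < 0)
		return -errno;

	fd = open(path, O_WRONLY);
	if (fd < 0) {
		err = -errno;
		goto out;
	}

	alarm(seconds);
	if (write(fd, amount, strlen(amount)) < 0)
		err = -errno;
	alarm(0);			/* cancel a still-pending alarm */
	close(fd);
out:
	sigaction(SIGALRM, &old, NULL);	/* restore the previous handler */
	return err;
}
```

Unlike timeout(1), which terminates the child with SIGTERM, this variant keeps the calling process alive, which may suit a long-running reclaim daemon better.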
On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > From: Shakeel Butt <shakeelb@google.com> > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use case: Proactive Reclaim > --------------------------- > > A userspace proactive reclaimer can continuously probe the memcg to > reclaim a small amount of memory. This gives more accurate and > up-to-date workingset estimation as the LRUs are continuously > sorted and can potentially provide more deterministic memory > overcommit behavior. The memory overcommit controller can provide > more proactive response to the changing behavior of the running > applications instead of being reactive. > > A userspace reclaimer's purpose in this case is not a complete replacement > for kswapd or direct reclaim, it is to proactively identify memory savings > opportunities and reclaim some amount of cold pages set by the policy > to free up the memory for more demanding jobs or scheduling new jobs. > > A user space proactive reclaimer is used in Google data centers. > Additionally, Meta's TMO paper recently referenced a very similar > interface used for user space proactive reclaim: > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > Benefits of a user space reclaimer: > ----------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to be centralized. > > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. > > Why memory.high is not enough? 
> ------------------------------ > > - memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim. > However there is a big downside in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. > > - Userspace proactive reclaimers usually use feedback loops to decide > how much memory to proactively reclaim from a workload. The metrics > used for this are usually either refaults or PSI, and these metrics > will become messy if the application gets throttled by hitting the > high limit. > > - memory.high is a stateful interface, if the userspace proactive > reclaimer crashes for any reason while triggering reclaim it can leave > the application in a bad state. > > - If a workload is rapidly expanding, setting memory.high to proactively > reclaim memory can result in actually reclaiming more memory than > intended. > > The benefits of such interface and shortcomings of existing interface > were further discussed in this RFC thread: > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > Interface: > ---------- > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. > > > Possible Extensions: > -------------------- > > - This interface can be extended with an additional parameter or flags > to allow specifying one or more types of memory to reclaim from (e.g. > file, anon, ..). > > - The interface can also be extended with a node mask to reclaim from > specific nodes. This has use cases for reclaim-based demotion in memory > tiering systens. > > - A similar per-node interface can also be added to support proactive > reclaim and reclaim-based demotion in systems without memcg. > > For now, let's keep things simple by adding the basic functionality. 
Yes, I am for the simplicity and this really looks like a bare minimum
interface. But it is not really clear how you want to add flags on top
of it.

I am not really sure we really need a node aware interface for memcg.
The global reclaim interface will likely need a different node because
we do not want to make this CONFIG_MEMCG constrained.

> [yosryahmed@google.com: refreshed to current master, updated commit
> message based on recent discussions and use cases]
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

All that being said, I haven't been a great fan of explicit reclaim
triggered from userspace, but I do recognize that the limitations of
the existing interfaces are just too restrictive.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
On Thu, Mar 31, 2022 at 09:05:15PM -0700, Wei Xu wrote: > On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > From: Shakeel Butt <shakeelb@google.com> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use case: Proactive Reclaim > > --------------------------- > > > > A userspace proactive reclaimer can continuously probe the memcg to > > reclaim a small amount of memory. This gives more accurate and > > up-to-date workingset estimation as the LRUs are continuously > > sorted and can potentially provide more deterministic memory > > overcommit behavior. The memory overcommit controller can provide > > more proactive response to the changing behavior of the running > > applications instead of being reactive. > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > for kswapd or direct reclaim, it is to proactively identify memory savings > > opportunities and reclaim some amount of cold pages set by the policy > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > A user space proactive reclaimer is used in Google data centers. > > Additionally, Meta's TMO paper recently referenced a very similar > > interface used for user space proactive reclaim: > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > Benefits of a user space reclaimer: > > ----------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. 
> > > > Why memory.high is not enough? > > ------------------------------ > > > > - memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim. > > However there is a big downside in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > how much memory to proactively reclaim from a workload. The metrics > > used for this are usually either refaults or PSI, and these metrics > > will become messy if the application gets throttled by hitting the > > high limit. > > > > - memory.high is a stateful interface, if the userspace proactive > > reclaimer crashes for any reason while triggering reclaim it can leave > > the application in a bad state. > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > reclaim memory can result in actually reclaiming more memory than > > intended. > > > > The benefits of such interface and shortcomings of existing interface > > were further discussed in this RFC thread: > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > Interface: > > ---------- > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > > > Possible Extensions: > > -------------------- > > > > - This interface can be extended with an additional parameter or flags > > to allow specifying one or more types of memory to reclaim from (e.g. > > file, anon, ..). > > > > - The interface can also be extended with a node mask to reclaim from > > specific nodes. This has use cases for reclaim-based demotion in memory > > tiering systens. 
> > > > - A similar per-node interface can also be added to support proactive > > reclaim and reclaim-based demotion in systems without memcg. > > > > For now, let's keep things simple by adding the basic functionality. > > > > [yosryahmed@google.com: refreshed to current master, updated commit > > message based on recent discussions and use cases] > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > 2 files changed, 46 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 69d7a6983f78..925aaabb2247 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.reclaim > > + A write-only file which exists on non-root cgroups. > > + > > + This is a simple interface to trigger memory reclaim in the > > + target cgroup. Write the number of bytes to reclaim to this > > + file and the kernel will try to reclaim that much memory. > > + Please note that the kernel can over or under reclaim from > > + the target cgroup. > > + > > memory.oom.group > > A read-write single value file which exists on non-root > > cgroups. The default value is "0". 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 725f76723220..994849fab7df 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int err; > > + > > + buf = strstrip(buf); > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > + if (err) > > + return err; > > + > > + while (nr_reclaimed < nr_to_reclaim) { > > + unsigned long reclaimed; > > + > > + if (signal_pending(current)) > > + break; > > + > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > + nr_to_reclaim - nr_reclaimed, > > + GFP_KERNEL, true); > > + > > + if (!reclaimed && !nr_retries--) > > + break; > > + > > + nr_reclaimed += reclaimed; > > + } > > + > > + return nbytes; > > It is better to return an error code (e.g. -EBUSY) when > memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory, > except if the cgroup memory usage is already 0. We can also return > -EINVAL if nr_to_reclaim is too large (e.g. > limit). For -EBUSY, are you thinking of a specific usecase where that would come in handy? I'm not really opposed to it, but couldn't convince myself of the practical benefits of it, either. Keep in mind that MAX_RECLAIM_RETRIES failed reclaim attempts usually constitute an OOM situation: memory.max will issue kills and memory.high will begin crippling throttling. In what scenario would you want to keep reclaiming a workload that is considered OOM? Certainly, proactive reclaim that wants to purge only the cold tail of the workload wouldn't retry. 
Meta's version of this patch actually does return -EAGAIN on reclaim failure, but the userspace daemon doesn't do anything with it, so I didn't bring it up. For -EINVAL, I tend to lean more toward disagreeing. We've been trying to avoid arbitrary dependencies between control knobs in cgroup2, just because it exposes us to race conditions and adds complications to the interface. For example, it *usually* doesn't make sense to set limits to 0, or set local limits and protections higher than the parent. But we allow it anyway, to avoid creating well-intended linting rules that could interfere with somebody's unforeseen, legitimate usecase.
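To make the retry behaviour under discussion concrete, here is a small userspace model of the loop in the quoted memory_reclaim(), written in Python rather than kernel C. It is only a sketch: `try_reclaim` is a hypothetical stand-in for try_to_free_mem_cgroup_pages(), the value 16 for MAX_RECLAIM_RETRIES is an assumption about the kernel tree, and the signal_pending() check is left out.

```python
from typing import Callable, Tuple

# Assumption: the kernel defines MAX_RECLAIM_RETRIES as 16; adjust if your tree differs.
MAX_RECLAIM_RETRIES = 16


def model_memory_reclaim(nr_to_reclaim: int,
                         try_reclaim: Callable[[int], int]) -> Tuple[int, int]:
    """Model the loop in the quoted memory_reclaim().

    try_reclaim(nr) is a hypothetical stand-in for
    try_to_free_mem_cgroup_pages(); it returns the number of pages freed.
    Returns (nr_reclaimed, number_of_attempts).
    """
    nr_retries = MAX_RECLAIM_RETRIES
    nr_reclaimed = 0
    attempts = 0
    while nr_reclaimed < nr_to_reclaim:
        reclaimed = try_reclaim(nr_to_reclaim - nr_reclaimed)
        attempts += 1
        if not reclaimed:
            # Mirrors `!reclaimed && !nr_retries--`: the counter is only
            # consumed by attempts that made no progress.
            if nr_retries == 0:
                break
            nr_retries -= 1
        nr_reclaimed += reclaimed
    return nr_reclaimed, attempts
```

With a callback that never makes progress, the model gives up after MAX_RECLAIM_RETRIES + 1 attempts, and the patch as posted would still return nbytes to the writer. That is the case the -EBUSY question is about.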
On Fri, Apr 1, 2022 at 2:16 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> [...]
> > > + {
> > > + .name = "reclaim",
> > > + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
> > > + .write = memory_reclaim,
> >
> > Btw, why not on root?
>
> I missed the root question in my first reply. I think this was
> originally modeled after the memory.high interface, but I don't know
> if there are other reasons. Shakeel would know better.
>
> AFAIK this should work naturally on root as well, but I think it makes
> more sense then to use a global interface (hopefully introduced soon)?
> I don't have an opinion here, let me know what you prefer for v2.

We will follow the psi example, which is exposed for root as well as at the system level in procfs, but both of these (for memory.reclaim) are planned as follow-up features.
On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > > From: Shakeel Butt <shakeelb@google.com> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use case: Proactive Reclaim > > --------------------------- > > > > A userspace proactive reclaimer can continuously probe the memcg to > > reclaim a small amount of memory. This gives more accurate and > > up-to-date workingset estimation as the LRUs are continuously > > sorted and can potentially provide more deterministic memory > > overcommit behavior. The memory overcommit controller can provide > > more proactive response to the changing behavior of the running > > applications instead of being reactive. > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > for kswapd or direct reclaim, it is to proactively identify memory savings > > opportunities and reclaim some amount of cold pages set by the policy > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > A user space proactive reclaimer is used in Google data centers. > > Additionally, Meta's TMO paper recently referenced a very similar > > interface used for user space proactive reclaim: > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > Benefits of a user space reclaimer: > > ----------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. 
> > > > Why memory.high is not enough? > > ------------------------------ > > > > - memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim. > > However there is a big downside in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > how much memory to proactively reclaim from a workload. The metrics > > used for this are usually either refaults or PSI, and these metrics > > will become messy if the application gets throttled by hitting the > > high limit. > > > > - memory.high is a stateful interface, if the userspace proactive > > reclaimer crashes for any reason while triggering reclaim it can leave > > the application in a bad state. > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > reclaim memory can result in actually reclaiming more memory than > > intended. > > > > The benefits of such interface and shortcomings of existing interface > > were further discussed in this RFC thread: > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > Interface: > > ---------- > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > > > Possible Extensions: > > -------------------- > > > > - This interface can be extended with an additional parameter or flags > > to allow specifying one or more types of memory to reclaim from (e.g. > > file, anon, ..). > > > > - The interface can also be extended with a node mask to reclaim from > > specific nodes. This has use cases for reclaim-based demotion in memory > > tiering systens. 
> >
> > - A similar per-node interface can also be added to support proactive
> > reclaim and reclaim-based demotion in systems without memcg.
> >
> > For now, let's keep things simple by adding the basic functionality.
>
> Yes, I am for the simplicity and this really looks like a bare minimum
> interface. But it is not really clear how do you want to add flags on
> top of it?
>
> I am not really sure we really need a node aware interface for memcg.
> The global reclaim interface will likely need a different node because
> we do not want to make this CONFIG_MEMCG constrained.

A nodemask argument for memory.reclaim can be useful for memory tiering between NUMA nodes with different performance. Similar to proactive reclaim, it can allow a userspace daemon to drive memcg-based proactive demotion via the reclaim-based demotion mechanism in the kernel.

> > [yosryahmed@google.com: refreshed to current master, updated commit
> > message based on recent discussions and use cases]
> > Signed-off-by: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> All that being said. I haven't been a great fan for explicit reclaim
> triggered from the userspace but I do recognize that the limitations of the
> existing interfaces are just too restrictive.
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
On Fri, Apr 01, 2022 at 03:49:19PM +0200, Michal Hocko wrote:
> On Thu 31-03-22 10:25:23, Roman Gushchin wrote:
> > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> [...]
> > > - A similar per-node interface can also be added to support proactive
> > > reclaim and reclaim-based demotion in systems without memcg.
> >
> > Maybe an option to specify a timeout? That might simplify the userspace part.
>
> What do you mean by timeout here? Isn't
> timeout $N echo $RECLAIM > ....
>
> enough?

It's nice and simple when it's a bash script, but when a complex application tries to do the same, it quickly becomes less simple: it will likely require a dedicated thread to avoid blocking the main app for too long, plus a mechanism to unblock it by timer or when the need arises.

In my experience, using such semi-blocking interfaces correctly is tricky (semi- because it's not clearly defined how much time the syscall can take and whether it makes sense to wait longer).
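The "dedicated thread" pattern described above might look roughly like the following Python sketch. The names `reclaim_async` and `cgroup_dir` are hypothetical, and the sketch assumes memory.reclaim accepts an amount written as text, as in the posted patch. Note what it does not solve: waiting with a timeout only unblocks the caller, while the worker thread stays inside write() until the kernel returns.

```python
import os
from concurrent.futures import Future, ThreadPoolExecutor

# One worker so that at most one reclaim request is in flight at a time.
_reclaim_pool = ThreadPoolExecutor(max_workers=1)


def _write_reclaim(cgroup_dir: str, amount: str) -> int:
    """Blocking write of `amount` (e.g. "10M") to <cgroup_dir>/memory.reclaim."""
    path = os.path.join(cgroup_dir, "memory.reclaim")
    fd = os.open(path, os.O_WRONLY)
    try:
        return os.write(fd, amount.encode())
    finally:
        os.close(fd)


def reclaim_async(cgroup_dir: str, amount: str) -> Future:
    """Start the write on the worker thread and return immediately."""
    return _reclaim_pool.submit(_write_reclaim, cgroup_dir, amount)
```

The main loop can then call `result(timeout=...)` on the returned future and carry on if the wait times out. Actually cancelling the in-flight write would still need a signal delivered to the worker, which is the extra machinery a kernel-side timeout option would avoid.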
On Fri, Apr 01, 2022 at 02:11:51AM -0700, Yosry Ahmed wrote: > On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin > <roman.gushchin@linux.dev> wrote: > > > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > > > From: Shakeel Butt <shakeelb@google.com> > > > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > Use case: Proactive Reclaim > > > --------------------------- > > > > > > A userspace proactive reclaimer can continuously probe the memcg to > > > reclaim a small amount of memory. This gives more accurate and > > > up-to-date workingset estimation as the LRUs are continuously > > > sorted and can potentially provide more deterministic memory > > > overcommit behavior. The memory overcommit controller can provide > > > more proactive response to the changing behavior of the running > > > applications instead of being reactive. > > > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > > for kswapd or direct reclaim, it is to proactively identify memory savings > > > opportunities and reclaim some amount of cold pages set by the policy > > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > > > A user space proactive reclaimer is used in Google data centers. > > > Additionally, Meta's TMO paper recently referenced a very similar > > > interface used for user space proactive reclaim: > > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > > > Benefits of a user space reclaimer: > > > ----------------------------------- > > > > > > 1) More flexible on who should be charged for the cpu of the memory > > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > > overcommit controller can balance the cost between the cpu usage and > > > the memory reclaimed. 
> > > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > > under memory pressure better reclaim candidates are selected. This also > > > gives more accurate and uptodate notion of working set for an > > > application. > > > > > > Why memory.high is not enough? > > > ------------------------------ > > > > > > - memory.high can be used to trigger reclaim in a memcg and can > > > potentially be used for proactive reclaim. > > > However there is a big downside in using memory.high. It can potentially > > > introduce high reclaim stalls in the target application as the > > > allocations from the processes or the threads of the application can hit > > > the temporary memory.high limit. > > > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > > how much memory to proactively reclaim from a workload. The metrics > > > used for this are usually either refaults or PSI, and these metrics > > > will become messy if the application gets throttled by hitting the > > > high limit. > > > > > > - memory.high is a stateful interface, if the userspace proactive > > > reclaimer crashes for any reason while triggering reclaim it can leave > > > the application in a bad state. > > > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > > reclaim memory can result in actually reclaiming more memory than > > > intended. > > > > > > The benefits of such interface and shortcomings of existing interface > > > were further discussed in this RFC thread: > > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > Hello! > > > > I'm totally up for the proposed feature! It makes total sense and is proved > > to be useful, let's add it. > > > > > > > > Interface: > > > ---------- > > > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > > trigger reclaim in the target memory cgroup. 
> > > > > > > > > Possible Extensions: > > > -------------------- > > > > > > - This interface can be extended with an additional parameter or flags > > > to allow specifying one or more types of memory to reclaim from (e.g. > > > file, anon, ..). > > > > > > - The interface can also be extended with a node mask to reclaim from > > > specific nodes. This has use cases for reclaim-based demotion in memory > > > tiering systens. > > > > > > - A similar per-node interface can also be added to support proactive > > > reclaim and reclaim-based demotion in systems without memcg. > > > > Maybe an option to specify a timeout? That might simplify the userspace part. > > Also, please please add a test to selftests/cgroup/memcg tests. > > It will also provide an example on how the userspace can use the feature. > > > > Hi Roman, thanks for taking the time to review this! > > A timeout can be a good extension, I will add it to the commit message > in the next version in possible extensions. > > I will add a test in v2, thanks! Great, thank you! > > > > > > > > > For now, let's keep things simple by adding the basic functionality. > > > > What I'm worried about is how we gonna extend it? How do you see the interface > > with 2-3 extensions from the list above? All these extensions look very > > reasonable to me, so we'll likely have to implement them soon. So let's think > > about the extensibility now. > > > > My idea is to have these extensions as optional positional arguments > (like Wei suggested), so that the interface does not get too > complicated for users who don't care about tuning these options. If > this is the case then I think there is nothing to worry about. > Otherwise, if you think some of these options make sense to be a > required argument instead, we can rethink the initial interface. The interface you're proposing is not really extensible, so we'll likely need to introduce a new interface like memory.reclaim_ext very soon. 
Why not create an extensible API from scratch? I'm looking at cgroup v2 documentation which describes various interface files formats and it seems like given the number of potential optional arguments the best option is nested keyed (please, refer to the Interface Files section). E.g. the format can be: echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim We can say that now we don't support any keyed arguments, but they can be added in the future. Basically you don't even need to change any code, only document the interface properly, so we can extend it later without breaking the API. > > > I wonder if it makes more sense to introduce a sys_reclaim() syscall instead? > > In the end, such a feature might make sense on the system level too. > > Yes, there is the drop_caches sysctl, but it's too radical for many cases. > > > > I think in the RFC discussion there was consensus to add both a > per-memcg knob, as well as per-node / per-system knobs (through sysfs > or syscalls) later. Wei also points out that it's not common for a > syscall to have a cgroup argument. Actually there are examples (e.g. sys_bpf), but my only point is to make the API extensible, so maybe syscall is not the best idea. I'd add the root level interface from scratch: the code change is simple and it makes sense as a feature. Then likely we don't really need another system-level interface at all. Thanks!
On Fri, Apr 1, 2022 at 8:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Mar 31, 2022 at 09:05:15PM -0700, Wei Xu wrote: > > On Thu, Mar 31, 2022 at 1:42 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > From: Shakeel Butt <shakeelb@google.com> > > > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > Use case: Proactive Reclaim > > > --------------------------- > > > > > > A userspace proactive reclaimer can continuously probe the memcg to > > > reclaim a small amount of memory. This gives more accurate and > > > up-to-date workingset estimation as the LRUs are continuously > > > sorted and can potentially provide more deterministic memory > > > overcommit behavior. The memory overcommit controller can provide > > > more proactive response to the changing behavior of the running > > > applications instead of being reactive. > > > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > > for kswapd or direct reclaim, it is to proactively identify memory savings > > > opportunities and reclaim some amount of cold pages set by the policy > > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > > > A user space proactive reclaimer is used in Google data centers. > > > Additionally, Meta's TMO paper recently referenced a very similar > > > interface used for user space proactive reclaim: > > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > > > Benefits of a user space reclaimer: > > > ----------------------------------- > > > > > > 1) More flexible on who should be charged for the cpu of the memory > > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > > overcommit controller can balance the cost between the cpu usage and > > > the memory reclaimed. 
> > > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > > under memory pressure better reclaim candidates are selected. This also > > > gives more accurate and uptodate notion of working set for an > > > application. > > > > > > Why memory.high is not enough? > > > ------------------------------ > > > > > > - memory.high can be used to trigger reclaim in a memcg and can > > > potentially be used for proactive reclaim. > > > However there is a big downside in using memory.high. It can potentially > > > introduce high reclaim stalls in the target application as the > > > allocations from the processes or the threads of the application can hit > > > the temporary memory.high limit. > > > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > > how much memory to proactively reclaim from a workload. The metrics > > > used for this are usually either refaults or PSI, and these metrics > > > will become messy if the application gets throttled by hitting the > > > high limit. > > > > > > - memory.high is a stateful interface, if the userspace proactive > > > reclaimer crashes for any reason while triggering reclaim it can leave > > > the application in a bad state. > > > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > > reclaim memory can result in actually reclaiming more memory than > > > intended. > > > > > > The benefits of such interface and shortcomings of existing interface > > > were further discussed in this RFC thread: > > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > > > Interface: > > > ---------- > > > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > > trigger reclaim in the target memory cgroup. 
> > > > > > > > > Possible Extensions: > > > -------------------- > > > > > > - This interface can be extended with an additional parameter or flags > > > to allow specifying one or more types of memory to reclaim from (e.g. > > > file, anon, ..). > > > > > > - The interface can also be extended with a node mask to reclaim from > > > specific nodes. This has use cases for reclaim-based demotion in memory > > > tiering systens. > > > > > > - A similar per-node interface can also be added to support proactive > > > reclaim and reclaim-based demotion in systems without memcg. > > > > > > For now, let's keep things simple by adding the basic functionality. > > > > > > [yosryahmed@google.com: refreshed to current master, updated commit > > > message based on recent discussions and use cases] > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > --- > > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > > 2 files changed, 46 insertions(+) > > > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > > index 69d7a6983f78..925aaabb2247 100644 > > > --- a/Documentation/admin-guide/cgroup-v2.rst > > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > > @@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back. > > > high limit is used and monitored properly, this limit's > > > utility is limited to providing the final safety net. > > > > > > + memory.reclaim > > > + A write-only file which exists on non-root cgroups. > > > + > > > + This is a simple interface to trigger memory reclaim in the > > > + target cgroup. Write the number of bytes to reclaim to this > > > + file and the kernel will try to reclaim that much memory. > > > + Please note that the kernel can over or under reclaim from > > > + the target cgroup. 
> > > + > > > memory.oom.group > > > A read-write single value file which exists on non-root > > > cgroups. The default value is "0". > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 725f76723220..994849fab7df 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > > return nbytes; > > > } > > > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > > + size_t nbytes, loff_t off) > > > +{ > > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > > + int err; > > > + > > > + buf = strstrip(buf); > > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > > + if (err) > > > + return err; > > > + > > > + while (nr_reclaimed < nr_to_reclaim) { > > > + unsigned long reclaimed; > > > + > > > + if (signal_pending(current)) > > > + break; > > > + > > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > > + nr_to_reclaim - nr_reclaimed, > > > + GFP_KERNEL, true); > > > + > > > + if (!reclaimed && !nr_retries--) > > > + break; > > > + > > > + nr_reclaimed += reclaimed; > > > + } > > > + > > > + return nbytes; > > > > It is better to return an error code (e.g. -EBUSY) when > > memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory, > > except if the cgroup memory usage is already 0. We can also return > > -EINVAL if nr_to_reclaim is too large (e.g. > limit). > > For -EBUSY, are you thinking of a specific usecase where that would > come in handy? I'm not really opposed to it, but couldn't convince > myself of the practical benefits of it, either. > > Keep in mind that MAX_RECLAIM_RETRIES failed reclaim attempts usually > constitute an OOM situation: memory.max will issue kills and > memory.high will begin crippling throttling. 
In what scenario would > you want to keep reclaiming a workload that is considered OOM? > > Certainly, proactive reclaim that wants to purge only the cold tail of > the workload wouldn't retry. Meta's version of this patch actually > does return -EAGAIN on reclaim failure, but the userspace daemon > doesn't do anything with it, so I didn't bring it up. -EAGAIN sounds good, too. Given that the userspace requests to reclaim a specified number of bytes, I think it is generally better to tell the userspace whether the request has been successfully fulfilled. Ideally, it would be even better to return how many bytes that have been reclaimed, though that is not easy to do through the cgroup interface. The userspace can choose to ignore the return value or log a message/update some stats (which Google does) for the monitoring purpose. > For -EINVAL, I tend to lean more toward disagreeing. We've been trying > to avoid arbitrary dependencies between control knobs in cgroup2, just > because it exposes us to race conditions and adds complications to the > interface. For example, it *usually* doesn't make sense to set limits > to 0, or set local limits and protections higher than the parent. But > we allow it anyway, to avoid creating well-intended linting rules that > could interfere with somebody's unforeseen, legitimate usecase. OK, let's then not check against the limit.
On Fri, Apr 01, 2022 at 01:14:35PM -0700, Wei Xu wrote: > On Fri, Apr 1, 2022 at 8:22 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Mar 31, 2022 at 09:05:15PM -0700, Wei Xu wrote: > > > It is better to return an error code (e.g. -EBUSY) when > > > memory_reclaim() fails to reclaim nr_to_reclaim bytes of memory, > > > except if the cgroup memory usage is already 0. We can also return > > > -EINVAL if nr_to_reclaim is too large (e.g. > limit). > > > > For -EBUSY, are you thinking of a specific usecase where that would > > come in handy? I'm not really opposed to it, but couldn't convince > > myself of the practical benefits of it, either. > > > > Keep in mind that MAX_RECLAIM_RETRIES failed reclaim attempts usually > > constitute an OOM situation: memory.max will issue kills and > > memory.high will begin crippling throttling. In what scenario would > > you want to keep reclaiming a workload that is considered OOM? > > > > Certainly, proactive reclaim that wants to purge only the cold tail of > > the workload wouldn't retry. Meta's version of this patch actually > > does return -EAGAIN on reclaim failure, but the userspace daemon > > doesn't do anything with it, so I didn't bring it up. > > -EAGAIN sounds good, too. Given that the userspace requests to > reclaim a specified number of bytes, I think it is generally better to > tell the userspace whether the request has been successfully > fulfilled. Ideally, it would be even better to return how many bytes > that have been reclaimed, though that is not easy to do through the > cgroup interface. The userspace can choose to ignore the return value > or log a message/update some stats (which Google does) for the > monitoring purpose. Fair enough, thanks for your thoughts. No objection from me!
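For illustration, a userspace caller that wants to log a message or update stats could surface the outcome of the write roughly as below. This is a hedged sketch: `request_reclaim` and `describe_result` are hypothetical helpers, and the -EAGAIN case reflects the behaviour agreed on in this sub-thread, not the patch as posted, which always returns nbytes.

```python
import errno
import os


def request_reclaim(reclaim_file: str, amount: str) -> int:
    """Write `amount` to a memory.reclaim file; return 0 or a positive errno."""
    try:
        fd = os.open(reclaim_file, os.O_WRONLY)
    except OSError as exc:
        return exc.errno
    try:
        os.write(fd, amount.encode())
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        os.close(fd)


def describe_result(err: int) -> str:
    """Map the result to a message a daemon might log."""
    if err == 0:
        return "reclaim request completed"
    if err == errno.EAGAIN:
        # Assumed semantics: the kernel reclaimed less than was requested.
        return "reclaimed less than requested"
    if err == errno.EINVAL:
        return "request could not be parsed"
    return "write failed: " + os.strerror(err)
```

A proactive reclaimer that only wants the cold tail can ignore the EAGAIN case entirely, while a monitoring pipeline can count it.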
On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> The interface you're proposing is not really extensible, so we'll likely need to
> introduce a new interface like memory.reclaim_ext very soon. Why not create
> an extensible API from scratch?
>
> I'm looking at cgroup v2 documentation which describes various interface files
> formats and it seems like given the number of potential optional arguments
> the best option is nested keyed (please, refer to the Interface Files section).
>
> E.g. the format can be:
> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim

Yeah, that syntax looks perfect.

But why do you think it's not extensible from the current patch? We can add those arguments one by one as we agree on them, and return -EINVAL if somebody passes an unknown parameter.

It seems to me the current proposal is forward-compatible that way (with the current set of keyword params being the empty set :-))
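The forward-compatibility argument can be illustrated with a small parser model in Python (the kernel would do this in C; this is only a sketch). It accepts the amount followed by optional key=value pairs and rejects any key outside a known set, which is empty today. `parse_reclaim_request`, `parse_size` and `KNOWN_KEYS` are hypothetical names, and the suffix handling is a simplified assumption rather than the kernel's memparse().

```python
import errno

# Keys the interface understands; empty for the patch as posted.
KNOWN_KEYS: frozenset = frozenset()

_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Parse "10M"-style sizes into bytes (simplified, assumed suffix set)."""
    text = text.strip()
    factor = _SUFFIXES.get(text[-1:].upper(), 1)
    digits = text[:-1] if factor != 1 else text
    if not (digits.isascii() and digits.isdigit()):
        raise OSError(errno.EINVAL, "bad size: " + text)
    return int(digits) * factor


def parse_reclaim_request(request: str, known_keys=KNOWN_KEYS):
    """Return (bytes, options); raise OSError(EINVAL) on unknown keys."""
    fields = request.split()
    if not fields:
        raise OSError(errno.EINVAL, "empty request")
    nbytes = parse_size(fields[0])
    options = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if not sep or key not in known_keys:
            raise OSError(errno.EINVAL, "unknown parameter: " + field)
        options[key] = value
    return nbytes, options
```

Adding a parameter later then amounts to growing the known-key set. Requests that were valid before stay valid, and older kernels reject newer keys with EINVAL instead of silently ignoring them.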
> On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
>> The interface you're proposing is not really extensible, so we'll likely need to
>> introduce a new interface like memory.reclaim_ext very soon. Why not create
>> an extensible API from scratch?
>>
>> I'm looking at cgroup v2 documentation which describes various interface files
>> formats and it seems like given the number of potential optional arguments
>> the best option is nested keyed (please, refer to the Interface Files section).
>>
>> E.g. the format can be:
>> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
>
> Yeah, that syntax looks perfect.
>
> But why do you think it's not extensible from the current patch? We
> can add those arguments one by one as we agree on them, and return
> -EINVAL if somebody passes an unknown parameter.
>
> It seems to me the current proposal is forward-compatible that way
> (with the current set of keyword params being the empty set :-))

It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)

So yeah, let’s just describe it properly in the documentation; no code changes are needed.
On Fri, Apr 1, 2022 at 2:21 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> > On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> >> The interface you're proposing is not really extensible, so we'll likely need to
> >> introduce a new interface like memory.reclaim_ext very soon. Why not create
> >> an extensible API from scratch?
> >>
> >> I'm looking at cgroup v2 documentation which describes various interface files
> >> formats and it seems like given the number of potential optional arguments
> >> the best option is nested keyed (please, refer to the Interface Files section).
> >>
> >> E.g. the format can be:
> >> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> >
> > Yeah, that syntax looks perfect.
> >

I agree this is a better syntax than positional arguments. The latter would require a default value to be specified for each earlier argument even if we just want to provide a custom value for a later argument.

> > But why do you think it's not extensible from the current patch? We
> > can add those arguments one by one as we agree on them, and return
> > -EINVAL if somebody passes an unknown parameter.
> >
> > It seems to me the current proposal is forward-compatible that way
> > (with the current set of keyword params being the empty set :-))
>
> It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)
> So yeah, let’s just describe it properly in the documentation, no code changes are needed.
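The point about defaults can be seen from the caller's side. With keyed arguments a caller names only what it wants to override. The sketch below uses a hypothetical helper, `format_reclaim_request`, and the option names are taken from the example in this thread; they are not part of any merged interface.

```python
def format_reclaim_request(amount: str, **options: str) -> str:
    """Build a request such as "1G timeout=30s" from optional keyed arguments.

    Options that are not passed are simply left out, so no placeholder
    values are needed for them (unlike a positional format).
    """
    parts = [amount]
    for key, value in options.items():
        if any(ch.isspace() or ch == "=" for ch in key + value):
            raise ValueError("keys and values must not contain spaces or '='")
        parts.append(key + "=" + value)
    return " ".join(parts)
```

A positional equivalent would have to spell out the type and the nodemask just to reach the timeout, and would also bake in their order.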
On Fri, Apr 01, 2022 at 02:21:52PM -0700, Roman Gushchin wrote:
> > On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> >> The interface you're proposing is not really extensible, so we'll likely need to
> >> introduce a new interface like memory.reclaim_ext very soon. Why not create
> >> an extensible API from scratch?
> >>
> >> I'm looking at cgroup v2 documentation which describes various interface files
> >> formats and it seems like given the number of potential optional arguments
> >> the best option is nested keyed (please, refer to the Interface Files section).
> >>
> >> E.g. the format can be:
> >> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> >
> > Yeah, that syntax looks perfect.
> >
> > But why do you think it's not extensible from the current patch? We
> > can add those arguments one by one as we agree on them, and return
> > -EINVAL if somebody passes an unknown parameter.
> >
> > It seems to me the current proposal is forward-compatible that way
> > (with the current set of keyword params being the empty set :-))
>
> It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)
>
> So yeah, let’s just describe it properly in the documentation, no code changes are needed.

Sounds good to me!
Wei Xu <weixugc@google.com> writes:

> On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote:
>> > From: Shakeel Butt <shakeelb@google.com>
>> >
[snip]
>> > Possible Extensions:
>> > --------------------
>> >
>> > - This interface can be extended with an additional parameter or flags
>> >   to allow specifying one or more types of memory to reclaim from (e.g.
>> >   file, anon, ..).
>> >
>> > - The interface can also be extended with a node mask to reclaim from
>> >   specific nodes. This has use cases for reclaim-based demotion in memory
>> >   tiering systems.
>> >
>> > - A similar per-node interface can also be added to support proactive
>> >   reclaim and reclaim-based demotion in systems without memcg.
>> >
>> > For now, let's keep things simple by adding the basic functionality.
>>
>> Yes, I am for the simplicity and this really looks like a bare minimum
>> interface. But it is not really clear who do you want to add flags on
>> top of it?
>>
>> I am not really sure we really need a node aware interface for memcg.
>> The global reclaim interface will likely need a different node because
>> we do not want to make this CONFIG_MEMCG constrained.
>
> A nodemask argument for memory.reclaim can be useful for memory
> tiering between NUMA nodes with different performance. Similar to
> proactive reclaim, it can allow a userspace daemon to drive
> memcg-based proactive demotion via the reclaim-based demotion
> mechanism in the kernel.

I am not sure whether nodemask is a good way for demoting pages between
different types of memory. For example, for a system with DRAM and
PMEM, if specifying a DRAM node in nodemask means demoting to PMEM, what
is the meaning of specifying a PMEM node? Reclaiming to disk?

In general, I have no objection to the idea. But we should
have a clear and consistent interface. Per my understanding the default
memcg interface is for memory, regardless of memory types. Memory
reclaim means reducing memory usage, regardless of memory types.
We need to either extend the semantics of memory reclaim (to
include memory demotion too), or add another interface for memory
demotion.

Best Regards,
Huang, Ying

[snip]
On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: > > Wei Xu <weixugc@google.com> writes: > > > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: > >> > >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > >> > From: Shakeel Butt <shakeelb@google.com> > >> > > > [snip] > > >> > Possible Extensions: > >> > -------------------- > >> > > >> > - This interface can be extended with an additional parameter or flags > >> > to allow specifying one or more types of memory to reclaim from (e.g. > >> > file, anon, ..). > >> > > >> > - The interface can also be extended with a node mask to reclaim from > >> > specific nodes. This has use cases for reclaim-based demotion in memory > >> > tiering systens. > >> > > >> > - A similar per-node interface can also be added to support proactive > >> > reclaim and reclaim-based demotion in systems without memcg. > >> > > >> > For now, let's keep things simple by adding the basic functionality. > >> > >> Yes, I am for the simplicity and this really looks like a bare minumum > >> interface. But it is not really clear who do you want to add flags on > >> top of it? > >> > >> I am not really sure we really need a node aware interface for memcg. > >> The global reclaim interface will likely need a different node because > >> we do not want to make this CONFIG_MEMCG constrained. > > > > A nodemask argument for memory.reclaim can be useful for memory > > tiering between NUMA nodes with different performance. Similar to > > proactive reclaim, it can allow a userspace daemon to drive > > memcg-based proactive demotion via the reclaim-based demotion > > mechanism in the kernel. > > I am not sure whether nodemask is a good way for demoting pages between > different types of memory. For example, for a system with DRAM and > PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what > is the meaning of specifying PMEM node? reclaiming to disk? 
>
> In general, I have no objection to the idea. But we should
> have a clear and consistent interface. Per my understanding the default
> memcg interface is for memory, regardless of memory types. Memory
> reclaim means reducing memory usage, regardless of memory types.
> We need to either extend the semantics of memory reclaim (to
> include memory demotion too), or add another interface for memory
> demotion.

Good point. With the "demote pages during reclaim" patch series,
reclaim is already extended to demote pages as well. For example,
can_reclaim_anon_pages() returns true if demotion is allowed and
shrink_page_list() can demote pages instead of reclaiming pages.

Currently, demotion is disabled for memcg reclaim. I think this
restriction can be relaxed, and relaxing it is also necessary for
memcg-based proactive demotion. I'd like to suggest that we extend the
semantics of memory.reclaim to cover memory demotion as well. A flag
can be used to enable/disable the demotion behavior.
Apologies for the delayed response,

Yosry Ahmed <yosryahmed@google.com> writes:

> On Fri, Apr 1, 2022 at 1:39 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>>
>>
>> Yosry Ahmed <yosryahmed@google.com> writes:
>> > From: Shakeel Butt <shakeelb@google.com>
>> >
>> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
>> <snip>
>>
>> > +
>> > +        while (nr_reclaimed < nr_to_reclaim) {
>> > +                unsigned long reclaimed;
>> > +
>> > +                if (signal_pending(current))
>> > +                        break;
>> > +
>> > +                reclaimed = try_to_free_mem_cgroup_pages(memcg,
>> > +                                        nr_to_reclaim - nr_reclaimed,
>> > +                                        GFP_KERNEL, true);
>> > +
>> > +                if (!reclaimed && !nr_retries--)
>> > +                        break;
>> > +
>> > +                nr_reclaimed += reclaimed;
>>
>> I think there should be a cond_resched() in this loop before
>> try_to_free_mem_cgroup_pages() to have better chances of reclaim
>> succeeding early.
>>
> Thanks for taking the time to look at this!
>
> I believe this loop is modeled after the loop in memory_high_write()
> for the memory.high interface. Is there a reason why it should be
> needed here but not there?
>

memory_high_write() calls drain_all_stock() at least once before calling
try_to_free_mem_cgroup_pages(). This drains all percpu stocks
for the given memcg and its descendants, giving
try_to_free_mem_cgroup_pages() a high chance of succeeding quickly. Such
functionality is missing from this patch.

Adding a cond_resched() would at least give other processes
within the memcg a chance to run and make forward progress, thereby
making more pages available for reclaim.

The suggestion is partly based on __perform_reclaim(), which issues a
cond_resched() because it may get called repeatedly during the direct
reclaim path.

>> <snip>
>>
>> --
>> Cheers
>> ~ Vaibhav
>
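The termination logic of the quoted loop is easy to misread, so here is a small userspace model of it. This is only a sketch: reclaim_loop(), reclaim_fn, struct pool and pool_reclaim() are hypothetical stand-ins, the callback replaces try_to_free_mem_cgroup_pages(), and the signal_pending() check is left out because it has no direct userspace equivalent here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one reclaim pass: returns pages freed. */
typedef unsigned long (*reclaim_fn)(unsigned long nr_wanted, void *ctx);

/* Same control flow as the quoted loop, minus the signal check. */
unsigned long reclaim_loop(unsigned long nr_to_reclaim, int nr_retries,
			   reclaim_fn reclaim, void *ctx)
{
	unsigned long nr_reclaimed = 0;

	assert(reclaim != NULL);
	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed;

		reclaimed = reclaim(nr_to_reclaim - nr_reclaimed, ctx);

		/* Only passes that free nothing consume the retry budget. */
		if (!reclaimed && !nr_retries--)
			break;

		nr_reclaimed += reclaimed;
	}
	return nr_reclaimed;
}

/* Toy memory pool: frees one page per call until it runs dry. */
struct pool {
	unsigned long pages;
	unsigned long calls;
};

unsigned long pool_reclaim(unsigned long nr_wanted, void *ctx)
{
	struct pool *p = ctx;

	(void)nr_wanted;
	p->calls++;
	if (!p->pages)
		return 0;
	p->pages--;
	return 1;
}
```

In this model, a pass that frees at least one page never touches nr_retries, so a slowly progressing reclaim can keep going for as long as it makes progress. With a retry budget of N, the loop gives up after N + 1 consecutive zero-progress passes.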
On Fri 01-04-22 09:58:59, Roman Gushchin wrote:
> On Fri, Apr 01, 2022 at 03:49:19PM +0200, Michal Hocko wrote:
> > On Thu 31-03-22 10:25:23, Roman Gushchin wrote:
> > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote:
> > [...]
> > > > - A similar per-node interface can also be added to support proactive
> > > >   reclaim and reclaim-based demotion in systems without memcg.
> > >
> > > Maybe an option to specify a timeout? That might simplify the userspace part.
> >
> > What do you mean by timeout here? Isn't
> > timeout $N echo $RECLAIM > ....
> >
> > enough?
>
> It's nice and simple when it's a bash script, but when it's a complex
> application trying to do the same, it quickly becomes less simple and
> likely will require a dedicated thread to avoid blocking the main app
> for too long and a mechanism to unblock it by timer/when the need arises.
>
> In my experience using correctly such semi-blocking interfaces (semi- because
> it's not clearly defined how much time the syscall can take and whether it
> makes sense to wait longer) is tricky.

We have the same approach to setting other limits which need to perform
the reclaim. Have we ever hit that as a limitation that would make
userspace unnecessarily too complex?
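For readers weighing the two positions above, this is roughly what the "timeout $N echo ..." approach looks like inside a C program. It is a sketch under assumptions: reclaim_with_timeout() and on_alarm() are hypothetical names, and it relies on the proposed kernel loop stopping when a signal is pending. What write() returns after such an interruption is not settled in this thread, so the sketch simply passes the errno value through.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocking write(). */
static void on_alarm(int sig)
{
	(void)sig;
}

/*
 * Write `amount` (e.g. "10M") to the file at `path`, giving up after
 * roughly `timeout_sec` seconds. Returns 0 or a negative errno value.
 */
int reclaim_with_timeout(const char *path, const char *amount,
			 unsigned int timeout_sec)
{
	struct sigaction sa, old;
	int fd, err = 0;

	assert(path && amount);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;	/* no SA_RESTART on purpose */
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, &old);

	alarm(timeout_sec);
	if (write(fd, amount, strlen(amount)) < 0)
		err = -errno;
	alarm(0);			/* cancel a timer that did not fire */

	sigaction(SIGALRM, &old, NULL);
	close(fd);
	return err;
}
```

The sketch also shows where Roman's concern comes from: SIGALRM and alarm() are process-wide, so a multi-threaded application would likely need a dedicated thread or a per-thread timer instead, which is the extra complexity being debated.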
On Fri, Apr 1, 2022 at 1:14 PM Wei Xu <weixugc@google.com> wrote:
> [...]
>
> -EAGAIN sounds good, too. Given that the userspace requests to
> reclaim a specified number of bytes, I think it is generally better to
> tell the userspace whether the request has been successfully
> fulfilled. Ideally, it would be even better to return how many bytes
> have been reclaimed, though that is not easy to do through the
> cgroup interface.

What would be the challenge in returning the number of bytes reclaimed
through the cgroup interface?
On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > > From: Shakeel Butt <shakeelb@google.com> > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use case: Proactive Reclaim > > --------------------------- > > > > A userspace proactive reclaimer can continuously probe the memcg to > > reclaim a small amount of memory. This gives more accurate and > > up-to-date workingset estimation as the LRUs are continuously > > sorted and can potentially provide more deterministic memory > > overcommit behavior. The memory overcommit controller can provide > > more proactive response to the changing behavior of the running > > applications instead of being reactive. > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > for kswapd or direct reclaim, it is to proactively identify memory savings > > opportunities and reclaim some amount of cold pages set by the policy > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > A user space proactive reclaimer is used in Google data centers. > > Additionally, Meta's TMO paper recently referenced a very similar > > interface used for user space proactive reclaim: > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > Benefits of a user space reclaimer: > > ----------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. 
> > > > Why memory.high is not enough? > > ------------------------------ > > > > - memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim. > > However there is a big downside in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > how much memory to proactively reclaim from a workload. The metrics > > used for this are usually either refaults or PSI, and these metrics > > will become messy if the application gets throttled by hitting the > > high limit. > > > > - memory.high is a stateful interface, if the userspace proactive > > reclaimer crashes for any reason while triggering reclaim it can leave > > the application in a bad state. > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > reclaim memory can result in actually reclaiming more memory than > > intended. > > > > The benefits of such interface and shortcomings of existing interface > > were further discussed in this RFC thread: > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > Interface: > > ---------- > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > > > Possible Extensions: > > -------------------- > > > > - This interface can be extended with an additional parameter or flags > > to allow specifying one or more types of memory to reclaim from (e.g. > > file, anon, ..). > > > > - The interface can also be extended with a node mask to reclaim from > > specific nodes. This has use cases for reclaim-based demotion in memory > > tiering systens. 
> > > > - A similar per-node interface can also be added to support proactive > > reclaim and reclaim-based demotion in systems without memcg. > > > > For now, let's keep things simple by adding the basic functionality. > > Yes, I am for the simplicity and this really looks like a bare minumum > interface. But it is not really clear who do you want to add flags on > top of it? > Mostly I (or someone at Google) will follow-up with patches to add most of these features. We just wanted to get consensus on the bare minimum interface first, and to avoid derailing this discussion with whether or not we need each of those features and what the best way to implement them is. > I am not really sure we really need a node aware interface for memcg. > The global reclaim interface will likely need a different node because > we do not want to make this CONFIG_MEMCG constrained. > The main use case, as Wei mentioned, is memcg-based proactive demotion via the reclaim-based demotion mechanism in the kernel. We can still have a nodemask argument to the global reclaim interface as well. > > [yosryahmed@google.com: refreshed to current master, updated commit > > message based on recent discussions and use cases] > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > All that being said. I haven't been a great fan for explicit reclaim > triggered from the userspace but I do recognize that limitations of the > existing interfaces is just too restrictive. > > Acked-by: Michal Hocko <mhocko@suse.com> > > Thanks! > -- > Michal Hocko > SUSE Labs
On Fri, Apr 1, 2022 at 11:39 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Fri, Apr 01, 2022 at 02:11:51AM -0700, Yosry Ahmed wrote: > > On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin > > <roman.gushchin@linux.dev> wrote: > > > > > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > > > > From: Shakeel Butt <shakeelb@google.com> > > > > > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > > > Use case: Proactive Reclaim > > > > --------------------------- > > > > > > > > A userspace proactive reclaimer can continuously probe the memcg to > > > > reclaim a small amount of memory. This gives more accurate and > > > > up-to-date workingset estimation as the LRUs are continuously > > > > sorted and can potentially provide more deterministic memory > > > > overcommit behavior. The memory overcommit controller can provide > > > > more proactive response to the changing behavior of the running > > > > applications instead of being reactive. > > > > > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > > > for kswapd or direct reclaim, it is to proactively identify memory savings > > > > opportunities and reclaim some amount of cold pages set by the policy > > > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > > > > > A user space proactive reclaimer is used in Google data centers. > > > > Additionally, Meta's TMO paper recently referenced a very similar > > > > interface used for user space proactive reclaim: > > > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > > > > > Benefits of a user space reclaimer: > > > > ----------------------------------- > > > > > > > > 1) More flexible on who should be charged for the cpu of the memory > > > > reclaim. For proactive reclaim, it makes more sense to be centralized. > > > > > > > > 2) More flexible on dedicating the resources (like cpu). 
The memory > > > > overcommit controller can balance the cost between the cpu usage and > > > > the memory reclaimed. > > > > > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > > > under memory pressure better reclaim candidates are selected. This also > > > > gives more accurate and uptodate notion of working set for an > > > > application. > > > > > > > > Why memory.high is not enough? > > > > ------------------------------ > > > > > > > > - memory.high can be used to trigger reclaim in a memcg and can > > > > potentially be used for proactive reclaim. > > > > However there is a big downside in using memory.high. It can potentially > > > > introduce high reclaim stalls in the target application as the > > > > allocations from the processes or the threads of the application can hit > > > > the temporary memory.high limit. > > > > > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > > > how much memory to proactively reclaim from a workload. The metrics > > > > used for this are usually either refaults or PSI, and these metrics > > > > will become messy if the application gets throttled by hitting the > > > > high limit. > > > > > > > > - memory.high is a stateful interface, if the userspace proactive > > > > reclaimer crashes for any reason while triggering reclaim it can leave > > > > the application in a bad state. > > > > > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > > > reclaim memory can result in actually reclaiming more memory than > > > > intended. > > > > > > > > The benefits of such interface and shortcomings of existing interface > > > > were further discussed in this RFC thread: > > > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > > > Hello! > > > > > > I'm totally up for the proposed feature! It makes total sense and is proved > > > to be useful, let's add it. 
> > > > > > > > > > > Interface: > > > > ---------- > > > > > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > > > trigger reclaim in the target memory cgroup. > > > > > > > > > > > > Possible Extensions: > > > > -------------------- > > > > > > > > - This interface can be extended with an additional parameter or flags > > > > to allow specifying one or more types of memory to reclaim from (e.g. > > > > file, anon, ..). > > > > > > > > - The interface can also be extended with a node mask to reclaim from > > > > specific nodes. This has use cases for reclaim-based demotion in memory > > > > tiering systens. > > > > > > > > - A similar per-node interface can also be added to support proactive > > > > reclaim and reclaim-based demotion in systems without memcg. > > > > > > Maybe an option to specify a timeout? That might simplify the userspace part. > > > Also, please please add a test to selftests/cgroup/memcg tests. > > > It will also provide an example on how the userspace can use the feature. > > > > > > > Hi Roman, thanks for taking the time to review this! > > > > A timeout can be a good extension, I will add it to the commit message > > in the next version in possible extensions. > > > > I will add a test in v2, thanks! > > Great, thank you! > > > > > > > > > > > > > > For now, let's keep things simple by adding the basic functionality. > > > > > > What I'm worried about is how we gonna extend it? How do you see the interface > > > with 2-3 extensions from the list above? All these extensions look very > > > reasonable to me, so we'll likely have to implement them soon. So let's think > > > about the extensibility now. > > > > > > > My idea is to have these extensions as optional positional arguments > > (like Wei suggested), so that the interface does not get too > > complicated for users who don't care about tuning these options. If > > this is the case then I think there is nothing to worry about. 
> > Otherwise, if you think some of these options make sense to be a > > required argument instead, we can rethink the initial interface. > > The interface you're proposing is not really extensible, so we'll likely need to > introduce a new interface like memory.reclaim_ext very soon. Why not create > an extensible API from scratch? > > I'm looking at cgroup v2 documentation which describes various interface files > formats and it seems like given the number of potential optional arguments > the best option is nested keyed (please, refer to the Interface Files section). > > E.g. the format can be: > echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim > > We can say that now we don't support any keyed arguments, but they can be > added in the future. > > Basically you don't even need to change any code, only document the interface > properly, so we can extend it later without breaking the API. > Thanks a lot for suggesting this, it indeed looks very much cleaner. I will make sure to document the interface properly as suggested in v2. > > > > > I wonder if it makes more sense to introduce a sys_reclaim() syscall instead? > > > In the end, such a feature might make sense on the system level too. > > > Yes, there is the drop_caches sysctl, but it's too radical for many cases. > > > > > > > I think in the RFC discussion there was consensus to add both a > > per-memcg knob, as well as per-node / per-system knobs (through sysfs > > or syscalls) later. Wei also points out that it's not common for a > > syscall to have a cgroup argument. > > Actually there are examples (e.g. sys_bpf), but my only point is to make > the API extensible, so maybe syscall is not the best idea. > > I'd add the root level interface from scratch: the code change is simple > and it makes sense as a feature. Then likely we don't really need another > system-level interface at all. 
> I think we would still need a system-level interface anyway for systems with no memcg that wish to make use of proactive memory reclaim. We can still make the memory.reclaim interface available for root as well if you think this is desirable. > Thanks!
On Fri, Apr 1, 2022 at 2:51 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Apr 01, 2022 at 02:21:52PM -0700, Roman Gushchin wrote:
> > > On Apr 1, 2022, at 2:13 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Fri, Apr 01, 2022 at 11:39:30AM -0700, Roman Gushchin wrote:
> > >> The interface you're proposing is not really extensible, so we'll likely need to
> > >> introduce a new interface like memory.reclaim_ext very soon. Why not create
> > >> an extensible API from scratch?
> > >>
> > >> I'm looking at cgroup v2 documentation which describes various interface files
> > >> formats and it seems like given the number of potential optional arguments
> > >> the best option is nested keyed (please, refer to the Interface Files section).
> > >>
> > >> E.g. the format can be:
> > >> echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim
> > >
> > > Yeah, that syntax looks perfect.
> > >
> > > But why do you think it's not extensible from the current patch? We
> > > can add those arguments one by one as we agree on them, and return
> > > -EINVAL if somebody passes an unknown parameter.
> > >
> > > It seems to me the current proposal is forward-compatible that way
> > > (with the current set of keyword params being the empty set :-))
> >
> > It wasn’t obvious to me. We spoke about positional arguments and then it wasn’t clear how to add them in a backward-compatible way. The last thing we want is a bunch of memory.reclaim* interfaces :)
> >
> > So yeah, let’s just describe it properly in the documentation, no code changes are needed.
>
> Sounds good to me!

To summarize for next version:

1) Add selftests.
2) Add documentation for potential future extension, so whoever adds
those features in the future should follow the key-value format Roman is
suggesting.

Yosry, once we have agreement on the return value, please send the next
version resolving these three points.
On Sun, Apr 3, 2022 at 8:50 PM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>
>
> Apologies for the delayed response,
>
> Yosry Ahmed <yosryahmed@google.com> writes:
>
> > On Fri, Apr 1, 2022 at 1:39 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
> >>
> >>
> >> Yosry Ahmed <yosryahmed@google.com> writes:
> >> > From: Shakeel Butt <shakeelb@google.com>
> >> >
> >> > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> >> <snip>
> >>
> >> > +
> >> > +        while (nr_reclaimed < nr_to_reclaim) {
> >> > +                unsigned long reclaimed;
> >> > +
> >> > +                if (signal_pending(current))
> >> > +                        break;
> >> > +
> >> > +                reclaimed = try_to_free_mem_cgroup_pages(memcg,
> >> > +                                        nr_to_reclaim - nr_reclaimed,
> >> > +                                        GFP_KERNEL, true);
> >> > +
> >> > +                if (!reclaimed && !nr_retries--)
> >> > +                        break;
> >> > +
> >> > +                nr_reclaimed += reclaimed;
> >>
> >> I think there should be a cond_resched() in this loop before
> >> try_to_free_mem_cgroup_pages() to have better chances of reclaim
> >> succeeding early.
> >>
> > Thanks for taking the time to look at this!
> >
> > I believe this loop is modeled after the loop in memory_high_write()
> > for the memory.high interface. Is there a reason why it should be
> > needed here but not there?
> >
>
> memory_high_write() calls drain_all_stock() at least once before calling
> try_to_free_mem_cgroup_pages(). This drains all percpu stocks
> for the given memcg and its descendants, giving
> try_to_free_mem_cgroup_pages() a high chance of succeeding quickly. Such
> functionality is missing from this patch.
>
> Adding a cond_resched() would at least give other processes
> within the memcg a chance to run and make forward progress, thereby
> making more pages available for reclaim.
>
> The suggestion is partly based on __perform_reclaim(), which issues a
> cond_resched() because it may get called repeatedly during the direct
> reclaim path.
>

As Michal pointed out, there is already a call to cond_resched() in
shrink_node_memcgs().

>
> >> <snip>
> >>
> >> --
> >> Cheers
> >> ~ Vaibhav
> >
>
> --
> Cheers
> ~ Vaibhav
On Mon, Apr 04, 2022 at 10:13:03AM -0700, Yosry Ahmed wrote: > On Fri, Apr 1, 2022 at 11:39 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Fri, Apr 01, 2022 at 02:11:51AM -0700, Yosry Ahmed wrote: > > > On Thu, Mar 31, 2022 at 10:25 AM Roman Gushchin > > > <roman.gushchin@linux.dev> wrote: > > > > > > > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > > > > > From: Shakeel Butt <shakeelb@google.com> > > > > > > > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > > > > > Use case: Proactive Reclaim > > > > > --------------------------- > > > > > > > > > > A userspace proactive reclaimer can continuously probe the memcg to > > > > > reclaim a small amount of memory. This gives more accurate and > > > > > up-to-date workingset estimation as the LRUs are continuously > > > > > sorted and can potentially provide more deterministic memory > > > > > overcommit behavior. The memory overcommit controller can provide > > > > > more proactive response to the changing behavior of the running > > > > > applications instead of being reactive. > > > > > > > > > > A userspace reclaimer's purpose in this case is not a complete replacement > > > > > for kswapd or direct reclaim, it is to proactively identify memory savings > > > > > opportunities and reclaim some amount of cold pages set by the policy > > > > > to free up the memory for more demanding jobs or scheduling new jobs. > > > > > > > > > > A user space proactive reclaimer is used in Google data centers. > > > > > Additionally, Meta's TMO paper recently referenced a very similar > > > > > interface used for user space proactive reclaim: > > > > > https://dl.acm.org/doi/pdf/10.1145/3503222.3507731 > > > > > > > > > > Benefits of a user space reclaimer: > > > > > ----------------------------------- > > > > > > > > > > 1) More flexible on who should be charged for the cpu of the memory > > > > > reclaim. 
For proactive reclaim, it makes more sense to be centralized. > > > > > > > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > > > > overcommit controller can balance the cost between the cpu usage and > > > > > the memory reclaimed. > > > > > > > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > > > > under memory pressure better reclaim candidates are selected. This also > > > > > gives more accurate and uptodate notion of working set for an > > > > > application. > > > > > > > > > > Why memory.high is not enough? > > > > > ------------------------------ > > > > > > > > > > - memory.high can be used to trigger reclaim in a memcg and can > > > > > potentially be used for proactive reclaim. > > > > > However there is a big downside in using memory.high. It can potentially > > > > > introduce high reclaim stalls in the target application as the > > > > > allocations from the processes or the threads of the application can hit > > > > > the temporary memory.high limit. > > > > > > > > > > - Userspace proactive reclaimers usually use feedback loops to decide > > > > > how much memory to proactively reclaim from a workload. The metrics > > > > > used for this are usually either refaults or PSI, and these metrics > > > > > will become messy if the application gets throttled by hitting the > > > > > high limit. > > > > > > > > > > - memory.high is a stateful interface, if the userspace proactive > > > > > reclaimer crashes for any reason while triggering reclaim it can leave > > > > > the application in a bad state. > > > > > > > > > > - If a workload is rapidly expanding, setting memory.high to proactively > > > > > reclaim memory can result in actually reclaiming more memory than > > > > > intended. 
> > > > > > > > > > The benefits of such interface and shortcomings of existing interface > > > > > were further discussed in this RFC thread: > > > > > https://lore.kernel.org/linux-mm/5df21376-7dd1-bf81-8414-32a73cea45dd@google.com/ > > > > > > > > Hello! > > > > > > > > I'm totally up for the proposed feature! It makes total sense and is proved > > > > to be useful, let's add it. > > > > > > > > > > > > > > Interface: > > > > > ---------- > > > > > > > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > > > > trigger reclaim in the target memory cgroup. > > > > > > > > > > > > > > > Possible Extensions: > > > > > -------------------- > > > > > > > > > > - This interface can be extended with an additional parameter or flags > > > > > to allow specifying one or more types of memory to reclaim from (e.g. > > > > > file, anon, ..). > > > > > > > > > > - The interface can also be extended with a node mask to reclaim from > > > > > specific nodes. This has use cases for reclaim-based demotion in memory > > > > > tiering systens. > > > > > > > > > > - A similar per-node interface can also be added to support proactive > > > > > reclaim and reclaim-based demotion in systems without memcg. > > > > > > > > Maybe an option to specify a timeout? That might simplify the userspace part. > > > > Also, please please add a test to selftests/cgroup/memcg tests. > > > > It will also provide an example on how the userspace can use the feature. > > > > > > > > > > Hi Roman, thanks for taking the time to review this! > > > > > > A timeout can be a good extension, I will add it to the commit message > > > in the next version in possible extensions. > > > > > > I will add a test in v2, thanks! > > > > Great, thank you! > > > > > > > > > > > > > > > > > > > For now, let's keep things simple by adding the basic functionality. > > > > > > > > What I'm worried about is how we gonna extend it? 
How do you see the interface > > > > with 2-3 extensions from the list above? All these extensions look very > > > > reasonable to me, so we'll likely have to implement them soon. So let's think > > > > about the extensibility now. > > > > > > > > > > My idea is to have these extensions as optional positional arguments > > > (like Wei suggested), so that the interface does not get too > > > complicated for users who don't care about tuning these options. If > > > this is the case then I think there is nothing to worry about. > > > Otherwise, if you think some of these options make sense to be a > > > required argument instead, we can rethink the initial interface. > > > > The interface you're proposing is not really extensible, so we'll likely need to > > introduce a new interface like memory.reclaim_ext very soon. Why not create > > an extensible API from scratch? > > > > I'm looking at cgroup v2 documentation which describes various interface files > > formats and it seems like given the number of potential optional arguments > > the best option is nested keyed (please, refer to the Interface Files section). > > > > E.g. the format can be: > > echo "1G type=file nodemask=1-2 timeout=30s" > memory.reclaim > > > > We can say that now we don't support any keyed arguments, but they can be > > added in the future. > > > > Basically you don't even need to change any code, only document the interface > > properly, so we can extend it later without breaking the API. > > > > Thanks a lot for suggesting this, it indeed looks very much cleaner. > > I will make sure to document the interface properly as suggested in v2. > > > > > > > > I wonder if it makes more sense to introduce a sys_reclaim() syscall instead? > > > > In the end, such a feature might make sense on the system level too. > > > > Yes, there is the drop_caches sysctl, but it's too radical for many cases. 
> > > > > > > > > > I think in the RFC discussion there was consensus to add both a > > > per-memcg knob, as well as per-node / per-system knobs (through sysfs > > > or syscalls) later. Wei also points out that it's not common for a > > > syscall to have a cgroup argument. > > > > Actually there are examples (e.g. sys_bpf), but my only point is to make > > the API extensible, so maybe syscall is not the best idea. > > > > I'd add the root level interface from scratch: the code change is simple > > and it makes sense as a feature. Then likely we don't really need another > > system-level interface at all. > > > > I think we would still need a system-level interface anyway for > systems with no memcg that wish to make use of proactive memory > reclaim. We can still make the memory.reclaim interface available for > root as well if you think this is desirable. Yes, I think it's a good idea. !memcg systems is a different story, we can handle them separately. Thanks!
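To make the extensible format from Roman's example concrete ("1G type=file nodemask=1-2 timeout=30s"), here is a minimal userspace sketch in C that splits such a string into a leading amount followed by optional key=value pairs. It is only an illustration of the proposed syntax, not code from the patch: the names parse_size(), parse_reclaim_request() and struct reclaim_req are hypothetical, the K/M/G suffix handling is an assumption modelled loosely on how the kernel parses sizes, and none of the keys shown are supported by the interface as posted.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define MAX_KEYS 8
#define KV_LEN   32

/* Hypothetical container for a parsed request; not a kernel structure. */
struct reclaim_req {
    unsigned long long bytes;        /* leading positional amount */
    int nkeys;                       /* number of key=value pairs found */
    char keys[MAX_KEYS][KV_LEN];
    char vals[MAX_KEYS][KV_LEN];
};

/* Parse "10M", "1G", "4096" (binary K/M/G suffixes; overflow is not checked). */
static int parse_size(const char *s, unsigned long long *out)
{
    char *end;
    unsigned long long v;

    if (!isdigit((unsigned char)*s))
        return -1;
    v = strtoull(s, &end, 10);
    switch (*end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; end++; break;
    default: break;
    }
    if (*end != '\0')
        return -1;
    *out = v;
    return 0;
}

/* Returns 0 on success, -1 if the amount is missing or a token is malformed. */
int parse_reclaim_request(const char *input, struct reclaim_req *req)
{
    const char *p = input;
    int first = 1;

    memset(req, 0, sizeof(*req));
    for (;;) {
        char tok[2 * KV_LEN];
        size_t len;
        char *eq;

        p += strspn(p, " \n");           /* skip separators */
        if (*p == '\0')
            break;
        len = strcspn(p, " \n");         /* length of the next token */
        if (len >= sizeof(tok))
            return -1;
        memcpy(tok, p, len);
        tok[len] = '\0';
        p += len;

        if (first) {                     /* the amount always comes first */
            if (parse_size(tok, &req->bytes))
                return -1;
            first = 0;
            continue;
        }
        eq = strchr(tok, '=');
        if (!eq || eq == tok || eq[1] == '\0' || req->nkeys == MAX_KEYS)
            return -1;
        *eq = '\0';
        if (strlen(tok) >= KV_LEN || strlen(eq + 1) >= KV_LEN)
            return -1;
        strcpy(req->keys[req->nkeys], tok);
        strcpy(req->vals[req->nkeys], eq + 1);
        req->nkeys++;
    }
    return first ? -1 : 0;
}
```

With this shape, a plain "10M" parses to an amount with zero keys, which matches Roman's point that the initial version can accept only the amount while documenting that keyed arguments may be added later.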
On Mon, Apr 04, 2022 at 10:44:04AM +0200, Michal Hocko wrote: > On Fri 01-04-22 09:58:59, Roman Gushchin wrote: > > On Fri, Apr 01, 2022 at 03:49:19PM +0200, Michal Hocko wrote: > > > On Thu 31-03-22 10:25:23, Roman Gushchin wrote: > > > > On Thu, Mar 31, 2022 at 08:41:51AM +0000, Yosry Ahmed wrote: > > > [...] > > > > > - A similar per-node interface can also be added to support proactive > > > > > reclaim and reclaim-based demotion in systems without memcg. > > > > > > > > Maybe an option to specify a timeout? That might simplify the userspace part. > > > > > > What do you mean by timeout here? Isn't > > > timeout $N echo $RECLAIM > .... > > > > > > enough? > > > > It's nice and simple when it's a bash script, but when it's a complex > > application trying to do the same, it quickly becomes less simple and > > likely will require a dedicated thread to avoid blocking the main app > > for too long and a mechanism to unblock it by timer/when the need arises. > > > > In my experience using correctly such semi-blocking interfaces (semi- because > > it's not clearly defined how much time the syscall can take and whether it > > makes sense to wait longer) is tricky. > > We have the same approach to setting other limits which need to perform > the reclaim. Have we ever hit that as a limitation that would make > userspace unnecessarily too complex? The difference here is that some limits are most likely set once and never adjusted, e.g. memory.max or memory.low. I do definitely remember some issues around memory.high, but as I recall, we've fixed them on the kernel side. We've even had a private memory.high.tmp interface with a value and a timeout, which later was replaced with a memory.reclaim interface similar to what we discuss here. But with memory.high we set the limit first, so if a user tries to reclaim a lot of hot memory, it will soon put all processes in the cgroup into the sleep/direct reclaim. So it's not expected to block for too long. 
In general it all comes down to the question of how hard the kernel should try to reclaim the memory before giving up. The userspace might have different needs in different cases. But if the interface is defined so vaguely that it tries for an undefined amount of time and then gives up, it's hard to use it in a predictable manner. Thanks!
On Mon, Apr 4, 2022 at 10:08 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Apr 1, 2022 at 1:14 PM Wei Xu <weixugc@google.com> wrote:
> >
> [...]
> >
> > -EAGAIN sounds good, too. Given that the userspace requests to
> > reclaim a specified number of bytes, I think it is generally better to
> > tell the userspace whether the request has been successfully
> > fulfilled. Ideally, it would be even better to return how many bytes
> > that have been reclaimed, though that is not easy to do through the
> > cgroup interface.
>
> What would be the challenge on returning the number of bytes reclaimed
> through cgroup interface?

write() syscall is used to write the command into memory.reclaim, which should return either the number of command bytes written or -1 (errno is set to indicate the actual error). I think we should not return the number of bytes reclaimed through write().

A new sys_reclaim() is better in this regard because we can define its return value, though it would need a cgroup argument, which is not commonly defined for syscalls.
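Wei's point about write() can be sketched from the caller's side. The helper below is hypothetical (request_reclaim() is not part of any API) and assumes the behaviour discussed in the thread: a successful write() returns the length of the command string, and a failure such as a partially fulfilled request is reported through errno, for example as EAGAIN. In neither case can the caller learn how many bytes were actually reclaimed.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: send a command such as "10M" to a memory.reclaim
 * style file. Returns 0 if the whole command was accepted, otherwise a
 * negative errno value (e.g. -EAGAIN if the kernel reports the request
 * as not fully met, under the semantics proposed in the thread).
 */
int request_reclaim(const char *path, const char *cmd)
{
    size_t len = strlen(cmd);
    ssize_t n;
    int fd, err = 0;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    n = write(fd, cmd, len);
    if (n < 0)
        err = -errno;            /* the only failure channel write() offers */
    else if ((size_t)n != len)
        err = -EIO;              /* short write of the command itself */

    close(fd);
    return err;
}
```

The return value of write() is already spoken for by the number of command bytes consumed, which is why reporting the reclaimed amount would need a different channel, such as the syscall Wei mentions.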
On Mon, Apr 04, 2022 at 10:08:43AM -0700, Shakeel Butt <shakeelb@google.com> wrote:
> What would be the challenge on returning the number of bytes reclaimed
> through cgroup interface?

You'd need an object that represents the write size:

> bfd = open("/sys/kernel/mm/reclaim/balloon", O_RDWR);
> dprintf(bfd, "type=file nodemask=1-2 timeout=30\n")
>
> fd = open("/sys/kernel/fs/cgroup/foo/memory.reclaim", O_WRONLY);
> reclaimed = splice(bfd, NULL, fd, NULL, reclaim_size);

(I'm joking with this API but it is a resembling concept.)

Michal
Wei Xu <weixugc@google.com> writes: > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Wei Xu <weixugc@google.com> writes: >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: >> >> > From: Shakeel Butt <shakeelb@google.com> >> >> > >> >> [snip] >> >> >> > Possible Extensions: >> >> > -------------------- >> >> > >> >> > - This interface can be extended with an additional parameter or flags >> >> > to allow specifying one or more types of memory to reclaim from (e.g. >> >> > file, anon, ..). >> >> > >> >> > - The interface can also be extended with a node mask to reclaim from >> >> > specific nodes. This has use cases for reclaim-based demotion in memory >> >> > tiering systens. >> >> > >> >> > - A similar per-node interface can also be added to support proactive >> >> > reclaim and reclaim-based demotion in systems without memcg. >> >> > >> >> > For now, let's keep things simple by adding the basic functionality. >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum >> >> interface. But it is not really clear who do you want to add flags on >> >> top of it? >> >> >> >> I am not really sure we really need a node aware interface for memcg. >> >> The global reclaim interface will likely need a different node because >> >> we do not want to make this CONFIG_MEMCG constrained. >> > >> > A nodemask argument for memory.reclaim can be useful for memory >> > tiering between NUMA nodes with different performance. Similar to >> > proactive reclaim, it can allow a userspace daemon to drive >> > memcg-based proactive demotion via the reclaim-based demotion >> > mechanism in the kernel. >> >> I am not sure whether nodemask is a good way for demoting pages between >> different types of memory. 
For example, for a system with DRAM and >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what >> is the meaning of specifying PMEM node? reclaiming to disk? >> >> In general, I have no objection to the idea in general. But we should >> have a clear and consistent interface. Per my understanding the default >> memcg interface is for memory, regardless of memory types. The memory >> reclaiming means reduce the memory usage, regardless of memory types. >> We need to either extending the semantics of memory reclaiming (to >> include memory demoting too), or add another interface for memory >> demoting. > > Good point. With the "demote pages during reclaim" patch series, > reclaim is already extended to demote pages as well. For example, > can_reclaim_anon_pages() returns true if demotion is allowed and > shrink_page_list() can demote pages instead of reclaiming pages. These are in-kernel implementation, not the ABI. So we still have the opportunity to define the ABI now. > Currently, demotion is disabled for memcg reclaim, which I think can > be relaxed and also necessary for memcg-based proactive demotion. I'd > like to suggest that we extend the semantics of memory.reclaim to > cover memory demotion as well. A flag can be used to enable/disable > the demotion behavior. If so, # echo A > memory.reclaim means a) "A" bytes memory are freed from the memcg, regardless demoting is used or not. or b) "A" bytes memory are reclaimed from the memcg, some of them may be freed, some of them may be just demoted from DRAM to PMEM. The total number is "A". For me, a) looks more reasonable. Best Regards, Huang, Ying
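The difference between Huang Ying's two readings of `echo A > memory.reclaim` can be stated as a small predicate. The C sketch below is purely illustrative: the enum and request_fulfilled() are hypothetical names, and the split of the result into "freed" and "demoted" byte counts is an assumption made only to compare the two definitions.

```c
#include <stdbool.h>

/* The two candidate meanings of "echo A > memory.reclaim" from the thread. */
enum reclaim_semantics {
    SEM_A_FREED_ONLY,         /* a) A bytes must actually be freed */
    SEM_B_FREED_OR_DEMOTED    /* b) freed plus demoted bytes add up to A */
};

/* Hypothetical check: would a request for `requested` bytes count as met? */
bool request_fulfilled(enum reclaim_semantics sem,
                       unsigned long long requested,
                       unsigned long long freed,
                       unsigned long long demoted)
{
    if (sem == SEM_A_FREED_ONLY)
        return freed >= requested;       /* demotion may happen but is not counted */
    return freed + demoted >= requested;
}
```

For a request of 100M where 40M is freed and 60M is demoted from DRAM to PMEM, the request counts as met under b) but not under a), since the memcg's total usage has only dropped by 40M.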
On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote: > > Wei Xu <weixugc@google.com> writes: > > > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Wei Xu <weixugc@google.com> writes: > >> > >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: > >> >> > >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > >> >> > From: Shakeel Butt <shakeelb@google.com> > >> >> > > >> > >> [snip] > >> > >> >> > Possible Extensions: > >> >> > -------------------- > >> >> > > >> >> > - This interface can be extended with an additional parameter or flags > >> >> > to allow specifying one or more types of memory to reclaim from (e.g. > >> >> > file, anon, ..). > >> >> > > >> >> > - The interface can also be extended with a node mask to reclaim from > >> >> > specific nodes. This has use cases for reclaim-based demotion in memory > >> >> > tiering systens. > >> >> > > >> >> > - A similar per-node interface can also be added to support proactive > >> >> > reclaim and reclaim-based demotion in systems without memcg. > >> >> > > >> >> > For now, let's keep things simple by adding the basic functionality. > >> >> > >> >> Yes, I am for the simplicity and this really looks like a bare minumum > >> >> interface. But it is not really clear who do you want to add flags on > >> >> top of it? > >> >> > >> >> I am not really sure we really need a node aware interface for memcg. > >> >> The global reclaim interface will likely need a different node because > >> >> we do not want to make this CONFIG_MEMCG constrained. > >> > > >> > A nodemask argument for memory.reclaim can be useful for memory > >> > tiering between NUMA nodes with different performance. Similar to > >> > proactive reclaim, it can allow a userspace daemon to drive > >> > memcg-based proactive demotion via the reclaim-based demotion > >> > mechanism in the kernel. 
> >> > >> I am not sure whether nodemask is a good way for demoting pages between > >> different types of memory. For example, for a system with DRAM and > >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what > >> is the meaning of specifying PMEM node? reclaiming to disk? > >> > >> In general, I have no objection to the idea in general. But we should > >> have a clear and consistent interface. Per my understanding the default > >> memcg interface is for memory, regardless of memory types. The memory > >> reclaiming means reduce the memory usage, regardless of memory types. > >> We need to either extending the semantics of memory reclaiming (to > >> include memory demoting too), or add another interface for memory > >> demoting. > > > > Good point. With the "demote pages during reclaim" patch series, > > reclaim is already extended to demote pages as well. For example, > > can_reclaim_anon_pages() returns true if demotion is allowed and > > shrink_page_list() can demote pages instead of reclaiming pages. > > These are in-kernel implementation, not the ABI. So we still have > the opportunity to define the ABI now. > > > Currently, demotion is disabled for memcg reclaim, which I think can > > be relaxed and also necessary for memcg-based proactive demotion. I'd > > like to suggest that we extend the semantics of memory.reclaim to > > cover memory demotion as well. A flag can be used to enable/disable > > the demotion behavior. > > If so, > > # echo A > memory.reclaim > > means > > a) "A" bytes memory are freed from the memcg, regardless demoting is > used or not. > > or > > b) "A" bytes memory are reclaimed from the memcg, some of them may be > freed, some of them may be just demoted from DRAM to PMEM. The total > number is "A". > > For me, a) looks more reasonable. > We can use a DEMOTE flag to control the demotion behavior for memory.reclaim. 
If the flag is not set (the default), then no_demotion of scan_control can be set to 1, similar to reclaim_pages(). The question is then whether we want to rename memory.reclaim to something more general. I think this name is fine if reclaim-based demotion is an accepted concept.
Wei Xu <weixugc@google.com> writes: > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Wei Xu <weixugc@google.com> writes: >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Wei Xu <weixugc@google.com> writes: >> >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: >> >> >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: >> >> >> > From: Shakeel Butt <shakeelb@google.com> >> >> >> > >> >> >> >> [snip] >> >> >> >> >> > Possible Extensions: >> >> >> > -------------------- >> >> >> > >> >> >> > - This interface can be extended with an additional parameter or flags >> >> >> > to allow specifying one or more types of memory to reclaim from (e.g. >> >> >> > file, anon, ..). >> >> >> > >> >> >> > - The interface can also be extended with a node mask to reclaim from >> >> >> > specific nodes. This has use cases for reclaim-based demotion in memory >> >> >> > tiering systens. >> >> >> > >> >> >> > - A similar per-node interface can also be added to support proactive >> >> >> > reclaim and reclaim-based demotion in systems without memcg. >> >> >> > >> >> >> > For now, let's keep things simple by adding the basic functionality. >> >> >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum >> >> >> interface. But it is not really clear who do you want to add flags on >> >> >> top of it? >> >> >> >> >> >> I am not really sure we really need a node aware interface for memcg. >> >> >> The global reclaim interface will likely need a different node because >> >> >> we do not want to make this CONFIG_MEMCG constrained. >> >> > >> >> > A nodemask argument for memory.reclaim can be useful for memory >> >> > tiering between NUMA nodes with different performance. Similar to >> >> > proactive reclaim, it can allow a userspace daemon to drive >> >> > memcg-based proactive demotion via the reclaim-based demotion >> >> > mechanism in the kernel. 
>> >> >> >> I am not sure whether nodemask is a good way for demoting pages between >> >> different types of memory. For example, for a system with DRAM and >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what >> >> is the meaning of specifying PMEM node? reclaiming to disk? >> >> >> >> In general, I have no objection to the idea in general. But we should >> >> have a clear and consistent interface. Per my understanding the default >> >> memcg interface is for memory, regardless of memory types. The memory >> >> reclaiming means reduce the memory usage, regardless of memory types. >> >> We need to either extending the semantics of memory reclaiming (to >> >> include memory demoting too), or add another interface for memory >> >> demoting. >> > >> > Good point. With the "demote pages during reclaim" patch series, >> > reclaim is already extended to demote pages as well. For example, >> > can_reclaim_anon_pages() returns true if demotion is allowed and >> > shrink_page_list() can demote pages instead of reclaiming pages. >> >> These are in-kernel implementation, not the ABI. So we still have >> the opportunity to define the ABI now. >> >> > Currently, demotion is disabled for memcg reclaim, which I think can >> > be relaxed and also necessary for memcg-based proactive demotion. I'd >> > like to suggest that we extend the semantics of memory.reclaim to >> > cover memory demotion as well. A flag can be used to enable/disable >> > the demotion behavior. >> >> If so, >> >> # echo A > memory.reclaim >> >> means >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is >> used or not. >> >> or >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be >> freed, some of them may be just demoted from DRAM to PMEM. The total >> number is "A". >> >> For me, a) looks more reasonable. >> > > We can use a DEMOTE flag to control the demotion behavior for > memory.reclaim. 
If the flag is not set (the default), then > no_demotion of scan_control can be set to 1, similar to > reclaim_pages(). If we have to use a flag to control the behavior, I think it's better to have a separate interface (e.g. memory.demote). But do we really need b)? > The question is then whether we want to rename memory.reclaim to > something more general. I think this name is fine if reclaim-based > demotion is an accepted concept. Best Regards, Huang, Ying
On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote: > > Wei Xu <weixugc@google.com> writes: > > > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Wei Xu <weixugc@google.com> writes: > >> > >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Wei Xu <weixugc@google.com> writes: > >> >> > >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: > >> >> >> > >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > >> >> >> > From: Shakeel Butt <shakeelb@google.com> > >> >> >> > > >> >> > >> >> [snip] > >> >> > >> >> >> > Possible Extensions: > >> >> >> > -------------------- > >> >> >> > > >> >> >> > - This interface can be extended with an additional parameter or flags > >> >> >> > to allow specifying one or more types of memory to reclaim from (e.g. > >> >> >> > file, anon, ..). > >> >> >> > > >> >> >> > - The interface can also be extended with a node mask to reclaim from > >> >> >> > specific nodes. This has use cases for reclaim-based demotion in memory > >> >> >> > tiering systens. > >> >> >> > > >> >> >> > - A similar per-node interface can also be added to support proactive > >> >> >> > reclaim and reclaim-based demotion in systems without memcg. > >> >> >> > > >> >> >> > For now, let's keep things simple by adding the basic functionality. > >> >> >> > >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum > >> >> >> interface. But it is not really clear who do you want to add flags on > >> >> >> top of it? > >> >> >> > >> >> >> I am not really sure we really need a node aware interface for memcg. > >> >> >> The global reclaim interface will likely need a different node because > >> >> >> we do not want to make this CONFIG_MEMCG constrained. > >> >> > > >> >> > A nodemask argument for memory.reclaim can be useful for memory > >> >> > tiering between NUMA nodes with different performance. 
Similar to > >> >> > proactive reclaim, it can allow a userspace daemon to drive > >> >> > memcg-based proactive demotion via the reclaim-based demotion > >> >> > mechanism in the kernel. > >> >> > >> >> I am not sure whether nodemask is a good way for demoting pages between > >> >> different types of memory. For example, for a system with DRAM and > >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what > >> >> is the meaning of specifying PMEM node? reclaiming to disk? > >> >> > >> >> In general, I have no objection to the idea in general. But we should > >> >> have a clear and consistent interface. Per my understanding the default > >> >> memcg interface is for memory, regardless of memory types. The memory > >> >> reclaiming means reduce the memory usage, regardless of memory types. > >> >> We need to either extending the semantics of memory reclaiming (to > >> >> include memory demoting too), or add another interface for memory > >> >> demoting. > >> > > >> > Good point. With the "demote pages during reclaim" patch series, > >> > reclaim is already extended to demote pages as well. For example, > >> > can_reclaim_anon_pages() returns true if demotion is allowed and > >> > shrink_page_list() can demote pages instead of reclaiming pages. > >> > >> These are in-kernel implementation, not the ABI. So we still have > >> the opportunity to define the ABI now. > >> > >> > Currently, demotion is disabled for memcg reclaim, which I think can > >> > be relaxed and also necessary for memcg-based proactive demotion. I'd > >> > like to suggest that we extend the semantics of memory.reclaim to > >> > cover memory demotion as well. A flag can be used to enable/disable > >> > the demotion behavior. > >> > >> If so, > >> > >> # echo A > memory.reclaim > >> > >> means > >> > >> a) "A" bytes memory are freed from the memcg, regardless demoting is > >> used or not. 
> >> > >> or > >> > >> b) "A" bytes memory are reclaimed from the memcg, some of them may be > >> freed, some of them may be just demoted from DRAM to PMEM. The total > >> number is "A". > >> > >> For me, a) looks more reasonable. > >> > > > > We can use a DEMOTE flag to control the demotion behavior for > > memory.reclaim. If the flag is not set (the default), then > > no_demotion of scan_control can be set to 1, similar to > > reclaim_pages(). > > If we have to use a flag to control the behavior, I think it's better to > have a separate interface (e.g. memory.demote). But do we really need b)? > I am fine with either approach: a separate interface similar to memory.reclaim, but dedicated to demotion, or multiplexing memory.reclaim for demotion with a flag. My understanding is that with the "demote pages during reclaim" support, b) is the expected behavior, or more precisely, pages that cannot be demoted may be freed or swapped out. This is reasonable. Demotion-only can also be supported via some arguments to the interface and changes to demotion code in the kernel. After all, this interface is being designed to be extensible based on the discussions so far. > > The question is then whether we want to rename memory.reclaim to > > something more general. I think this name is fine if reclaim-based > > demotion is an accepted concept. > > Best Regards, > Huang, Ying
Wei Xu <weixugc@google.com> writes: > On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Wei Xu <weixugc@google.com> writes: >> >> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Wei Xu <weixugc@google.com> writes: >> >> >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Wei Xu <weixugc@google.com> writes: >> >> >> >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: >> >> >> >> >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: >> >> >> >> > From: Shakeel Butt <shakeelb@google.com> >> >> >> >> > >> >> >> >> >> >> [snip] >> >> >> >> >> >> >> > Possible Extensions: >> >> >> >> > -------------------- >> >> >> >> > >> >> >> >> > - This interface can be extended with an additional parameter or flags >> >> >> >> > to allow specifying one or more types of memory to reclaim from (e.g. >> >> >> >> > file, anon, ..). >> >> >> >> > >> >> >> >> > - The interface can also be extended with a node mask to reclaim from >> >> >> >> > specific nodes. This has use cases for reclaim-based demotion in memory >> >> >> >> > tiering systens. >> >> >> >> > >> >> >> >> > - A similar per-node interface can also be added to support proactive >> >> >> >> > reclaim and reclaim-based demotion in systems without memcg. >> >> >> >> > >> >> >> >> > For now, let's keep things simple by adding the basic functionality. >> >> >> >> >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum >> >> >> >> interface. But it is not really clear who do you want to add flags on >> >> >> >> top of it? >> >> >> >> >> >> >> >> I am not really sure we really need a node aware interface for memcg. >> >> >> >> The global reclaim interface will likely need a different node because >> >> >> >> we do not want to make this CONFIG_MEMCG constrained. 
>> >> >> > >> >> >> > A nodemask argument for memory.reclaim can be useful for memory >> >> >> > tiering between NUMA nodes with different performance. Similar to >> >> >> > proactive reclaim, it can allow a userspace daemon to drive >> >> >> > memcg-based proactive demotion via the reclaim-based demotion >> >> >> > mechanism in the kernel. >> >> >> >> >> >> I am not sure whether nodemask is a good way for demoting pages between >> >> >> different types of memory. For example, for a system with DRAM and >> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what >> >> >> is the meaning of specifying PMEM node? reclaiming to disk? >> >> >> >> >> >> In general, I have no objection to the idea in general. But we should >> >> >> have a clear and consistent interface. Per my understanding the default >> >> >> memcg interface is for memory, regardless of memory types. The memory >> >> >> reclaiming means reduce the memory usage, regardless of memory types. >> >> >> We need to either extending the semantics of memory reclaiming (to >> >> >> include memory demoting too), or add another interface for memory >> >> >> demoting. >> >> > >> >> > Good point. With the "demote pages during reclaim" patch series, >> >> > reclaim is already extended to demote pages as well. For example, >> >> > can_reclaim_anon_pages() returns true if demotion is allowed and >> >> > shrink_page_list() can demote pages instead of reclaiming pages. >> >> >> >> These are in-kernel implementation, not the ABI. So we still have >> >> the opportunity to define the ABI now. >> >> >> >> > Currently, demotion is disabled for memcg reclaim, which I think can >> >> > be relaxed and also necessary for memcg-based proactive demotion. I'd >> >> > like to suggest that we extend the semantics of memory.reclaim to >> >> > cover memory demotion as well. A flag can be used to enable/disable >> >> > the demotion behavior. 
>> >> >> >> If so, >> >> >> >> # echo A > memory.reclaim >> >> >> >> means >> >> >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is >> >> used or not. >> >> >> >> or >> >> >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be >> >> freed, some of them may be just demoted from DRAM to PMEM. The total >> >> number is "A". >> >> >> >> For me, a) looks more reasonable. >> >> >> > >> > We can use a DEMOTE flag to control the demotion behavior for >> > memory.reclaim. If the flag is not set (the default), then >> > no_demotion of scan_control can be set to 1, similar to >> > reclaim_pages(). >> >> If we have to use a flag to control the behavior, I think it's better to >> have a separate interface (e.g. memory.demote). But do we really need b)? >> > > I am fine with either approach: a separate interface similar to > memory.reclaim, but dedicated to demotion, or multiplexing > memory.reclaim for demotion with a flag. > > My understanding is that with the "demote pages during reclaim" > support, b) is the expected behavior, or more precisely, pages that > cannot be demoted may be freed or swapped out. This is reasonable. > Demotion-only can also be supported via some arguments to the > interface and changes to demotion code in the kernel. After all, this > interface is being designed to be extensible based on the discussions > so far. I think we should define the interface not from the current implementation point of view, but from the requirement point of view. For proactive reclaim, per my understanding, the requirement is, we found that there's some cold pages in some workloads, so we can take advantage of the proactive reclaim to reclaim some pages so that other workload can use the freed memory. For proactive demotion, per my understanding, the requirement could be, We found that there's some cold pages in fast memory (e.g. 
DRAM) in some workloads, so we can take advantage of the proactive demotion to demote some pages so that other workloads can use the freed fast memory, given the DRAM partition support Tim (Cced) is working on. Why do we need something in the middle? Best Regards, Huang, Ying >> > The question is then whether we want to rename memory.reclaim to >> > something more general. I think this name is fine if reclaim-based >> > demotion is an accepted concept. >> >> Best Regards, >> Huang, Ying
On Tue, Apr 5, 2022 at 11:32 PM Huang, Ying <ying.huang@intel.com> wrote: > > Wei Xu <weixugc@google.com> writes: > > > On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Wei Xu <weixugc@google.com> writes: > >> > >> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Wei Xu <weixugc@google.com> writes: > >> >> > >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Wei Xu <weixugc@google.com> writes: > >> >> >> > >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: > >> >> >> >> > >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > >> >> >> >> > From: Shakeel Butt <shakeelb@google.com> > >> >> >> >> > > >> >> >> > >> >> >> [snip] > >> >> >> > >> >> >> >> > Possible Extensions: > >> >> >> >> > -------------------- > >> >> >> >> > > >> >> >> >> > - This interface can be extended with an additional parameter or flags > >> >> >> >> > to allow specifying one or more types of memory to reclaim from (e.g. > >> >> >> >> > file, anon, ..). > >> >> >> >> > > >> >> >> >> > - The interface can also be extended with a node mask to reclaim from > >> >> >> >> > specific nodes. This has use cases for reclaim-based demotion in memory > >> >> >> >> > tiering systens. > >> >> >> >> > > >> >> >> >> > - A similar per-node interface can also be added to support proactive > >> >> >> >> > reclaim and reclaim-based demotion in systems without memcg. > >> >> >> >> > > >> >> >> >> > For now, let's keep things simple by adding the basic functionality. > >> >> >> >> > >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum > >> >> >> >> interface. But it is not really clear who do you want to add flags on > >> >> >> >> top of it? > >> >> >> >> > >> >> >> >> I am not really sure we really need a node aware interface for memcg. 
> >> >> >> >> The global reclaim interface will likely need a different node because > >> >> >> >> we do not want to make this CONFIG_MEMCG constrained. > >> >> >> > > >> >> >> > A nodemask argument for memory.reclaim can be useful for memory > >> >> >> > tiering between NUMA nodes with different performance. Similar to > >> >> >> > proactive reclaim, it can allow a userspace daemon to drive > >> >> >> > memcg-based proactive demotion via the reclaim-based demotion > >> >> >> > mechanism in the kernel. > >> >> >> > >> >> >> I am not sure whether nodemask is a good way for demoting pages between > >> >> >> different types of memory. For example, for a system with DRAM and > >> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what > >> >> >> is the meaning of specifying PMEM node? reclaiming to disk? > >> >> >> > >> >> >> In general, I have no objection to the idea in general. But we should > >> >> >> have a clear and consistent interface. Per my understanding the default > >> >> >> memcg interface is for memory, regardless of memory types. The memory > >> >> >> reclaiming means reduce the memory usage, regardless of memory types. > >> >> >> We need to either extending the semantics of memory reclaiming (to > >> >> >> include memory demoting too), or add another interface for memory > >> >> >> demoting. > >> >> > > >> >> > Good point. With the "demote pages during reclaim" patch series, > >> >> > reclaim is already extended to demote pages as well. For example, > >> >> > can_reclaim_anon_pages() returns true if demotion is allowed and > >> >> > shrink_page_list() can demote pages instead of reclaiming pages. > >> >> > >> >> These are in-kernel implementation, not the ABI. So we still have > >> >> the opportunity to define the ABI now. > >> >> > >> >> > Currently, demotion is disabled for memcg reclaim, which I think can > >> >> > be relaxed and also necessary for memcg-based proactive demotion. 
I'd > >> >> > like to suggest that we extend the semantics of memory.reclaim to > >> >> > cover memory demotion as well. A flag can be used to enable/disable > >> >> > the demotion behavior. > >> >> > >> >> If so, > >> >> > >> >> # echo A > memory.reclaim > >> >> > >> >> means > >> >> > >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is > >> >> used or not. > >> >> > >> >> or > >> >> > >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be > >> >> freed, some of them may be just demoted from DRAM to PMEM. The total > >> >> number is "A". > >> >> > >> >> For me, a) looks more reasonable. > >> >> > >> > > >> > We can use a DEMOTE flag to control the demotion behavior for > >> > memory.reclaim. If the flag is not set (the default), then > >> > no_demotion of scan_control can be set to 1, similar to > >> > reclaim_pages(). > >> > >> If we have to use a flag to control the behavior, I think it's better to > >> have a separate interface (e.g. memory.demote). But do we really need b)? > >> > > > > I am fine with either approach: a separate interface similar to > > memory.reclaim, but dedicated to demotion, or multiplexing > > memory.reclaim for demotion with a flag. > > > > My understanding is that with the "demote pages during reclaim" > > support, b) is the expected behavior, or more precisely, pages that > > cannot be demoted may be freed or swapped out. This is reasonable. > > Demotion-only can also be supported via some arguments to the > > interface and changes to demotion code in the kernel. After all, this > > interface is being designed to be extensible based on the discussions > > so far. > > I think we should define the interface not from the current > implementation point of view, but from the requirement point of view. 
> For proactive reclaim, per my understanding, the requirement is,
>
> we found that there's some cold pages in some workloads, so we can
> take advantage of the proactive reclaim to reclaim some pages so that
> other workload can use the freed memory.
>
> For proactive demotion, per my understanding, the requirement could be,
>
> We found that there's some cold pages in fast memory (e.g. DRAM) in
> some workloads, so we can take advantage of the proactive demotion to
> demote some pages so that other workload can use the freed fast
> memory. Given the DRAM partition support Tim (Cced) is working on.
>
> Why do we need something in the middle?

Maybe there is some misunderstanding. As you said, demotion is to
free up fast memory. If pages on fast memory cannot be demoted, but
can still be reclaimed to free some fast memory, it is useful, too.
Certainly, we can also add the support and configure the policy to
only demote, not reclaim, from fast memory in such cases.

In any case, we will not reclaim from slow memory for demotion, if
that is the middle thing you refer to. This is why nodemask is
proposed for memory.reclaim to support the demotion use case. With a
separate memory.demote interface and memory tiering topology among
NUMA nodes being well defined by the kernel and shared with the
userspace, we can omit the nodemask argument.

> Best Regards,
> Huang, Ying
>
> >> > The question is then whether we want to rename memory.reclaim to
> >> > something more general. I think this name is fine if reclaim-based
> >> > demotion is an accepted concept.
> >>
> >> Best Regards,
> >> Huang, Ying
Wei Xu <weixugc@google.com> writes: > On Tue, Apr 5, 2022 at 11:32 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Wei Xu <weixugc@google.com> writes: >> >> > On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Wei Xu <weixugc@google.com> writes: >> >> >> >> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Wei Xu <weixugc@google.com> writes: >> >> >> >> >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> Wei Xu <weixugc@google.com> writes: >> >> >> >> >> >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: >> >> >> >> >> >> >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: >> >> >> >> >> > From: Shakeel Butt <shakeelb@google.com> >> >> >> >> >> > >> >> >> >> >> >> >> >> [snip] >> >> >> >> >> >> >> >> >> > Possible Extensions: >> >> >> >> >> > -------------------- >> >> >> >> >> > >> >> >> >> >> > - This interface can be extended with an additional parameter or flags >> >> >> >> >> > to allow specifying one or more types of memory to reclaim from (e.g. >> >> >> >> >> > file, anon, ..). >> >> >> >> >> > >> >> >> >> >> > - The interface can also be extended with a node mask to reclaim from >> >> >> >> >> > specific nodes. This has use cases for reclaim-based demotion in memory >> >> >> >> >> > tiering systens. >> >> >> >> >> > >> >> >> >> >> > - A similar per-node interface can also be added to support proactive >> >> >> >> >> > reclaim and reclaim-based demotion in systems without memcg. >> >> >> >> >> > >> >> >> >> >> > For now, let's keep things simple by adding the basic functionality. >> >> >> >> >> >> >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum >> >> >> >> >> interface. But it is not really clear who do you want to add flags on >> >> >> >> >> top of it? >> >> >> >> >> >> >> >> >> >> I am not really sure we really need a node aware interface for memcg. 
>> >> >> >> >> The global reclaim interface will likely need a different node because >> >> >> >> >> we do not want to make this CONFIG_MEMCG constrained. >> >> >> >> > >> >> >> >> > A nodemask argument for memory.reclaim can be useful for memory >> >> >> >> > tiering between NUMA nodes with different performance. Similar to >> >> >> >> > proactive reclaim, it can allow a userspace daemon to drive >> >> >> >> > memcg-based proactive demotion via the reclaim-based demotion >> >> >> >> > mechanism in the kernel. >> >> >> >> >> >> >> >> I am not sure whether nodemask is a good way for demoting pages between >> >> >> >> different types of memory. For example, for a system with DRAM and >> >> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what >> >> >> >> is the meaning of specifying PMEM node? reclaiming to disk? >> >> >> >> >> >> >> >> In general, I have no objection to the idea in general. But we should >> >> >> >> have a clear and consistent interface. Per my understanding the default >> >> >> >> memcg interface is for memory, regardless of memory types. The memory >> >> >> >> reclaiming means reduce the memory usage, regardless of memory types. >> >> >> >> We need to either extending the semantics of memory reclaiming (to >> >> >> >> include memory demoting too), or add another interface for memory >> >> >> >> demoting. >> >> >> > >> >> >> > Good point. With the "demote pages during reclaim" patch series, >> >> >> > reclaim is already extended to demote pages as well. For example, >> >> >> > can_reclaim_anon_pages() returns true if demotion is allowed and >> >> >> > shrink_page_list() can demote pages instead of reclaiming pages. >> >> >> >> >> >> These are in-kernel implementation, not the ABI. So we still have >> >> >> the opportunity to define the ABI now. >> >> >> >> >> >> > Currently, demotion is disabled for memcg reclaim, which I think can >> >> >> > be relaxed and also necessary for memcg-based proactive demotion. 
I'd >> >> >> > like to suggest that we extend the semantics of memory.reclaim to >> >> >> > cover memory demotion as well. A flag can be used to enable/disable >> >> >> > the demotion behavior. >> >> >> >> >> >> If so, >> >> >> >> >> >> # echo A > memory.reclaim >> >> >> >> >> >> means >> >> >> >> >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is >> >> >> used or not. >> >> >> >> >> >> or >> >> >> >> >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be >> >> >> freed, some of them may be just demoted from DRAM to PMEM. The total >> >> >> number is "A". >> >> >> >> >> >> For me, a) looks more reasonable. >> >> >> >> >> > >> >> > We can use a DEMOTE flag to control the demotion behavior for >> >> > memory.reclaim. If the flag is not set (the default), then >> >> > no_demotion of scan_control can be set to 1, similar to >> >> > reclaim_pages(). >> >> >> >> If we have to use a flag to control the behavior, I think it's better to >> >> have a separate interface (e.g. memory.demote). But do we really need b)? >> >> >> > >> > I am fine with either approach: a separate interface similar to >> > memory.reclaim, but dedicated to demotion, or multiplexing >> > memory.reclaim for demotion with a flag. >> > >> > My understanding is that with the "demote pages during reclaim" >> > support, b) is the expected behavior, or more precisely, pages that >> > cannot be demoted may be freed or swapped out. This is reasonable. >> > Demotion-only can also be supported via some arguments to the >> > interface and changes to demotion code in the kernel. After all, this >> > interface is being designed to be extensible based on the discussions >> > so far. >> >> I think we should define the interface not from the current >> implementation point of view, but from the requirement point of view. 
>> For proactive reclaim, per my understanding, the requirement is,
>>
>> we found that there's some cold pages in some workloads, so we can
>> take advantage of the proactive reclaim to reclaim some pages so that
>> other workload can use the freed memory.
>>
>> For proactive demotion, per my understanding, the requirement could be,
>>
>> We found that there's some cold pages in fast memory (e.g. DRAM) in
>> some workloads, so we can take advantage of the proactive demotion to
>> demote some pages so that other workload can use the freed fast
>> memory. Given the DRAM partition support Tim (Cced) is working on.
>>
>> Why do we need something in the middle?
>
> Maybe there is some misunderstanding. As you said, demotion is to
> free up fast memory. If pages on fast memory cannot be demoted, but
> can still be reclaimed to free some fast memory, it is useful, too.
> Certainly, we can also add the support and configure the policy to
> only demote, not reclaim, from fast memory in such cases.

Yes. I think it may be better to demote from fast memory nodes only in
such cases. We just free some fast memory proactively. But we may
reclaim from the slow memory node (which are demotion targets) if
necessary.

> In any case, we will not reclaim from slow memory for demotion,

If there's no free pages in the slow memory to accommodate the demoted
pages, why not just reclaim some pages in the slow memory? What are the
disadvantages to do that?

> if that is the middle thing you refer to.

No. I mean,

If we reclaim "A" pages proactively, we will free "A" pages, maybe from
slow memory firstly. The target is the total memory size of a memcg.

If we demote "A" pages proactively, we will demote "A" pages from fast
memory to slow memory. The target is the fast memory size of a memcg.
In the process, some slow memory may be reclaimed to accommodate the
demoted pages.

For me, the middle thing is,

If we demote some pages from fast memory to slow memory proactively and
free some pages from slow memory at the same time, the total number
(demote + free) is "A". There's no clear target. I think this is
confusing. Per my understanding, you don't suggest this too.

> This is why nodemask is
> proposed for memory.reclaim to support the demotion use case. With a
> separate memory.demote interface and memory tiering topology among
> NUMA nodes being well defined by the kernel and shared with the
> userspace, we can omit the nodemask argument.

Yes. Both seems work.

Best Regards,
Huang, Ying

>>
>> >> > The question is then whether we want to rename memory.reclaim to
>> >> > something more general. I think this name is fine if reclaim-based
>> >> > demotion is an accepted concept.
>> >>
>> >> Best Regards,
>> >> Huang, Ying
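The readings of "A" contrasted above (free "A" bytes versus demote "A" bytes) can be made concrete with a toy accounting model. This is only an illustrative sketch: the Memcg class and its fast/slow byte counters are hypothetical names, not kernel structures, and real reclaim operates on pages and LRU lists rather than on two integers. Freeing from slow memory first is one possible order, as suggested in the message above.

```python
from dataclasses import dataclass


@dataclass
class Memcg:
    """Toy memcg: bytes charged on fast (e.g. DRAM) and slow (e.g. PMEM) nodes."""
    fast: int
    slow: int

    @property
    def charged(self) -> int:
        # Total memory charged to the memcg, regardless of memory type.
        return self.fast + self.slow


def reclaim(cg: Memcg, nbytes: int) -> int:
    """Reading a): free nbytes; the target is the total charged size.

    Frees from slow memory first, then from fast memory.
    Returns the number of bytes actually freed.
    """
    from_slow = min(nbytes, cg.slow)
    from_fast = min(nbytes - from_slow, cg.fast)
    cg.slow -= from_slow
    cg.fast -= from_fast
    return from_slow + from_fast


def demote(cg: Memcg, nbytes: int) -> int:
    """Proactive demotion: move nbytes from fast to slow memory.

    The target is the fast memory size; the total charge is unchanged.
    Returns the number of bytes actually demoted.
    """
    moved = min(nbytes, cg.fast)
    cg.fast -= moved
    cg.slow += moved
    return moved


cg = Memcg(fast=100, slow=50)
demote(cg, 30)
print(cg.fast, cg.slow, cg.charged)  # fast shrinks, total charge does not
reclaim(cg, 30)
print(cg.fast, cg.slow, cg.charged)  # total charge shrinks
```

In this model the "middle thing" would be a single call that mixes both operations and reports their sum, so that neither cg.fast nor cg.charged is a well-defined target.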
On Wed, Apr 6, 2022 at 1:49 AM Huang, Ying <ying.huang@intel.com> wrote: > > Wei Xu <weixugc@google.com> writes: > > > On Tue, Apr 5, 2022 at 11:32 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Wei Xu <weixugc@google.com> writes: > >> > >> > On Tue, Apr 5, 2022 at 7:50 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Wei Xu <weixugc@google.com> writes: > >> >> > >> >> > On Tue, Apr 5, 2022 at 5:49 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Wei Xu <weixugc@google.com> writes: > >> >> >> > >> >> >> > On Sat, Apr 2, 2022 at 1:13 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> >> > >> >> >> >> Wei Xu <weixugc@google.com> writes: > >> >> >> >> > >> >> >> >> > On Fri, Apr 1, 2022 at 6:54 AM Michal Hocko <mhocko@suse.com> wrote: > >> >> >> >> >> > >> >> >> >> >> On Thu 31-03-22 08:41:51, Yosry Ahmed wrote: > >> >> >> >> >> > From: Shakeel Butt <shakeelb@google.com> > >> >> >> >> >> > > >> >> >> >> > >> >> >> >> [snip] > >> >> >> >> > >> >> >> >> >> > Possible Extensions: > >> >> >> >> >> > -------------------- > >> >> >> >> >> > > >> >> >> >> >> > - This interface can be extended with an additional parameter or flags > >> >> >> >> >> > to allow specifying one or more types of memory to reclaim from (e.g. > >> >> >> >> >> > file, anon, ..). > >> >> >> >> >> > > >> >> >> >> >> > - The interface can also be extended with a node mask to reclaim from > >> >> >> >> >> > specific nodes. This has use cases for reclaim-based demotion in memory > >> >> >> >> >> > tiering systens. > >> >> >> >> >> > > >> >> >> >> >> > - A similar per-node interface can also be added to support proactive > >> >> >> >> >> > reclaim and reclaim-based demotion in systems without memcg. > >> >> >> >> >> > > >> >> >> >> >> > For now, let's keep things simple by adding the basic functionality. > >> >> >> >> >> > >> >> >> >> >> Yes, I am for the simplicity and this really looks like a bare minumum > >> >> >> >> >> interface. 
But it is not really clear who do you want to add flags on > >> >> >> >> >> top of it? > >> >> >> >> >> > >> >> >> >> >> I am not really sure we really need a node aware interface for memcg. > >> >> >> >> >> The global reclaim interface will likely need a different node because > >> >> >> >> >> we do not want to make this CONFIG_MEMCG constrained. > >> >> >> >> > > >> >> >> >> > A nodemask argument for memory.reclaim can be useful for memory > >> >> >> >> > tiering between NUMA nodes with different performance. Similar to > >> >> >> >> > proactive reclaim, it can allow a userspace daemon to drive > >> >> >> >> > memcg-based proactive demotion via the reclaim-based demotion > >> >> >> >> > mechanism in the kernel. > >> >> >> >> > >> >> >> >> I am not sure whether nodemask is a good way for demoting pages between > >> >> >> >> different types of memory. For example, for a system with DRAM and > >> >> >> >> PMEM, if specifying DRAM node in nodemask means demoting to PMEM, what > >> >> >> >> is the meaning of specifying PMEM node? reclaiming to disk? > >> >> >> >> > >> >> >> >> In general, I have no objection to the idea in general. But we should > >> >> >> >> have a clear and consistent interface. Per my understanding the default > >> >> >> >> memcg interface is for memory, regardless of memory types. The memory > >> >> >> >> reclaiming means reduce the memory usage, regardless of memory types. > >> >> >> >> We need to either extending the semantics of memory reclaiming (to > >> >> >> >> include memory demoting too), or add another interface for memory > >> >> >> >> demoting. > >> >> >> > > >> >> >> > Good point. With the "demote pages during reclaim" patch series, > >> >> >> > reclaim is already extended to demote pages as well. For example, > >> >> >> > can_reclaim_anon_pages() returns true if demotion is allowed and > >> >> >> > shrink_page_list() can demote pages instead of reclaiming pages. > >> >> >> > >> >> >> These are in-kernel implementation, not the ABI. 
So we still have > >> >> >> the opportunity to define the ABI now. > >> >> >> > >> >> >> > Currently, demotion is disabled for memcg reclaim, which I think can > >> >> >> > be relaxed and also necessary for memcg-based proactive demotion. I'd > >> >> >> > like to suggest that we extend the semantics of memory.reclaim to > >> >> >> > cover memory demotion as well. A flag can be used to enable/disable > >> >> >> > the demotion behavior. > >> >> >> > >> >> >> If so, > >> >> >> > >> >> >> # echo A > memory.reclaim > >> >> >> > >> >> >> means > >> >> >> > >> >> >> a) "A" bytes memory are freed from the memcg, regardless demoting is > >> >> >> used or not. > >> >> >> > >> >> >> or > >> >> >> > >> >> >> b) "A" bytes memory are reclaimed from the memcg, some of them may be > >> >> >> freed, some of them may be just demoted from DRAM to PMEM. The total > >> >> >> number is "A". > >> >> >> > >> >> >> For me, a) looks more reasonable. > >> >> >> > >> >> > > >> >> > We can use a DEMOTE flag to control the demotion behavior for > >> >> > memory.reclaim. If the flag is not set (the default), then > >> >> > no_demotion of scan_control can be set to 1, similar to > >> >> > reclaim_pages(). > >> >> > >> >> If we have to use a flag to control the behavior, I think it's better to > >> >> have a separate interface (e.g. memory.demote). But do we really need b)? > >> >> > >> > > >> > I am fine with either approach: a separate interface similar to > >> > memory.reclaim, but dedicated to demotion, or multiplexing > >> > memory.reclaim for demotion with a flag. > >> > > >> > My understanding is that with the "demote pages during reclaim" > >> > support, b) is the expected behavior, or more precisely, pages that > >> > cannot be demoted may be freed or swapped out. This is reasonable. > >> > Demotion-only can also be supported via some arguments to the > >> > interface and changes to demotion code in the kernel. 
> >> > After all, this
> >> > interface is being designed to be extensible based on the discussions
> >> > so far.
> >>
> >> I think we should define the interface not from the current
> >> implementation point of view, but from the requirement point of view.
> >> For proactive reclaim, per my understanding, the requirement is,
> >>
> >> we found that there's some cold pages in some workloads, so we can
> >> take advantage of the proactive reclaim to reclaim some pages so that
> >> other workload can use the freed memory.
> >>
> >> For proactive demotion, per my understanding, the requirement could be,
> >>
> >> We found that there's some cold pages in fast memory (e.g. DRAM) in
> >> some workloads, so we can take advantage of the proactive demotion to
> >> demote some pages so that other workload can use the freed fast
> >> memory. Given the DRAM partition support Tim (Cced) is working on.
> >>
> >> Why do we need something in the middle?
> >
> > Maybe there is some misunderstanding. As you said, demotion is to
> > free up fast memory. If pages on fast memory cannot be demoted, but
> > can still be reclaimed to free some fast memory, it is useful, too.
> > Certainly, we can also add the support and configure the policy to
> > only demote, not reclaim, from fast memory in such cases.
>
> Yes. I think it may be better to demote from fast memory nodes only in
> such cases. We just free some fast memory proactively. But we may
> reclaim from the slow memory node (which are demotion targets) if
> necessary.

It can be a policy choice whether to reclaim from slow memory nodes
for demotion, or reclaim the pages directly from fast memory nodes, or
do nothing, if there isn't enough free space on slow memory nodes for
a proactive demotion request. For example, if the file pages on fast
memory are clean and cold enough, they can be discarded, which should
be cheaper than reclaiming from slow memory nodes and then demoting
these pages. A policy for such behaviors can be specified as an
argument to the proactive demotion interface when it is desired.

> > In any case, we will not reclaim from slow memory for demotion,
>
> If there's no free pages in the slow memory to accommodate the demoted
> pages, why not just reclaim some pages in the slow memory? What are the
> disadvantages to do that?

We can certainly do what you have described through a policy argument.
What I meant is that we will not ask directly via the proactive
demotion interface to reclaim from slow memory nodes and count the
reclaimed bytes as part of the requested bytes.

> > if that is the middle thing you refer to.
>
> No. I mean,
>
> If we reclaim "A" pages proactively, we will free "A" pages, maybe from
> slow memory firstly. The target is the total memory size of a memcg.
>
> If we demote "A" pages proactively, we will demote "A" pages from fast
> memory to slow memory. The target is the fast memory size of a memcg.
> In the process, some slow memory may be reclaimed to accommodate the
> demoted pages.
>
> For me, the middle thing is,
>
> If we demote some pages from fast memory to slow memory proactively and
> free some pages from slow memory at the same time, the total number
> (demote + free) is "A". There's no clear target. I think this is
> confusing. Per my understanding, you don't suggest this too.

I agree and don't suggest this middle thing, either.

> > This is why nodemask is
> > proposed for memory.reclaim to support the demotion use case. With a
> > separate memory.demote interface and memory tiering topology among
> > NUMA nodes being well defined by the kernel and shared with the
> > userspace, we can omit the nodemask argument.
>
> Yes. Both seems work.
>
> Best Regards,
> Huang, Ying
>
> >>
> >> >> > The question is then whether we want to rename memory.reclaim to
> >> >> > something more general. I think this name is fine if reclaim-based
> >> >> > demotion is an accepted concept.
> >> >>
> >> >> Best Regards,
> >> >> Huang, Ying
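The three policy choices named above for a demotion request that does not fit on the slow nodes can be sketched as follows. This is a toy model under stated assumptions: the SlowFullPolicy names and the proactive_demote() helper are hypothetical and do not correspond to any existing kernel interface. The one property taken from the discussion is that bytes reclaimed from slow memory to make room are not counted toward the requested bytes.

```python
from enum import Enum


class SlowFullPolicy(Enum):
    """Hypothetical per-request policies for when slow memory lacks room."""
    RECLAIM_SLOW = "reclaim_slow"  # make room on slow nodes, then demote
    RECLAIM_FAST = "reclaim_fast"  # free the cold fast pages directly
    NOTHING = "nothing"            # leave the shortfall unserved


def proactive_demote(fast_used, slow_used, slow_capacity, nbytes, policy):
    """Toy model of one demotion request of nbytes.

    Returns (fast_used, slow_used, served), where served counts only the
    bytes that left fast memory. Bytes reclaimed from slow memory to make
    room are not counted toward the request.
    """
    want = min(nbytes, fast_used)
    demoted = min(want, slow_capacity - slow_used)
    fast_used -= demoted
    slow_used += demoted
    served = demoted
    shortfall = want - demoted

    if policy is SlowFullPolicy.RECLAIM_SLOW:
        # Evict from slow memory, then demote into the space just freed.
        # Slow usage is unchanged overall; only fast memory shrinks.
        extra = min(shortfall, slow_used)
        fast_used -= extra
        served += extra
    elif policy is SlowFullPolicy.RECLAIM_FAST:
        # Free the remaining cold pages straight from fast memory.
        fast_used -= shortfall
        served += shortfall
    return fast_used, slow_used, served


for policy in SlowFullPolicy:
    print(policy.name, proactive_demote(100, 90, 100, 30, policy))
```

Passing the policy with each call, rather than storing it, mirrors the suggestion that a policy be an argument of the request itself.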
On Wed 06-04-22 14:32:24, Huang, Ying wrote:
[...]
> I think we should define the interface not from the current
> implementation point of view, but from the requirement point of view.

Agreed!

> For proactive reclaim, per my understanding, the requirement is,
>
> we found that there's some cold pages in some workloads, so we can
> take advantage of the proactive reclaim to reclaim some pages so that
> other workload can use the freed memory.

We are talking about memcg here so this is not as much a matter of free
memory as it is to decrease the amount of charged memory. Demotion
cannot achieve that.

> For proactive demotion, per my understanding, the requirement could be,
>
> We found that there's some cold pages in fast memory (e.g. DRAM) in
> some workloads, so we can take advantage of the proactive demotion to
> demote some pages so that other workload can use the freed fast
> memory. Given the DRAM partition support Tim (Cced) is working on.

Yes, this is essentially a kernel assisted memory migration. Userspace
can migrate memory but the issue is that it doesn't have any information
on the aging so the migration has hard time to find suitable memory to
migrate. If we really need this functionality then it would deserve a
separate interface IMHO.
On Wed, 2022-04-06 at 10:49 +0800, Huang, Ying wrote:
>
> > > If so,
> > >
> > > # echo A > memory.reclaim
> > >
> > > means
> > >
> > > a) "A" bytes memory are freed from the memcg, regardless demoting is
> > > used or not.
> > >
> > > or
> > >
> > > b) "A" bytes memory are reclaimed from the memcg, some of them may be
> > > freed, some of them may be just demoted from DRAM to PMEM. The total
> > > number is "A".
> > >
> > > For me, a) looks more reasonable.
> > >
> >
> > We can use a DEMOTE flag to control the demotion behavior for
> > memory.reclaim. If the flag is not set (the default), then
> > no_demotion of scan_control can be set to 1, similar to
> > reclaim_pages().
>
> If we have to use a flag to control the behavior, I think it's better to
> have a separate interface (e.g. memory.demote). But do we really need b)?
>
> > The question is then whether we want to rename memory.reclaim to
> > something more general. I think this name is fine if reclaim-based
> > demotion is an accepted concept.
>

memory.demote will work for 2 levels of memory tiers. But when we have
3 levels of memory (e.g. high bandwidth memory, DRAM and PMEM), it
again gets ambiguous whether we should demote from high bandwidth
memory or DRAM.

Will something like this be more general?

echo X > memory_[dram,pmem,hbm].reclaim

So echo X > memory_dram.reclaim
means that we want to free up X bytes from DRAM for the mem cgroup.

echo demote > memory_dram.reclaim_policy

This means that we prefer demotion for reclaim instead
of swapping to disk.

Tim
On Thu, Apr 7, 2022 at 2:26 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Wed, 2022-04-06 at 10:49 +0800, Huang, Ying wrote:
> >
> > > > If so,
> > > >
> > > > # echo A > memory.reclaim
> > > >
> > > > means
> > > >
> > > > a) "A" bytes memory are freed from the memcg, regardless demoting is
> > > > used or not.
> > > >
> > > > or
> > > >
> > > > b) "A" bytes memory are reclaimed from the memcg, some of them may be
> > > > freed, some of them may be just demoted from DRAM to PMEM. The total
> > > > number is "A".
> > > >
> > > > For me, a) looks more reasonable.
> > > >
> > >
> > > We can use a DEMOTE flag to control the demotion behavior for
> > > memory.reclaim. If the flag is not set (the default), then
> > > no_demotion of scan_control can be set to 1, similar to
> > > reclaim_pages().
> >
> > If we have to use a flag to control the behavior, I think it's better to
> > have a separate interface (e.g. memory.demote). But do we really need b)?
> >
> > > The question is then whether we want to rename memory.reclaim to
> > > something more general. I think this name is fine if reclaim-based
> > > demotion is an accepted concept.
> >
>
> memory.demote will work for 2 levels of memory tiers. But when we have 3 levels
> of memory (e.g. high bandwidth memory, DRAM and PMEM),
> it gets ambiguous again whether we should demote from high bandwidth memory
> or DRAM.
>
> Will something like this be more general?
>
> echo X > memory_[dram,pmem,hbm].reclaim
>
> So echo X > memory_dram.reclaim
> means that we want to free up X bytes from DRAM for the mem cgroup.
>
> echo demote > memory_dram.reclaim_policy
>
> This means that we prefer demotion for reclaim instead
> of swapping to disk.
>

(resending in plain-text, sorry).

memory.demote can work with any level of memory tiers if a nodemask
argument (or a tier argument if there is a more-explicitly defined,
userspace visible tiering representation) is provided. The semantics
can be to demote X bytes from these nodes to their next tier.

memory_dram/memory_pmem assumes the hardware for a particular memory
tier, which is undesirable. For example, it is entirely possible that
a slow memory tier is implemented by a lower-cost/lower-performance
DDR device connected via CXL.mem, not by PMEM. It is better for this
interface to speak in either the NUMA node abstraction or a new tier
abstraction.

It is also desirable to make this interface stateless, i.e. not to
require the setting of memory_dram.reclaim_policy. Any policy can be
specified as arguments to the request itself and should only affect
that particular request.

Wei
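A minimal userspace sketch of the stateless request style described in this message, assuming a cgroup v2 mount and the memory.reclaim file proposed in this series. The helper name memcg_request and the "nodes=" argument are hypothetical: the patch as posted accepts only a byte count, and memory.demote is only a suggestion in this thread. The point is that any policy travels with the write itself, so nothing persists in the cgroup after the call returns.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write one self-contained request to <cgroup_dir>/<file>.
 * "args" may be NULL. Per-request arguments such as "nodes=0-1" are a
 * hypothetical syntax used for illustration only; they are not part of
 * the posted patch.
 * Returns 0 on success or a negative errno value.
 */
int memcg_request(const char *cgroup_dir, const char *file,
                  unsigned long long bytes, const char *args)
{
    char path[4096], req[128];
    ssize_t written;
    int fd, len, ret = 0;

    len = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -ENAMETOOLONG;

    /* The whole request, policy included, goes into a single write. */
    if (args && *args)
        len = snprintf(req, sizeof(req), "%llu %s", bytes, args);
    else
        len = snprintf(req, sizeof(req), "%llu", bytes);
    if (len < 0 || (size_t)len >= sizeof(req))
        return -EINVAL;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    written = write(fd, req, (size_t)len);
    if (written < 0)
        ret = -errno;
    else if (written != len)
        ret = -EIO;

    close(fd);
    return ret;
}
```

For example, memcg_request("/sys/fs/cgroup/job1", "memory.reclaim", 64ULL << 20, NULL) would ask for 64 MiB of reclaim from a cgroup at that (assumed) path. In the patch as posted, the write returns the full length once the reclaim loop ends, so a zero return here does not mean the full amount was reclaimed.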
On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:
>
> (resending in plain-text, sorry).
>
> memory.demote can work with any level of memory tiers if a nodemask
> argument (or a tier argument if there is a more-explicitly defined,
> userspace visible tiering representation) is provided. The semantics
> can be to demote X bytes from these nodes to their next tier.
>

We do need some kind of userspace visible tiering representation.
Will be nice if I can tell the memory type, nodemask of nodes in tier Y with

cat memory.tier_Y

> memory_dram/memory_pmem assumes the hardware for a particular memory
> tier, which is undesirable. For example, it is entirely possible that
> a slow memory tier is implemented by a lower-cost/lower-performance
> DDR device connected via CXL.mem, not by PMEM. It is better for this
> interface to speak in either the NUMA node abstraction or a new tier
> abstraction.

Just from the perspective of memory.reclaim and memory.demote, I think
they could work with nodemask. For ease of management,
some kind of abstraction of tier information like nodemask, memory type
and expected performance should be readily accessible by user space.

Tim

>
> It is also desirable to make this interface stateless, i.e. not to
> require the setting of memory_dram.reclaim_policy. Any policy can be
> specified as arguments to the request itself and should only affect
> that particular request.
>
> Wei
On Thu, Apr 7, 2022 at 4:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:
>
> >
> > (resending in plain-text, sorry).
> >
> > memory.demote can work with any level of memory tiers if a nodemask
> > argument (or a tier argument if there is a more-explicitly defined,
> > userspace visible tiering representation) is provided. The semantics
> > can be to demote X bytes from these nodes to their next tier.
> >
>
> We do need some kind of userspace visible tiering representation.
> Will be nice if I can tell the memory type, nodemask of nodes in tier Y with
>
> cat memory.tier_Y
>
>
> > memory_dram/memory_pmem assumes the hardware for a particular memory
> > tier, which is undesirable. For example, it is entirely possible that
> > a slow memory tier is implemented by a lower-cost/lower-performance
> > DDR device connected via CXL.mem, not by PMEM. It is better for this
> > interface to speak in either the NUMA node abstraction or a new tier
> > abstraction.
>
> Just from the perspective of memory.reclaim and memory.demote, I think
> they could work with nodemask. For ease of management,
> some kind of abstraction of tier information like nodemask, memory type
> and expected performance should be readily accessible by user space.
>

I agree. The tier information should be provided at the system level.
One suggestion is to have a new directory "/sys/devices/system/tier/"
for tiers, e.g.:

/sys/devices/system/tier/tier0/memlist: all memory nodes in tier 0.
/sys/devices/system/tier/tier1/memlist: all memory nodes in tier 1.

We can discuss this tier representation in a new thread.

> Tim
>
> >
> > It is also desirable to make this interface stateless, i.e. not to
> > require the setting of memory_dram.reclaim_policy. Any policy can be
> > specified as arguments to the request itself and should only affect
> > that particular request.
> >
> > Wei
>
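The format of the suggested memlist files is not specified in this thread. The following is a minimal sketch, assuming they would follow the range-list convention that existing sysfs node files such as /sys/devices/system/node/online already use (for example "0-1,4"), of how a userspace tool might turn such a file's content into a node bitmask. The helper name parse_nodelist and the 64-node limit are choices made for this illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a node list such as "0-1,4" (optionally newline-terminated)
 * into a bitmask with bit N set for node N.
 * Only nodes 0..63 are supported in this sketch.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
int parse_nodelist(const char *s, uint64_t *mask)
{
    uint64_t m = 0;
    const char *p = s;

    while (*p && *p != '\n') {
        char *end;
        unsigned long lo, hi, n;

        if (*p < '0' || *p > '9')
            return -1;
        lo = strtoul(p, &end, 10);
        hi = lo;
        p = end;

        if (*p == '-') {            /* range "lo-hi" */
            p++;
            if (*p < '0' || *p > '9')
                return -1;
            hi = strtoul(p, &end, 10);
            p = end;
        }
        if (hi < lo || hi > 63)
            return -1;

        for (n = lo; n <= hi; n++)
            m |= UINT64_C(1) << n;

        if (*p == ',') {            /* another entry must follow */
            p++;
            if (*p < '0' || *p > '9')
                return -1;
        } else if (*p && *p != '\n') {
            return -1;
        }
    }

    *mask = m;
    return 0;
}
```

A mask built this way could then serve as the nodemask argument that the earlier messages propose for memory.reclaim or memory.demote.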
Wei Xu <weixugc@google.com> writes:

> On Thu, Apr 7, 2022 at 4:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>> On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:
>>
>> >
>> > (resending in plain-text, sorry).
>> >
>> > memory.demote can work with any level of memory tiers if a nodemask
>> > argument (or a tier argument if there is a more-explicitly defined,
>> > userspace visible tiering representation) is provided. The semantics
>> > can be to demote X bytes from these nodes to their next tier.
>> >
>>
>> We do need some kind of userspace visible tiering representation.
>> Will be nice if I can tell the memory type, nodemask of nodes in tier Y with
>>
>> cat memory.tier_Y
>>
>>
>> > memory_dram/memory_pmem assumes the hardware for a particular memory
>> > tier, which is undesirable. For example, it is entirely possible that
>> > a slow memory tier is implemented by a lower-cost/lower-performance
>> > DDR device connected via CXL.mem, not by PMEM. It is better for this
>> > interface to speak in either the NUMA node abstraction or a new tier
>> > abstraction.
>>
>> Just from the perspective of memory.reclaim and memory.demote, I think
>> they could work with nodemask. For ease of management,
>> some kind of abstraction of tier information like nodemask, memory type
>> and expected performance should be readily accessible by user space.
>>
>
> I agree. The tier information should be provided at the system level.
> One suggestion is to have a new directory "/sys/devices/system/tier/"
> for tiers, e.g.:
>
> /sys/devices/system/tier/tier0/memlist: all memory nodes in tier 0.
> /sys/devices/system/tier/tier1/memlist: all memory nodes in tier 1.

I think that it may be sufficient to make tier an attribute of "node".
Something like,

/sys/devices/system/node/nodeX/memory_tier

Best Regards,
Huang, Ying

> We can discuss this tier representation in a new thread.
>
>> Tim
>>
>> >
>> > It is also desirable to make this interface stateless, i.e. not to
>> > require the setting of memory_dram.reclaim_policy. Any policy can be
>> > specified as arguments to the request itself and should only affect
>> > that particular request.
>> >
>> > Wei
>>
On Thu, Apr 7, 2022 at 8:08 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Wei Xu <weixugc@google.com> writes:
>
> > On Thu, Apr 7, 2022 at 4:11 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >>
> >> On Thu, 2022-04-07 at 15:12 -0700, Wei Xu wrote:
> >>
> >> >
> >> > (resending in plain-text, sorry).
> >> >
> >> > memory.demote can work with any level of memory tiers if a nodemask
> >> > argument (or a tier argument if there is a more-explicitly defined,
> >> > userspace visible tiering representation) is provided. The semantics
> >> > can be to demote X bytes from these nodes to their next tier.
> >> >
> >>
> >> We do need some kind of userspace visible tiering representation.
> >> Will be nice if I can tell the memory type, nodemask of nodes in tier Y with
> >>
> >> cat memory.tier_Y
> >>
> >>
> >> > memory_dram/memory_pmem assumes the hardware for a particular memory
> >> > tier, which is undesirable. For example, it is entirely possible that
> >> > a slow memory tier is implemented by a lower-cost/lower-performance
> >> > DDR device connected via CXL.mem, not by PMEM. It is better for this
> >> > interface to speak in either the NUMA node abstraction or a new tier
> >> > abstraction.
> >>
> >> Just from the perspective of memory.reclaim and memory.demote, I think
> >> they could work with nodemask. For ease of management,
> >> some kind of abstraction of tier information like nodemask, memory type
> >> and expected performance should be readily accessible by user space.
> >>
> >
> > I agree. The tier information should be provided at the system level.
> > One suggestion is to have a new directory "/sys/devices/system/tier/"
> > for tiers, e.g.:
> >
> > /sys/devices/system/tier/tier0/memlist: all memory nodes in tier 0.
> > /sys/devices/system/tier/tier1/memlist: all memory nodes in tier 1.
>
> I think that it may be sufficient to make tier an attribute of "node".
> Something like,
>
> /sys/devices/system/node/nodeX/memory_tier
>

This works. If we want additional information about each tier, we can
then add a tier-specific subtree.

In addition, it would be good to also expose the demotion target nodes
(node_demotion[]) via sysfs, e.g.:

/sys/devices/system/node/nodeX/demotion_path

which returns node_demotion[X].

> Best Regards,
> Huang, Ying
>
> > We can discuss this tier representation in a new thread.
> >
> >> Tim
> >>
> >> >
> >> > It is also desirable to make this interface stateless, i.e. not to
> >> > require the setting of memory_dram.reclaim_policy. Any policy can be
> >> > specified as arguments to the request itself and should only affect
> >> > that particular request.
> >> >
> >> > Wei
> >>
>
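A minimal sketch of how userspace might consume the per-node attribute suggested here. The memory_tier file is only a proposal in this thread, and this sketch assumes it would hold a single decimal integer; the helper name read_node_tier is hypothetical. The node directory root is passed in as a parameter so that the same code can be pointed at /sys/devices/system/node or at a test directory.

```c
#include <stdio.h>

/*
 * Read <node_root>/node<N>/memory_tier into *tier.
 * Assumes the (proposed, not yet existing) attribute contains one
 * decimal integer. Returns 0 on success, -1 if the file is missing
 * or cannot be parsed.
 */
int read_node_tier(const char *node_root, int node, int *tier)
{
    char path[256];
    FILE *f;
    int len, ok;

    len = snprintf(path, sizeof(path), "%s/node%d/memory_tier",
                   node_root, node);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;

    f = fopen(path, "r");
    if (!f)
        return -1;

    ok = (fscanf(f, "%d", tier) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Grouping nodes by the value returned would give the same information as the per-tier memlist files suggested earlier, which is the trade-off between the two layouts discussed in this exchange.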
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69d7a6983f78..925aaabb2247 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1208,6 +1208,15 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.reclaim
+	A write-only file which exists on non-root cgroups.
+
+	This is a simple interface to trigger memory reclaim in the
+	target cgroup. Write the number of bytes to reclaim to this
+	file and the kernel will try to reclaim that much memory.
+	Please note that the kernel can over or under reclaim from
+	the target cgroup.
+
   memory.oom.group
 	A read-write single value file which exists on non-root
 	cgroups. The default value is "0".
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 725f76723220..994849fab7df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6355,6 +6355,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+
+		reclaimed = try_to_free_mem_cgroup_pages(memcg,
+						nr_to_reclaim - nr_reclaimed,
+						GFP_KERNEL, true);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6413,6 +6445,11 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "reclaim",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.write = memory_reclaim,
+	},
 	{ }	/* terminate */
 };