Message ID | 20200909215752.1725525-1-shakeelb@google.com |
---|---|
State | New, archived |
Series | memcg: introduce per-memcg reclaim interface |
On 2020-09-09T14:57:52-07:00 Shakeel Butt <shakeelb@google.com> wrote: > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use cases: > ---------- > > 1) Per-memcg uswapd: > > Usually applications consists of combination of latency sensitive and > latency tolerant tasks. For example, tasks serving user requests vs > tasks doing data backup for a database application. At the moment the > kernel does not differentiate between such tasks when the application > hits the memcg limits. So, potentially a latency sensitive user facing > task can get stuck in high reclaim and be throttled by the kernel. > > Similarly there are cases of single process applications having two set > of thread pools where threads from one pool have high scheduling > priority and low latency requirement. One concrete example from our > production is the VMM which have high priority low latency thread pool > for the VCPUs while separate thread pool for stats reporting, I/O > emulation, health checks and other managerial operations. The kernel > memory reclaim does not differentiate between VCPU thread or a > non-latency sensitive thread and a VCPU thread can get stuck in high > reclaim. > > One way to resolve this issue is to preemptively trigger the memory > reclaim from a latency tolerant task (uswapd) when the application is > near the limits. Finding 'near the limits' situation is an orthogonal > problem. > > 2) Proactive reclaim: > > This is a similar to the previous use-case, the difference is instead of > waiting for the application to be near its limit to trigger memory > reclaim, continuously pressuring the memcg to reclaim a small amount of > memory. This gives more accurate and uptodate workingset estimation as > the LRUs are continuously sorted and can potentially provide more > deterministic memory overcommit behavior. The memory overcommit > controller can provide more proactive response to the changing behavior > of the running applications instead of being reactive. > > Benefit of user space solution: > ------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to centralized the > overhead while for uswapd, it makes more sense for the application to > pay for the cpu of the memory reclaim. > > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. > > Questions: > ---------- > > 1) Why memory.high is not enough? > > memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim as well as uswapd use cases. > However there is a big negative in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. > > Another issue with memory.high is that it is not delegatable. To > actually use this interface for uswapd, the application has to introduce > another layer of cgroup on whose memory.high it has write access. > > 2) Why uswapd safe from self induced reclaim? > > This is very similar to the scenario of oomd under global memory > pressure. 
We can use the similar mechanisms to protect uswapd from self > induced reclaim i.e. memory.min and mlock. > > Interface options: > ------------------ > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. > > In future we might want to reclaim specific type of memory from a memcg, > so, this interface can be extended to allow that. e.g. > > $ echo 10M [all|anon|file|kmem] > memory.reclaim > > However that should be when we have concrete use-cases for such > functionality. Keep things simple for now. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 6be43781ec7f..58d70b5989d7 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back. > high limit is used and monitored properly, this limit's > utility is limited to providing the final safety net. > > + memory.reclaim > + A write-only file which exists on non-root cgroups. > + > + This is a simple interface to trigger memory reclaim in the > + target cgroup. Write the number of bytes to reclaim to this > + file and the kernel will try to reclaim that much memory. > + Please note that the kernel can over or under reclaim from > + the target cgroup. > + > memory.oom.group > A read-write single value file which exists on non-root > cgroups. The default value is "0". > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 75cd1a1e66c8..2d006c36d7f3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > return nbytes; > } > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > + size_t nbytes, loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > + int err; > + > + buf = strstrip(buf); > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > + if (err) > + return err; > + > + while (nr_reclaimed < nr_to_reclaim) { > + unsigned long reclaimed; > + > + if (signal_pending(current)) > + break; > + > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > + nr_to_reclaim - nr_reclaimed, > + GFP_KERNEL, true); > + > + if (!reclaimed && !nr_retries--) > + break; Shouldn't the if condition use '||' instead of '&&'? I think it could be easier to read if we put the 'nr_retires' condition in the while condition as below (just my personal preference, though). while (nr_reclaimed < nr_to_reclaim && nr_retires--) Thanks, SeongJae Park
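As a concrete illustration of the per-memcg uswapd use case described in the cover letter above, a latency-tolerant userspace thread could preemptively write to memory.reclaim when usage approaches the limit. This is only a minimal sketch; the cgroup path, polling interval, slack threshold, and reclaim step size are assumptions for illustration, not part of the patch:

```c
/* uswapd sketch: preemptively reclaim when the memcg nears memory.high. */
#include <stdio.h>
#include <unistd.h>

#define MEMCG        "/sys/fs/cgroup/job"   /* assumed cgroup path */
#define SLACK        (64UL << 20)           /* start reclaiming 64 MiB before the limit */
#define RECLAIM_STEP (16UL << 20)           /* ask for 16 MiB per iteration */

static unsigned long read_bytes(const char *file)
{
	char path[256];
	unsigned long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), MEMCG "/%s", file);
	f = fopen(path, "r");
	if (f) {
		fscanf(f, "%lu", &val);  /* note: memory.high may read back as "max" */
		fclose(f);
	}
	return val;
}

static void reclaim(unsigned long bytes)
{
	FILE *f = fopen(MEMCG "/memory.reclaim", "w");

	if (f) {
		fprintf(f, "%lu", bytes);  /* kernel may over- or under-reclaim */
		fclose(f);
	}
}

int main(void)
{
	for (;;) {
		unsigned long high = read_bytes("memory.high");
		unsigned long cur = read_bytes("memory.current");

		/* "Near the limit" detection is an orthogonal problem; this is a crude stand-in. */
		if (high && cur + SLACK > high)
			reclaim(RECLAIM_STEP);
		sleep(1);
	}
	return 0;
}
```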
On Wed, Sep 9, 2020 at 11:37 PM SeongJae Park <sjpark@amazon.com> wrote: > > On 2020-09-09T14:57:52-07:00 Shakeel Butt <shakeelb@google.com> wrote: > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use cases: > > ---------- > > > > 1) Per-memcg uswapd: > > > > Usually applications consists of combination of latency sensitive and > > latency tolerant tasks. For example, tasks serving user requests vs > > tasks doing data backup for a database application. At the moment the > > kernel does not differentiate between such tasks when the application > > hits the memcg limits. So, potentially a latency sensitive user facing > > task can get stuck in high reclaim and be throttled by the kernel. > > > > Similarly there are cases of single process applications having two set > > of thread pools where threads from one pool have high scheduling > > priority and low latency requirement. One concrete example from our > > production is the VMM which have high priority low latency thread pool > > for the VCPUs while separate thread pool for stats reporting, I/O > > emulation, health checks and other managerial operations. The kernel > > memory reclaim does not differentiate between VCPU thread or a > > non-latency sensitive thread and a VCPU thread can get stuck in high > > reclaim. > > > > One way to resolve this issue is to preemptively trigger the memory > > reclaim from a latency tolerant task (uswapd) when the application is > > near the limits. Finding 'near the limits' situation is an orthogonal > > problem. > > > > 2) Proactive reclaim: > > > > This is a similar to the previous use-case, the difference is instead of > > waiting for the application to be near its limit to trigger memory > > reclaim, continuously pressuring the memcg to reclaim a small amount of > > memory. This gives more accurate and uptodate workingset estimation as > > the LRUs are continuously sorted and can potentially provide more > > deterministic memory overcommit behavior. The memory overcommit > > controller can provide more proactive response to the changing behavior > > of the running applications instead of being reactive. > > > > Benefit of user space solution: > > ------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to centralized the > > overhead while for uswapd, it makes more sense for the application to > > pay for the cpu of the memory reclaim. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. > > > > Questions: > > ---------- > > > > 1) Why memory.high is not enough? > > > > memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim as well as uswapd use cases. > > However there is a big negative in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > Another issue with memory.high is that it is not delegatable. 
To > > actually use this interface for uswapd, the application has to introduce > > another layer of cgroup on whose memory.high it has write access. > > > > 2) Why uswapd safe from self induced reclaim? > > > > This is very similar to the scenario of oomd under global memory > > pressure. We can use the similar mechanisms to protect uswapd from self > > induced reclaim i.e. memory.min and mlock. > > > > Interface options: > > ------------------ > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > In future we might want to reclaim specific type of memory from a memcg, > > so, this interface can be extended to allow that. e.g. > > > > $ echo 10M [all|anon|file|kmem] > memory.reclaim > > > > However that should be when we have concrete use-cases for such > > functionality. Keep things simple for now. > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > 2 files changed, 46 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 6be43781ec7f..58d70b5989d7 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.reclaim > > + A write-only file which exists on non-root cgroups. > > + > > + This is a simple interface to trigger memory reclaim in the > > + target cgroup. Write the number of bytes to reclaim to this > > + file and the kernel will try to reclaim that much memory. > > + Please note that the kernel can over or under reclaim from > > + the target cgroup. > > + > > memory.oom.group > > A read-write single value file which exists on non-root > > cgroups. The default value is "0". > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 75cd1a1e66c8..2d006c36d7f3 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int err; > > + > > + buf = strstrip(buf); > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > + if (err) > > + return err; > > + > > + while (nr_reclaimed < nr_to_reclaim) { > > + unsigned long reclaimed; > > + > > + if (signal_pending(current)) > > + break; > > + > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > + nr_to_reclaim - nr_reclaimed, > > + GFP_KERNEL, true); > > + > > + if (!reclaimed && !nr_retries--) > > + break; > > Shouldn't the if condition use '||' instead of '&&'? I copied the pattern from memory_high_write(). > I think it could be > easier to read if we put the 'nr_retires' condition in the while condition as > below (just my personal preference, though). > > while (nr_reclaimed < nr_to_reclaim && nr_retires--) > The semantics will be different. In my version, it means tolerate MAX_RECLAIM_RETRIES reclaim failures and your suggestion means total MAX_RECLAIM_RETRIES tries. 
Please note that try_to_free_mem_cgroup_pages() internally does 'nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX)', so, we might need more than MAX_RECLAIM_RETRIES successful tries to actually reclaim the amount of memory the user has requested. > > Thanks, > SeongJae Park
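To make the semantic difference concrete, here is a small standalone sketch with a stub in place of try_to_free_mem_cgroup_pages() and an assumed retry count. The patch's loop spends a retry only on iterations that reclaim nothing, while the suggested form caps the total number of calls:

```c
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16	/* assumed value, for illustration only */

/* Stub standing in for try_to_free_mem_cgroup_pages(): every third call reclaims nothing. */
static unsigned long reclaim_stub(int attempt)
{
	return (attempt % 3 == 0) ? 0 : 32;
}

int main(void)
{
	unsigned long nr_to_reclaim = 512, nr_reclaimed = 0;
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
	int attempts = 0;

	/* Patch's form: only iterations that reclaim nothing consume a retry. */
	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed = reclaim_stub(attempts++);

		if (!reclaimed && !nr_retries--)
			break;
		nr_reclaimed += reclaimed;
	}
	printf("failure budget: %lu pages in %d calls\n", nr_reclaimed, attempts);

	/* Suggested form: every iteration consumes a retry, so at most 16 calls total. */
	nr_reclaimed = 0;
	nr_retries = MAX_RECLAIM_RETRIES;
	attempts = 0;
	while (nr_reclaimed < nr_to_reclaim && nr_retries--)
		nr_reclaimed += reclaim_stub(attempts++);
	printf("total attempts: %lu pages in %d calls\n", nr_reclaimed, attempts);

	return 0;
}
```

With this stub, the first loop keeps going until the full 512 pages are reclaimed (24 calls, 8 of them unproductive), while the second stops after 16 calls having reclaimed only 320 pages — which matches the point that more than MAX_RECLAIM_RETRIES successful calls may be needed when each call reclaims less than requested.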
> On Wed, Sep 9, 2020 at 11:37 PM SeongJae Park <sjpark@amazon.com> wrote: > > > > On 2020-09-09T14:57:52-07:00 Shakeel Butt <shakeelb@google.com> wrote: > > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > Use cases: > > > ---------- > > > > > > 1) Per-memcg uswapd: > > > > > > Usually applications consists of combination of latency sensitive and > > > latency tolerant tasks. For example, tasks serving user requests vs > > > tasks doing data backup for a database application. At the moment the > > > kernel does not differentiate between such tasks when the application > > > hits the memcg limits. So, potentially a latency sensitive user facing > > > task can get stuck in high reclaim and be throttled by the kernel. > > > > > > Similarly there are cases of single process applications having two set > > > of thread pools where threads from one pool have high scheduling > > > priority and low latency requirement. One concrete example from our > > > production is the VMM which have high priority low latency thread pool > > > for the VCPUs while separate thread pool for stats reporting, I/O > > > emulation, health checks and other managerial operations. The kernel > > > memory reclaim does not differentiate between VCPU thread or a > > > non-latency sensitive thread and a VCPU thread can get stuck in high > > > reclaim. > > > > > > One way to resolve this issue is to preemptively trigger the memory > > > reclaim from a latency tolerant task (uswapd) when the application is > > > near the limits. Finding 'near the limits' situation is an orthogonal > > > problem. > > > > > > 2) Proactive reclaim: > > > > > > This is a similar to the previous use-case, the difference is instead of > > > waiting for the application to be near its limit to trigger memory > > > reclaim, continuously pressuring the memcg to reclaim a small amount of > > > memory. This gives more accurate and uptodate workingset estimation as > > > the LRUs are continuously sorted and can potentially provide more > > > deterministic memory overcommit behavior. The memory overcommit > > > controller can provide more proactive response to the changing behavior > > > of the running applications instead of being reactive. > > > > > > Benefit of user space solution: > > > ------------------------------- > > > > > > 1) More flexible on who should be charged for the cpu of the memory > > > reclaim. For proactive reclaim, it makes more sense to centralized the > > > overhead while for uswapd, it makes more sense for the application to > > > pay for the cpu of the memory reclaim. > > > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > > overcommit controller can balance the cost between the cpu usage and > > > the memory reclaimed. > > > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > > under memory pressure better reclaim candidates are selected. This also > > > gives more accurate and uptodate notion of working set for an > > > application. > > > > > > Questions: > > > ---------- > > > > > > 1) Why memory.high is not enough? > > > > > > memory.high can be used to trigger reclaim in a memcg and can > > > potentially be used for proactive reclaim as well as uswapd use cases. > > > However there is a big negative in using memory.high. It can potentially > > > introduce high reclaim stalls in the target application as the > > > allocations from the processes or the threads of the application can hit > > > the temporary memory.high limit. 
> > > > > > Another issue with memory.high is that it is not delegatable. To > > > actually use this interface for uswapd, the application has to introduce > > > another layer of cgroup on whose memory.high it has write access. > > > > > > 2) Why uswapd safe from self induced reclaim? > > > > > > This is very similar to the scenario of oomd under global memory > > > pressure. We can use the similar mechanisms to protect uswapd from self > > > induced reclaim i.e. memory.min and mlock. > > > > > > Interface options: > > > ------------------ > > > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > > trigger reclaim in the target memory cgroup. > > > > > > In future we might want to reclaim specific type of memory from a memcg, > > > so, this interface can be extended to allow that. e.g. > > > > > > $ echo 10M [all|anon|file|kmem] > memory.reclaim > > > > > > However that should be when we have concrete use-cases for such > > > functionality. Keep things simple for now. > > > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > > --- > > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > > 2 files changed, 46 insertions(+) > > > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > > index 6be43781ec7f..58d70b5989d7 100644 > > > --- a/Documentation/admin-guide/cgroup-v2.rst > > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > > @@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back. > > > high limit is used and monitored properly, this limit's > > > utility is limited to providing the final safety net. > > > > > > + memory.reclaim > > > + A write-only file which exists on non-root cgroups. > > > + > > > + This is a simple interface to trigger memory reclaim in the > > > + target cgroup. Write the number of bytes to reclaim to this > > > + file and the kernel will try to reclaim that much memory. > > > + Please note that the kernel can over or under reclaim from > > > + the target cgroup. > > > + > > > memory.oom.group > > > A read-write single value file which exists on non-root > > > cgroups. The default value is "0". > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 75cd1a1e66c8..2d006c36d7f3 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > > return nbytes; > > > } > > > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > > + size_t nbytes, loff_t off) > > > +{ > > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > > + int err; > > > + > > > + buf = strstrip(buf); > > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > > + if (err) > > > + return err; > > > + > > > + while (nr_reclaimed < nr_to_reclaim) { > > > + unsigned long reclaimed; > > > + > > > + if (signal_pending(current)) > > > + break; > > > + > > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > > + nr_to_reclaim - nr_reclaimed, > > > + GFP_KERNEL, true); > > > + > > > + if (!reclaimed && !nr_retries--) > > > + break; > > > > Shouldn't the if condition use '||' instead of '&&'? > > I copied the pattern from memory_high_write(). > > > I think it could be > > easier to read if we put the 'nr_retires' condition in the while condition as > > below (just my personal preference, though). 
> > > > while (nr_reclaimed < nr_to_reclaim && nr_retires--) > > > > The semantics will be different. In my version, it means tolerate > MAX_RECLAIM_RETRIES reclaim failures and your suggestion means total > MAX_RECLAIM_RETRIES tries. > > Please note that try_to_free_mem_cgroup_pages() internally does > 'nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX)', so, we might need > more than MAX_RECLAIM_RETRIES successful tries to actually reclaim the > amount of memory the user has requested. Thanks, understood your intention and agreed on the point. Reviewed-by: SeongJae Park <sjpark@amazon.de> Thanks, SeongJae Park
On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use cases: > ---------- > > 1) Per-memcg uswapd: > > Usually applications consists of combination of latency sensitive and > latency tolerant tasks. For example, tasks serving user requests vs > tasks doing data backup for a database application. At the moment the > kernel does not differentiate between such tasks when the application > hits the memcg limits. So, potentially a latency sensitive user facing > task can get stuck in high reclaim and be throttled by the kernel. > > Similarly there are cases of single process applications having two set > of thread pools where threads from one pool have high scheduling > priority and low latency requirement. One concrete example from our > production is the VMM which have high priority low latency thread pool > for the VCPUs while separate thread pool for stats reporting, I/O > emulation, health checks and other managerial operations. The kernel > memory reclaim does not differentiate between VCPU thread or a > non-latency sensitive thread and a VCPU thread can get stuck in high > reclaim. As those are presumably in the same cgroup what does prevent them to get stuck behind shared resources with taken during the reclaim performed by somebody else? I mean, memory reclaim might drop memory used by the high priority task. Or they might simply stumble over same locks. I am also more interested in actual numbers here. The high limit reclaim is normally swift and should be mostly unnoticeable. If the reclaim gets more expensive then it can get really noticeable for sure. But for the later the same can happen with the external pro-activee reclaimer as well, right? So there is no real "guarantee". Do you have any numbers from your workloads where you can demonstrate that the external reclaim has saved you this amount of effective cpu time of the sensitive workload? (Essentially measure how much time it has to consume in the high limit reclaim) To the feature itself, I am not yet convinced we want to have a feature like that. It surely sounds easy to use and attractive for a better user space control. It is also much well defined than drop_caches/force_empty because it is not all or nothing. But it also sounds like something too easy to use incorrectly (remember drop_caches). I am also a bit worried about corner cases wich would be easier to hit - e.g. fill up the swap limit and turn anonymous memory into unreclaimable and who knows what else.
On Mon, Sep 21, 2020 at 9:30 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use cases: > > ---------- > > > > 1) Per-memcg uswapd: > > > > Usually applications consists of combination of latency sensitive and > > latency tolerant tasks. For example, tasks serving user requests vs > > tasks doing data backup for a database application. At the moment the > > kernel does not differentiate between such tasks when the application > > hits the memcg limits. So, potentially a latency sensitive user facing > > task can get stuck in high reclaim and be throttled by the kernel. > > > > Similarly there are cases of single process applications having two set > > of thread pools where threads from one pool have high scheduling > > priority and low latency requirement. One concrete example from our > > production is the VMM which have high priority low latency thread pool > > for the VCPUs while separate thread pool for stats reporting, I/O > > emulation, health checks and other managerial operations. The kernel > > memory reclaim does not differentiate between VCPU thread or a > > non-latency sensitive thread and a VCPU thread can get stuck in high > > reclaim. > > As those are presumably in the same cgroup what does prevent them to get > stuck behind shared resources with taken during the reclaim performed by > somebody else? I mean, memory reclaim might drop memory used by the high > priority task. Or they might simply stumble over same locks. > Yes there are a lot of challenges in providing isolation between latency sensitive and latency tolerant jobs/threads. This proposal aims to solve one specific challenge memcg limit reclaim. > I am also more interested in actual numbers here. The high limit reclaim > is normally swift and should be mostly unnoticeable. If the reclaim gets > more expensive then it can get really noticeable for sure. But for the > later the same can happen with the external pro-activee reclaimer as I think you meant 'uswapd' here instead of pro-active reclaimer. > well, right? So there is no real "guarantee". Do you have any numbers > from your workloads where you can demonstrate that the external reclaim > has saved you this amount of effective cpu time of the sensitive > workload? (Essentially measure how much time it has to consume in the > high limit reclaim) > What we actually use in our production is the 'proactive reclaim' which I have explained in the original message but I will add a couple more sentences below. For the uswapd use-case, let me point to the previous discussions and feature requests by others [1, 2]. One of the limiting factors of these previous proposals was the lack of CPU accounting of the background reclaimer which the current proposal solves by enabling the user space solution. [1] https://lwn.net/Articles/753162/ [2] http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org Let me add one more point. Even if the high limit reclaim is swift, it can still take 100s of usecs. Most of our jobs are anon-only and we use zswap. Compressing a page can take a couple usec, so 100s of usecs in limit reclaim is normal. For latency sensitive jobs, this amount of hiccups do matters. For the proactive reclaim, based on the refault medium, we define tolerable refault rate of the applications. Then we proactively reclaim memory from the applications and monitor the refault rates. 
Based on the refault rates, the memory overcommit manager controls the aggressiveness of the proactive reclaim. This is exactly what we do in the production. Please let me know if you want to know why we do proactive reclaim in the first place. > To the feature itself, I am not yet convinced we want to have a feature > like that. It surely sounds easy to use and attractive for a better user > space control. It is also much well defined than drop_caches/force_empty > because it is not all or nothing. But it also sounds like something too > easy to use incorrectly (remember drop_caches). I am also a bit worried > about corner cases wich would be easier to hit - e.g. fill up the swap > limit and turn anonymous memory into unreclaimable and who knows what > else. The corner cases you are worried about are already possible with the existing interfaces. We can already do all such things with memory.high interface but with some limitations. This new interface resolves that limitation as explained in the original email. Please let me know if you have more questions. thanks, Shakeel
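For reference, a minimal userspace proactive reclaimer along the lines described in this message could look like the sketch below. The cgroup path, the use of the workingset_refault counter from memory.stat, the interval, the refault budget, and the step size are all assumptions made for illustration, not a description of the production controller:

```c
#include <stdio.h>
#include <unistd.h>

#define MEMCG          "/sys/fs/cgroup/job"   /* assumed cgroup path */
#define RECLAIM_STEP   (16UL << 20)           /* pressure: 16 MiB per interval */
#define REFAULT_BUDGET 100UL                  /* tolerable refaults per interval */

/* Read the workingset_refault counter from memory.stat (field name assumed). */
static unsigned long read_refaults(void)
{
	char line[256];
	unsigned long val = 0;
	FILE *f = fopen(MEMCG "/memory.stat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "workingset_refault %lu", &val) == 1)
			break;
	fclose(f);
	return val;
}

static void reclaim(unsigned long bytes)
{
	FILE *f = fopen(MEMCG "/memory.reclaim", "w");

	if (f) {
		fprintf(f, "%lu", bytes);  /* kernel may over- or under-reclaim */
		fclose(f);
	}
}

int main(void)
{
	unsigned long prev = read_refaults();

	for (;;) {
		sleep(10);

		unsigned long cur = read_refaults();

		/* Keep pressuring while refaults stay under budget; back off otherwise. */
		if (cur - prev <= REFAULT_BUDGET)
			reclaim(RECLAIM_STEP);
		prev = cur;
	}
	return 0;
}
```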
On Mon 21-09-20 10:50:14, Shakeel Butt wrote: > On Mon, Sep 21, 2020 at 9:30 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > Use cases: > > > ---------- > > > > > > 1) Per-memcg uswapd: > > > > > > Usually applications consists of combination of latency sensitive and > > > latency tolerant tasks. For example, tasks serving user requests vs > > > tasks doing data backup for a database application. At the moment the > > > kernel does not differentiate between such tasks when the application > > > hits the memcg limits. So, potentially a latency sensitive user facing > > > task can get stuck in high reclaim and be throttled by the kernel. > > > > > > Similarly there are cases of single process applications having two set > > > of thread pools where threads from one pool have high scheduling > > > priority and low latency requirement. One concrete example from our > > > production is the VMM which have high priority low latency thread pool > > > for the VCPUs while separate thread pool for stats reporting, I/O > > > emulation, health checks and other managerial operations. The kernel > > > memory reclaim does not differentiate between VCPU thread or a > > > non-latency sensitive thread and a VCPU thread can get stuck in high > > > reclaim. > > > > As those are presumably in the same cgroup what does prevent them to get > > stuck behind shared resources with taken during the reclaim performed by > > somebody else? I mean, memory reclaim might drop memory used by the high > > priority task. Or they might simply stumble over same locks. > > > > Yes there are a lot of challenges in providing isolation between > latency sensitive and latency tolerant jobs/threads. This proposal > aims to solve one specific challenge memcg limit reclaim. I am fully aware that a complete isolation is hard to achieve. I am just trying evaluate how is this specific usecase worth a new interface that we will have to maintain for ever. Especially when I suspect that the interface will likely only paper over immediate problems rather than offer a long term maintainable solution for it. > > I am also more interested in actual numbers here. The high limit reclaim > > is normally swift and should be mostly unnoticeable. If the reclaim gets > > more expensive then it can get really noticeable for sure. But for the > > later the same can happen with the external pro-activee reclaimer as > > I think you meant 'uswapd' here instead of pro-active reclaimer. > > > well, right? So there is no real "guarantee". Do you have any numbers > > from your workloads where you can demonstrate that the external reclaim > > has saved you this amount of effective cpu time of the sensitive > > workload? (Essentially measure how much time it has to consume in the > > high limit reclaim) > > > > What we actually use in our production is the 'proactive reclaim' > which I have explained in the original message but I will add a couple > more sentences below. > > For the uswapd use-case, let me point to the previous discussions and > feature requests by others [1, 2]. One of the limiting factors of > these previous proposals was the lack of CPU accounting of the > background reclaimer which the current proposal solves by enabling the > user space solution. > > [1] https://lwn.net/Articles/753162/ > [2] http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org I remember those. 
My understanding was that the only problem is to properly account for CPU on behalf of the reclaimed cgroup and that has been work in progress for that. Outsourcing all that to userspace surely sounds like an attractive option but it comes with usual user API price. More on that later. > Let me add one more point. Even if the high limit reclaim is swift, it > can still take 100s of usecs. Most of our jobs are anon-only and we > use zswap. Compressing a page can take a couple usec, so 100s of usecs > in limit reclaim is normal. For latency sensitive jobs, this amount of > hiccups do matters. Understood. But isn't this an implementation detail of zswap? Can it offload some of the heavy lifting to a different context and reduce the general overhead? > For the proactive reclaim, based on the refault medium, we define > tolerable refault rate of the applications. Then we proactively > reclaim memory from the applications and monitor the refault rates. > Based on the refault rates, the memory overcommit manager controls the > aggressiveness of the proactive reclaim. > > This is exactly what we do in the production. Please let me know if > you want to know why we do proactive reclaim in the first place. This information is definitely useful and having it in the changelog would be useful. IIUC the only reason why you cannot use high limit to control this pro-active reclaim is the potential throttling due to expensive reclaim, correct? > > To the feature itself, I am not yet convinced we want to have a feature > > like that. It surely sounds easy to use and attractive for a better user > > space control. It is also much well defined than drop_caches/force_empty > > because it is not all or nothing. But it also sounds like something too > > easy to use incorrectly (remember drop_caches). I am also a bit worried > > about corner cases wich would be easier to hit - e.g. fill up the swap > > limit and turn anonymous memory into unreclaimable and who knows what > > else. > > The corner cases you are worried about are already possible with the > existing interfaces. We can already do all such things with > memory.high interface but with some limitations. This new interface > resolves that limitation as explained in the original email. You are right that misconfigured limits can result in problems. But such a configuration should be quite easy to spot which is not the case for targetted reclaim calls which do not leave any footprints behind. Existing interfaces are trying to not expose internal implementation details as much as well. You are proposing a very targeted interface to fine control the memory reclaim. There is a risk that userspace will start depending on a specific reclaim implementation/behavior and future changes would be prone to regressions in workloads relying on that. So effectively, any user space memory reclaimer would need to be tuned to a specific implementation of the memory reclaim. My past experience tells me that this is not a great thing for maintainability of neither kernel nor the userspace part. All that being said, we really should consider whether the proposed interface is trying to work around existing limitations in the reclaim or the interface. If this is the former then I do not think we should be adding it. If the later then we should discuss on how to improve our existing interfaces (or their implementations) to be better usable and allow your usecase to work better. What is your take on that Johannes?
On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote: > > On Mon, Sep 21, 2020 at 9:30 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > > > Use cases: > > > > ---------- > > > > > > > > 1) Per-memcg uswapd: > > > > > > > > Usually applications consists of combination of latency sensitive and > > > > latency tolerant tasks. For example, tasks serving user requests vs > > > > tasks doing data backup for a database application. At the moment the > > > > kernel does not differentiate between such tasks when the application > > > > hits the memcg limits. So, potentially a latency sensitive user facing > > > > task can get stuck in high reclaim and be throttled by the kernel. > > > > > > > > Similarly there are cases of single process applications having two set > > > > of thread pools where threads from one pool have high scheduling > > > > priority and low latency requirement. One concrete example from our > > > > production is the VMM which have high priority low latency thread pool > > > > for the VCPUs while separate thread pool for stats reporting, I/O > > > > emulation, health checks and other managerial operations. The kernel > > > > memory reclaim does not differentiate between VCPU thread or a > > > > non-latency sensitive thread and a VCPU thread can get stuck in high > > > > reclaim. > > > > > > As those are presumably in the same cgroup what does prevent them to get > > > stuck behind shared resources with taken during the reclaim performed by > > > somebody else? I mean, memory reclaim might drop memory used by the high > > > priority task. Or they might simply stumble over same locks. > > > > > > > Yes there are a lot of challenges in providing isolation between > > latency sensitive and latency tolerant jobs/threads. This proposal > > aims to solve one specific challenge memcg limit reclaim. > > I am fully aware that a complete isolation is hard to achieve. I am just > trying evaluate how is this specific usecase worth a new interface that > we will have to maintain for ever. Especially when I suspect that the > interface will likely only paper over immediate problems rather than > offer a long term maintainable solution for it. > I think you are getting too focused on the uswapd use-case only. The proposed interface enables the proactive reclaim as well which we actually use in production. > > > I am also more interested in actual numbers here. The high limit reclaim > > > is normally swift and should be mostly unnoticeable. If the reclaim gets > > > more expensive then it can get really noticeable for sure. But for the > > > later the same can happen with the external pro-activee reclaimer as > > > > I think you meant 'uswapd' here instead of pro-active reclaimer. > > > > > well, right? So there is no real "guarantee". Do you have any numbers > > > from your workloads where you can demonstrate that the external reclaim > > > has saved you this amount of effective cpu time of the sensitive > > > workload? (Essentially measure how much time it has to consume in the > > > high limit reclaim) > > > > > > > What we actually use in our production is the 'proactive reclaim' > > which I have explained in the original message but I will add a couple > > more sentences below. 
> > > > For the uswapd use-case, let me point to the previous discussions and > > feature requests by others [1, 2]. One of the limiting factors of > > these previous proposals was the lack of CPU accounting of the > > background reclaimer which the current proposal solves by enabling the > > user space solution. > > > > [1] https://lwn.net/Articles/753162/ > > [2] http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org > > I remember those. My understanding was that the only problem is to > properly account for CPU on behalf of the reclaimed cgroup and that has > been work in progress for that. > > Outsourcing all that to userspace surely sounds like an attractive > option but it comes with usual user API price. More on that later. > > > Let me add one more point. Even if the high limit reclaim is swift, it > > can still take 100s of usecs. Most of our jobs are anon-only and we > > use zswap. Compressing a page can take a couple usec, so 100s of usecs > > in limit reclaim is normal. For latency sensitive jobs, this amount of > > hiccups do matters. > > Understood. But isn't this an implementation detail of zswap? Can it > offload some of the heavy lifting to a different context and reduce the > general overhead? > Are you saying doing the compression asynchronously? Similar to how the disk-based swap triggers the writeback and puts the page back to LRU, so the next time reclaim sees it, it will be instantly reclaimed? Or send the batch of pages to be compressed to a different CPU and wait for the completion? BTW the proactive reclaim naturally offloads that to a different context. > > For the proactive reclaim, based on the refault medium, we define > > tolerable refault rate of the applications. Then we proactively > > reclaim memory from the applications and monitor the refault rates. > > Based on the refault rates, the memory overcommit manager controls the > > aggressiveness of the proactive reclaim. > > > > This is exactly what we do in the production. Please let me know if > > you want to know why we do proactive reclaim in the first place. > > This information is definitely useful and having it in the changelog > would be useful. IIUC the only reason why you cannot use high limit > to control this pro-active reclaim is the potential throttling due to > expensive reclaim, correct? > Yes. > > > To the feature itself, I am not yet convinced we want to have a feature > > > like that. It surely sounds easy to use and attractive for a better user > > > space control. It is also much well defined than drop_caches/force_empty > > > because it is not all or nothing. But it also sounds like something too > > > easy to use incorrectly (remember drop_caches). I am also a bit worried > > > about corner cases wich would be easier to hit - e.g. fill up the swap > > > limit and turn anonymous memory into unreclaimable and who knows what > > > else. > > > > The corner cases you are worried about are already possible with the > > existing interfaces. We can already do all such things with > > memory.high interface but with some limitations. This new interface > > resolves that limitation as explained in the original email. > > You are right that misconfigured limits can result in problems. But such > a configuration should be quite easy to spot which is not the case for > targetted reclaim calls which do not leave any footprints behind. > Existing interfaces are trying to not expose internal implementation > details as much as well. 
You are proposing a very targeted interface to > fine control the memory reclaim. There is a risk that userspace will > start depending on a specific reclaim implementation/behavior and future > changes would be prone to regressions in workloads relying on that. So > effectively, any user space memory reclaimer would need to be tuned to a > specific implementation of the memory reclaim. I don't see the exposure of internal memory reclaim implementation. The interface is very simple. Reclaim a given amount of memory. Either the kernel will reclaim less memory or it will over reclaim. In case of reclaiming less memory, the user space can retry given there is enough reclaimable memory. For the over reclaim case, the user space will backoff for a longer time. How are the internal reclaim implementation details exposed? > My past experience tells > me that this is not a great thing for maintainability of neither kernel > nor the userspace part. > > All that being said, we really should consider whether the proposed > interface is trying to work around existing limitations in the reclaim > or the interface. If this is the former then I do not think we should be > adding it. If the later then we should discuss on how to improve our > existing interfaces (or their implementations) to be better usable and > allow your usecase to work better. It is the limitation of the interface. My concern is in fixing the interface we might convolute the memory.high interface making it more burden to maintain than simply adding a new interface. > > What is your take on that Johannes? > -- > Michal Hocko > SUSE Labs
On Tue 22-09-20 08:54:25, Shakeel Butt wrote: > On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote: [...] > > > Let me add one more point. Even if the high limit reclaim is swift, it > > > can still take 100s of usecs. Most of our jobs are anon-only and we > > > use zswap. Compressing a page can take a couple usec, so 100s of usecs > > > in limit reclaim is normal. For latency sensitive jobs, this amount of > > > hiccups do matters. > > > > Understood. But isn't this an implementation detail of zswap? Can it > > offload some of the heavy lifting to a different context and reduce the > > general overhead? > > > > Are you saying doing the compression asynchronously? Similar to how > the disk-based swap triggers the writeback and puts the page back to > LRU, so the next time reclaim sees it, it will be instantly reclaimed? > Or send the batch of pages to be compressed to a different CPU and > wait for the completion? Yes. [...] > > You are right that misconfigured limits can result in problems. But such > > a configuration should be quite easy to spot which is not the case for > > targetted reclaim calls which do not leave any footprints behind. > > Existing interfaces are trying to not expose internal implementation > > details as much as well. You are proposing a very targeted interface to > > fine control the memory reclaim. There is a risk that userspace will > > start depending on a specific reclaim implementation/behavior and future > > changes would be prone to regressions in workloads relying on that. So > > effectively, any user space memory reclaimer would need to be tuned to a > > specific implementation of the memory reclaim. > > I don't see the exposure of internal memory reclaim implementation. > The interface is very simple. Reclaim a given amount of memory. Either > the kernel will reclaim less memory or it will over reclaim. In case > of reclaiming less memory, the user space can retry given there is > enough reclaimable memory. For the over reclaim case, the user space > will backoff for a longer time. How are the internal reclaim > implementation details exposed? In an ideal world yes. A feedback mechanism will be independent on the particular implementation. But the reality tends to disagree quite often. Once we provide a tool there will be users using it to the best of their knowlege. Very often as a hammer. This is what the history of kernel regressions and "we have to revert an obvious fix because userspace depends on an undocumented behavior which happened to work for some time" has thought us in a hard way. I really do not want to deal with reports where a new heuristic in the memory reclaim will break something just because the reclaim takes slightly longer or over/under reclaims differently so the existing assumptions break and the overall balancing from userspace breaks. This might be a shiny exception of course. And please note that I am not saying that the interface is completely wrong or unacceptable. I just want to be absolutely sure we cannot move forward with the existing API space that we have. So far I have learned that you are primarily working around an implementation detail in the zswap which is doing the swapout path directly in the pageout path. That sounds like a very bad reason to add a new interface. 
You are right that there are likely other use cases that would like this new interface - mostly to emulate drop_caches - but I believe those are quite misguided as well and we should work harder to help them use the existing APIs. Last but not least, the memcg background reclaim is something that should be possible without a new interface.
On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 08:54:25, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote: > [...] > > > > Let me add one more point. Even if the high limit reclaim is swift, it > > > > can still take 100s of usecs. Most of our jobs are anon-only and we > > > > use zswap. Compressing a page can take a couple usec, so 100s of usecs > > > > in limit reclaim is normal. For latency sensitive jobs, this amount of > > > > hiccups do matters. > > > > > > Understood. But isn't this an implementation detail of zswap? Can it > > > offload some of the heavy lifting to a different context and reduce the > > > general overhead? > > > > > > > Are you saying doing the compression asynchronously? Similar to how > > the disk-based swap triggers the writeback and puts the page back to > > LRU, so the next time reclaim sees it, it will be instantly reclaimed? > > Or send the batch of pages to be compressed to a different CPU and > > wait for the completion? > > Yes. > Adding Minchan, if he has more experience/opinion on async swap on zram/zswap. > [...] > > > > You are right that misconfigured limits can result in problems. But such > > > a configuration should be quite easy to spot which is not the case for > > > targetted reclaim calls which do not leave any footprints behind. > > > Existing interfaces are trying to not expose internal implementation > > > details as much as well. You are proposing a very targeted interface to > > > fine control the memory reclaim. There is a risk that userspace will > > > start depending on a specific reclaim implementation/behavior and future > > > changes would be prone to regressions in workloads relying on that. So > > > effectively, any user space memory reclaimer would need to be tuned to a > > > specific implementation of the memory reclaim. > > > > I don't see the exposure of internal memory reclaim implementation. > > The interface is very simple. Reclaim a given amount of memory. Either > > the kernel will reclaim less memory or it will over reclaim. In case > > of reclaiming less memory, the user space can retry given there is > > enough reclaimable memory. For the over reclaim case, the user space > > will backoff for a longer time. How are the internal reclaim > > implementation details exposed? > > In an ideal world yes. A feedback mechanism will be independent on the > particular implementation. But the reality tends to disagree quite > often. Once we provide a tool there will be users using it to the best > of their knowlege. Very often as a hammer. This is what the history of > kernel regressions and "we have to revert an obvious fix because > userspace depends on an undocumented behavior which happened to work for > some time" has thought us in a hard way. > > I really do not want to deal with reports where a new heuristic in the > memory reclaim will break something just because the reclaim takes > slightly longer or over/under reclaims differently so the existing > assumptions break and the overall balancing from userspace breaks. > > This might be a shiny exception of course. And please note that I am not > saying that the interface is completely wrong or unacceptable. I just > want to be absolutely sure we cannot move forward with the existing API > space that we have. 
> > So far I have learned that you are primarily working around an > implementation detail in the zswap which is doing the swapout path > directly in the pageout path. Wait how did you reach this conclusion? I have explicitly said that we are not using uswapd like functionality in production. We are using this interface for proactive reclaim and proactive reclaim is not a workaround for implementation detail in the zswap. > That sounds like a very bad reason to add > a new interface. You are right that there are likely other usecases to > like this new interface - mostly to emulate drop_caches - but I believe > those are quite misguided as well and we should work harder to help > them out to use the existing APIs. I am not really understanding your concern specific for the new API. All of your concerns (user expectation of reclaim time or over/under reclaim) are still possible with the existing API i.e. memory.high. > Last but not least the memcg > background reclaim is something that should be possible without a new > interface. So, it comes down to adding more functionality/semantics to memory.high or introducing a new simple interface. I am fine with either of one but IMO convoluted memory.high might have a higher maintenance cost. I can send the patch to add the functionality in the memory.high but I would like to get Johannes's opinion first. Shakeel
On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > So far I have learned that you are primarily working around an > > implementation detail in the zswap which is doing the swapout path > > directly in the pageout path. > > Wait how did you reach this conclusion? I have explicitly said that we > are not using uswapd like functionality in production. We are using > this interface for proactive reclaim and proactive reclaim is not a > workaround for implementation detail in the zswap. Hmm, I must have missed the distinction between the two you have mentioned. Correct me if I am wrong but "latency sensitive" workload is the one that cannot use the high limit, right. For some reason I thought that your pro-active reclaim usecase is also not compatible with the throttling imposed by the high limit. Hence my conclusion above.
On Tue, Sep 22, 2020 at 11:31 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > So far I have learned that you are primarily working around an > > > implementation detail in the zswap which is doing the swapout path > > > directly in the pageout path. > > > > Wait how did you reach this conclusion? I have explicitly said that we > > are not using uswapd like functionality in production. We are using > > this interface for proactive reclaim and proactive reclaim is not a > > workaround for implementation detail in the zswap. > > Hmm, I must have missed the distinction between the two you have > mentioned. Correct me if I am wrong but "latency sensitive" workload is > the one that cannot use the high limit, right. Yes. > For some reason I thought > that your pro-active reclaim usecase is also not compatible with the > throttling imposed by the high limit. Hence my conclusion above. > For proactive reclaim use-case, it is more about the weirdness of using memory.high interface for proactive reclaim. Let's suppose I want to reclaim 20 MiB from a job. To use memory.high, I have to read memory.current and subtract 20MiB from it and then write that to memory.high and once that is done, I have to set memory.high to the previous value (job's original high limit). There is a time window where the allocation of the target job can hit the temporary memory.high which will cause uninteresting MEMCG_HIGH event, PSI pressure and can potentially over reclaim. Also there is a race between reading memory.current and setting the temporary memory.high. There are many non-deterministic variables added to the request of reclaiming 20MiB from a job. Shakeel
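Spelled out in code, the memory.high sequence just described looks roughly like the following sketch (the path and the 20 MiB figure are illustrative, and parsing a literal "max" from memory.high is ignored for brevity). The point is the read-modify-restore window compared to the single write at the end:

```c
#include <stdio.h>

#define MEMCG "/sys/fs/cgroup/job"	/* assumed cgroup path */

static unsigned long read_ul(const char *path)
{
	unsigned long v = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		fscanf(f, "%lu", &v);	/* a literal "max" is not handled here */
		fclose(f);
	}
	return v;
}

static void write_ul(const char *path, unsigned long v)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fprintf(f, "%lu", v);
		fclose(f);
	}
}

int main(void)
{
	unsigned long want = 20UL << 20;			/* reclaim 20 MiB */
	unsigned long cur  = read_ul(MEMCG "/memory.current");	/* racy snapshot */
	unsigned long orig = read_ul(MEMCG "/memory.high");

	/*
	 * Temporary limit: allocations in this window can hit memory.high,
	 * causing MEMCG_HIGH events, PSI pressure and possible over-reclaim.
	 */
	write_ul(MEMCG "/memory.high", cur - want);
	write_ul(MEMCG "/memory.high", orig);			/* restore */

	/* With the proposed interface the same request is a single write. */
	write_ul(MEMCG "/memory.reclaim", want);
	return 0;
}
```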
On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > Last but not least the memcg > > background reclaim is something that should be possible without a new > > interface. > > So, it comes down to adding more functionality/semantics to > memory.high or introducing a new simple interface. I am fine with > either of one but IMO convoluted memory.high might have a higher > maintenance cost. One idea would be to schedule a background worker (which work on behalf on the memcg) to do the high limit reclaim with high limit target as soon as the high limit is reached. There would be one work item for each memcg. Userspace would recheck the high limit on return to the userspace and do the reclaim if the excess is larger than a threshold, and sleep as the fallback. Excessive consumers would get throttled if the background work cannot keep up with the charge pace and most of them would return without doing any reclaim because there is somebody working on their behalf - and is accounted for that. The semantic of high limit would be preserved IMHO because high limit is actively throttled. Where that work is done shouldn't matter as long as it is accounted properly and memcg cannot outsource all the work to the rest of the system. Would something like that (with many details to be sorted out of course) be feasible? If we do not want to change the existing semantic of high and want a new api then I think having another limit for the background reclaim then that would make more sense to me. It would resemble the global reclaim and kswapd model and something that would be easier to reason about. Comparing to echo $N > reclaim which might mean to reclaim any number pages around N.
On Tue, Sep 22, 2020 at 12:09 PM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > Last but not least the memcg > > > background reclaim is something that should be possible without a new > > > interface. > > > > So, it comes down to adding more functionality/semantics to > > memory.high or introducing a new simple interface. I am fine with > > either of one but IMO convoluted memory.high might have a higher > > maintenance cost. > > One idea would be to schedule a background worker (which work on behalf > on the memcg) to do the high limit reclaim with high limit target as > soon as the high limit is reached. There would be one work item for each > memcg. Userspace would recheck the high limit on return to the userspace > and do the reclaim if the excess is larger than a threshold, and sleep > as the fallback. > > Excessive consumers would get throttled if the background work cannot > keep up with the charge pace and most of them would return without doing > any reclaim because there is somebody working on their behalf - and is > accounted for that. > > The semantic of high limit would be preserved IMHO because high limit is > actively throttled. Where that work is done shouldn't matter as long as > it is accounted properly and memcg cannot outsource all the work to the > rest of the system. > > Would something like that (with many details to be sorted out of course) > be feasible? This is exactly how our "per-memcg kswapd" works. The missing piece is how to account the background worker (it is a kernel work thread) properly as what we discussed before. You mentioned such work is WIP in earlier email of this thread, I think once this is done the per-memcg background worker could be supported easily. > > If we do not want to change the existing semantic of high and want a new > api then I think having another limit for the background reclaim then > that would make more sense to me. It would resemble the global reclaim > and kswapd model and something that would be easier to reason about. > Comparing to echo $N > reclaim which might mean to reclaim any number > pages around N. > -- > Michal Hocko > SUSE Labs
On Tue, Sep 22, 2020 at 12:09 PM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > Last but not least the memcg > > > background reclaim is something that should be possible without a new > > > interface. > > > > So, it comes down to adding more functionality/semantics to > > memory.high or introducing a new simple interface. I am fine with > > either of one but IMO convoluted memory.high might have a higher > > maintenance cost. > > One idea would be to schedule a background worker (which work on behalf > on the memcg) to do the high limit reclaim with high limit target as > soon as the high limit is reached. There would be one work item for each > memcg. Userspace would recheck the high limit on return to the userspace > and do the reclaim if the excess is larger than a threshold, and sleep > as the fallback. > > Excessive consumers would get throttled if the background work cannot > keep up with the charge pace and most of them would return without doing > any reclaim because there is somebody working on their behalf - and is > accounted for that. > > The semantic of high limit would be preserved IMHO because high limit is > actively throttled. Where that work is done shouldn't matter as long as > it is accounted properly and memcg cannot outsource all the work to the > rest of the system. > > Would something like that (with many details to be sorted out of course) > be feasible? > Well what about the proactive reclaim use-case? You are targeting only uswapd/background-reclaim use-case. > If we do not want to change the existing semantic of high and want a new > api then I think having another limit for the background reclaim then > that would make more sense to me. It would resemble the global reclaim > and kswapd model and something that would be easier to reason about. > Comparing to echo $N > reclaim which might mean to reclaim any number > pages around N. > -- I am not really against the approach you are proposing but "echo $N > reclaim" allows more flexibility and enables more use-cases.
Hello, I apologize for the late reply. The proposed interface has been an ongoing topic and area of experimentation within Facebook as well, which makes it a bit difficult to respond with certainty here. I agree with both your usecases. They apply to us as well. We currently make two small changes to our kernel to solve them. They work okay-ish in our production environment, but they aren't quite there yet, and not ready for upstream. Some thoughts and comments below. On Wed, Sep 09, 2020 at 02:57:52PM -0700, Shakeel Butt wrote: > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use cases: > ---------- > > 1) Per-memcg uswapd: > > Usually applications consists of combination of latency sensitive and > latency tolerant tasks. For example, tasks serving user requests vs > tasks doing data backup for a database application. At the moment the > kernel does not differentiate between such tasks when the application > hits the memcg limits. So, potentially a latency sensitive user facing > task can get stuck in high reclaim and be throttled by the kernel. > > Similarly there are cases of single process applications having two set > of thread pools where threads from one pool have high scheduling > priority and low latency requirement. One concrete example from our > production is the VMM which have high priority low latency thread pool > for the VCPUs while separate thread pool for stats reporting, I/O > emulation, health checks and other managerial operations. The kernel > memory reclaim does not differentiate between VCPU thread or a > non-latency sensitive thread and a VCPU thread can get stuck in high > reclaim. > > One way to resolve this issue is to preemptively trigger the memory > reclaim from a latency tolerant task (uswapd) when the application is > near the limits. Finding 'near the limits' situation is an orthogonal > problem. I don't think a userspace implementation is suitable for this purpose. Kswapd-style background reclaim is beneficial to probably 99% of all workloads. Because doing reclaim inside the execution stream of the workload itself is so unnecessary in a multi-CPU environment, whether the workload is particularly latency sensitive or only cares about overall throughput. In most cases, spare cores are available to do this work concurrently, and the buffer memory required between the workload and the async reclaimer tends to be negligible. Requiring non-trivial userspace participation for such a basic optimization does not seem like a good idea to me. We'd probably end up with four or five hyperscalers having four or five different implementations, and not much user coverage beyond that. I floated this patch before: https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ It's blocked on infrastructure work in the CPU controller that allows accounting CPU cycles spent on behalf of other cgroups. But we need this functionality in other places as well - network, async filesystem encryption, various other stuff bounced to workers. > 2) Proactive reclaim: > > This is a similar to the previous use-case, the difference is instead of > waiting for the application to be near its limit to trigger memory > reclaim, continuously pressuring the memcg to reclaim a small amount of > memory. This gives more accurate and uptodate workingset estimation as > the LRUs are continuously sorted and can potentially provide more > deterministic memory overcommit behavior. 
The memory overcommit > controller can provide more proactive response to the changing behavior > of the running applications instead of being reactive. This is an important usecase for us as well. And we use it not just to keep the LRUs warm, but to actively sample the workingset size - the true amount of memory required, trimmed of all its unused cache and cold pages that can be read back from disk on demand. For this purpose, we're essentially using memory.high right now. The only modification we make here is adding a memory.high.tmp variant that takes a timeout argument in addition to the limit. This ensures we don't leave an unsafe limit behind if the userspace daemon crashes. We have experienced some of the problems you describe below with it. > Benefit of user space solution: > ------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to centralized the > overhead while for uswapd, it makes more sense for the application to > pay for the cpu of the memory reclaim. Agreed on both counts. > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. This could use some elaboration, I think. > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. That's a valid argument for proactive reclaim, and I agree with it. But not necessarily an argument for which part of the proactive reclaim logic should be in-kernel and which should be in userspace. > Questions: > ---------- > > 1) Why memory.high is not enough? > > memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim as well as uswapd use cases. > However there is a big negative in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. That's something we have run into as well. Async memory.high reclaim helps, but when proactive reclaim does bigger steps and lowers the limit below the async reclaim buffer, the workload can still enter direct reclaim. This is undesirable. > Another issue with memory.high is that it is not delegatable. To > actually use this interface for uswapd, the application has to introduce > another layer of cgroup on whose memory.high it has write access. Fair enough. I would generalize that and say that limiting the maximum container size and driving proactive reclaim are separate jobs, with separate goals, happening at different layers of the system. Sharing a single control knob for that can be a coordination nightmare. > 2) Why uswapd safe from self induced reclaim? > > This is very similar to the scenario of oomd under global memory > pressure. We can use the similar mechanisms to protect uswapd from self > induced reclaim i.e. memory.min and mlock. Agreed. > Interface options: > ------------------ > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. This gets the assigning and attribution of targeted reclaim work right (although it doesn't solve kswapd cycle attribution yet). However, it also ditches the limit semantics, which themselves aren't actually a problem. 
And indeed I would argue have some upsides. As per above, the kernel knows best when and how much to reclaim to match the allocation rate, since it's in control of the allocation path. To do proactive reclaim with the memory.reclaim interface, you would need to monitor memory consumption closely. Workloads may not allocate anything for hours, and then suddenly allocate gigabytes within seconds. A sudden onset of streaming reads through the filesystem could destroy the workingset measurements, whereas a limit would catch it and do drop-behind (and thus workingset sampling) at the exact rate of allocations. Again I believe something that may be doable as a hyperscale operator, but likely too fragile to get wider applications beyond that. My take is that a proactive reclaim feature, whose goal is never to thrash or punish but to keep the LRUs warm and the workingset trimmed, would ideally have: - a pressure or size target specified by userspace but with enforcement driven inside the kernel from the allocation path - the enforcement work NOT be done synchronously by the workload (something I'd argue we want for *all* memory limits) - the enforcement work ACCOUNTED to the cgroup, though, since it's the cgroup's memory allocations causing the work (again something I'd argue we want in general) - a delegatable knob that is independent of setting the maximum size of a container, as that expresses a different type of policy - if size target, self-limiting (ha) enforcement on a pressure threshold or stop enforcement when the userspace component dies Thoughts?
On Mon 28-09-20 17:02:16, Johannes Weiner wrote: [...] > My take is that a proactive reclaim feature, whose goal is never to > thrash or punish but to keep the LRUs warm and the workingset trimmed, > would ideally have: > > - a pressure or size target specified by userspace but with > enforcement driven inside the kernel from the allocation path > > - the enforcement work NOT be done synchronously by the workload > (something I'd argue we want for *all* memory limits) > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > cgroup's memory allocations causing the work (again something I'd > argue we want in general) > > - a delegatable knob that is independent of setting the maximum size > of a container, as that expresses a different type of policy > > - if size target, self-limiting (ha) enforcement on a pressure > threshold or stop enforcement when the userspace component dies > > Thoughts? Agreed with above points. What do you think about http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. I assume that you do not want to override memory.high to implement this because that tends to be tricky from the configuration POV as you mentioned above. But a new limit (memory.middle for a lack of a better name) to define the background reclaim sounds like a good fit with above points.
On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote: > On Mon 28-09-20 17:02:16, Johannes Weiner wrote: > [...] > > My take is that a proactive reclaim feature, whose goal is never to > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > would ideally have: > > > > - a pressure or size target specified by userspace but with > > enforcement driven inside the kernel from the allocation path > > > > - the enforcement work NOT be done synchronously by the workload > > (something I'd argue we want for *all* memory limits) > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > cgroup's memory allocations causing the work (again something I'd > > argue we want in general) > > > > - a delegatable knob that is independent of setting the maximum size > > of a container, as that expresses a different type of policy > > > > - if size target, self-limiting (ha) enforcement on a pressure > > threshold or stop enforcement when the userspace component dies > > > > Thoughts? > > Agreed with above points. What do you think about > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. I definitely agree with what you wrote in this email for background reclaim. Indeed, your description sounds like what I proposed in https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ - what's missing from that patch is proper work attribution. > I assume that you do not want to override memory.high to implement > this because that tends to be tricky from the configuration POV as > you mentioned above. But a new limit (memory.middle for a lack of a > better name) to define the background reclaim sounds like a good fit > with above points. I can see that with a new memory.middle you could kind of sort of do both - background reclaim and proactive reclaim. That said, I do see advantages in keeping them separate: 1. Background reclaim is essentially an allocation optimization that we may want to provide per default, just like kswapd. Kswapd is tweakable of course, but I think actually few users do, and it works pretty well out of the box. It would be nice to provide the same thing on a per-cgroup basis per default and not ask users to make decisions that we are generally better at making. 2. Proactive reclaim may actually be better configured through a pressure threshold rather than a size target. As per above, the goal is not to be punitive or containing. The goal is to keep the LRUs warm and move the colder pages to disk. But how aggressively do you run reclaim for this purpose? What target value should a user write to such a memory.middle file? For one, it depends on the job. A batch job, or a less important background job, may tolerate higher paging overhead than an interactive job. That means more of its pages could be trimmed from RAM and reloaded on-demand from disk. But also, it depends on the storage device. If you move a workload from a machine with a slow disk to a machine with a fast disk, you can page more data in the same amount of time. That means while your workload tolerances stays the same, the faster the disk, the more aggressively you can do reclaim and offload memory. So again, what should a user write to such a control file? Of course, you can approximate an optimal target size for the workload. 
You can run a manual workingset analysis with page_idle, damon, or similar, determine a hot/cold cutoff based on what you know about the storage characteristics, then echo a number of pages or a size target into a cgroup file and let kernel do the reclaim accordingly. The drawbacks are that the kernel LRU may do a different hot/cold classification than you did and evict the wrong pages, the storage device latencies may vary based on overall IO pattern, and two equally warm pages may have very different paging overhead depending on whether readahead can avert a major fault or not. So it's easy to overshoot the tolerance target and disrupt the workload, or undershoot and have stale LRU data, waste memory etc. You can also do a feedback loop, where you guess an optimal size, then adjust based on how much paging overhead the workload is experiencing, i.e. memory pressure. The drawbacks are that you have to monitor pressure closely and react quickly when the workload is expanding, as it can be potentially sensitive to latencies in the usec range. This can be tricky to do from userspace. So instead of asking users for a target size whose suitability heavily depends on the kernel's LRU implementation, the readahead code, the IO device's capability and general load, why not directly ask the user for a pressure level that the workload is comfortable with and which captures all of the above factors implicitly? Then let the kernel do this feedback loop from a per-cgroup worker.
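The userspace feedback loop mentioned above, in its simplest form, looks something like the sketch below. It is only an illustration: the cgroup path, tolerance value and step sizes are invented, and the reclaim request assumes the memory.reclaim interface proposed in this series (driving a temporary memory.high instead would follow the same shape). The coarse polling interval is precisely the reaction-time problem described above, which is the argument for running this loop in the kernel.

    import time

    CG = "/sys/fs/cgroup/job"               # hypothetical cgroup path
    TOLERANCE = 1.0                         # hypothetical target: 1% "some" memory stall (avg10)

    def psi_some_avg10():
        # First line of memory.pressure: some avg10=0.12 avg60=0.08 avg300=0.02 total=...
        with open(f"{CG}/memory.pressure") as f:
            fields = f.readline().split()[1:]
        return float(dict(kv.split("=") for kv in fields)["avg10"])

    step = 16 << 20                         # initial guess: trim 16M per iteration
    while True:
        if psi_some_avg10() < TOLERANCE:
            step = min(step * 2, 256 << 20)     # workload is comfortable: trim harder
            with open(f"{CG}/memory.reclaim", "w") as f:
                f.write(str(step))
        else:
            step = max(step // 2, 4 << 20)      # paging overhead too high: back off
        time.sleep(10)                      # too slow to catch a rapidly expanding workload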
Hi Johannes, On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Hello, > > I apologize for the late reply. The proposed interface has been an > ongoing topic and area of experimentation within Facebook as well, > which makes it a bit difficult to respond with certainty here. > > I agree with both your usecases. They apply to us as well. We > currently make two small changes to our kernel to solve them. They > work okay-ish in our production environment, but they aren't quite > there yet, and not ready for upstream. > > Some thoughts and comments below. > > On Wed, Sep 09, 2020 at 02:57:52PM -0700, Shakeel Butt wrote: > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use cases: > > ---------- > > > > 1) Per-memcg uswapd: > > > > Usually applications consists of combination of latency sensitive and > > latency tolerant tasks. For example, tasks serving user requests vs > > tasks doing data backup for a database application. At the moment the > > kernel does not differentiate between such tasks when the application > > hits the memcg limits. So, potentially a latency sensitive user facing > > task can get stuck in high reclaim and be throttled by the kernel. > > > > Similarly there are cases of single process applications having two set > > of thread pools where threads from one pool have high scheduling > > priority and low latency requirement. One concrete example from our > > production is the VMM which have high priority low latency thread pool > > for the VCPUs while separate thread pool for stats reporting, I/O > > emulation, health checks and other managerial operations. The kernel > > memory reclaim does not differentiate between VCPU thread or a > > non-latency sensitive thread and a VCPU thread can get stuck in high > > reclaim. > > > > One way to resolve this issue is to preemptively trigger the memory > > reclaim from a latency tolerant task (uswapd) when the application is > > near the limits. Finding 'near the limits' situation is an orthogonal > > problem. > > I don't think a userspace implementation is suitable for this purpose. > > Kswapd-style background reclaim is beneficial to probably 99% of all > workloads. Because doing reclaim inside the execution stream of the > workload itself is so unnecessary in a multi-CPU environment, whether > the workload is particularly latency sensitive or only cares about > overall throughput. In most cases, spare cores are available to do > this work concurrently, and the buffer memory required between the > workload and the async reclaimer tends to be negligible. > > Requiring non-trivial userspace participation for such a basic > optimization does not seem like a good idea to me. We'd probably end > up with four or five hyperscalers having four or five different > implementations, and not much user coverage beyond that. > I understand your point and having an out of the box kernel-based solution would be more helpful for the users. > I floated this patch before: > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ > > It's blocked on infrastructure work in the CPU controller that allows > accounting CPU cycles spent on behalf of other cgroups. But we need > this functionality in other places as well - network, async filesystem > encryption, various other stuff bounced to workers. 
> > > 2) Proactive reclaim: > > > > This is a similar to the previous use-case, the difference is instead of > > waiting for the application to be near its limit to trigger memory > > reclaim, continuously pressuring the memcg to reclaim a small amount of > > memory. This gives more accurate and uptodate workingset estimation as > > the LRUs are continuously sorted and can potentially provide more > > deterministic memory overcommit behavior. The memory overcommit > > controller can provide more proactive response to the changing behavior > > of the running applications instead of being reactive. > > This is an important usecase for us as well. And we use it not just to > keep the LRUs warm, but to actively sample the workingset size - the > true amount of memory required, trimmed of all its unused cache and > cold pages that can be read back from disk on demand. > > For this purpose, we're essentially using memory.high right now. > > The only modification we make here is adding a memory.high.tmp variant > that takes a timeout argument in addition to the limit. This ensures > we don't leave an unsafe limit behind if the userspace daemon crashes. > > We have experienced some of the problems you describe below with it. > > > Benefit of user space solution: > > ------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to centralized the > > overhead while for uswapd, it makes more sense for the application to > > pay for the cpu of the memory reclaim. > > Agreed on both counts. > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > This could use some elaboration, I think. > My point was from the resource planning perspective i.e. have flexibility on the amount of resources (CPU) to dedicate for proactive reclaim. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. > > That's a valid argument for proactive reclaim, and I agree with > it. But not necessarily an argument for which part of the proactive > reclaim logic should be in-kernel and which should be in userspace. > > > Questions: > > ---------- > > > > 1) Why memory.high is not enough? > > > > memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim as well as uswapd use cases. > > However there is a big negative in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > That's something we have run into as well. Async memory.high reclaim > helps, but when proactive reclaim does bigger steps and lowers the > limit below the async reclaim buffer, the workload can still enter > direct reclaim. This is undesirable. > > > Another issue with memory.high is that it is not delegatable. To > > actually use this interface for uswapd, the application has to introduce > > another layer of cgroup on whose memory.high it has write access. > > Fair enough. 
> > I would generalize that and say that limiting the maximum container > size and driving proactive reclaim are separate jobs, with separate > goals, happening at different layers of the system. Sharing a single > control knob for that can be a coordination nightmare. > Agreed. > > 2) Why uswapd safe from self induced reclaim? > > > > This is very similar to the scenario of oomd under global memory > > pressure. We can use the similar mechanisms to protect uswapd from self > > induced reclaim i.e. memory.min and mlock. > > Agreed. > > > Interface options: > > ------------------ > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > This gets the assigning and attribution of targeted reclaim work right > (although it doesn't solve kswapd cycle attribution yet). > > However, it also ditches the limit semantics, which themselves aren't > actually a problem. And indeed I would argue have some upsides. > > As per above, the kernel knows best when and how much to reclaim to > match the allocation rate, since it's in control of the allocation > path. To do proactive reclaim with the memory.reclaim interface, you > would need to monitor memory consumption closely. To calculate the amount to reclaim with the memory.reclaim interface in production, we actually use two sources of information, refault rate and idle age histogram (extracted from a more efficient version of Page Idle Tracking). > Workloads may not > allocate anything for hours, and then suddenly allocate gigabytes > within seconds. A sudden onset of streaming reads through the > filesystem could destroy the workingset measurements, whereas a limit > would catch it and do drop-behind (and thus workingset sampling) at > the exact rate of allocations. > > Again I believe something that may be doable as a hyperscale operator, > but likely too fragile to get wider applications beyond that. > > My take is that a proactive reclaim feature, whose goal is never to > thrash or punish but to keep the LRUs warm and the workingset trimmed, > would ideally have: > > - a pressure or size target specified by userspace but with > enforcement driven inside the kernel from the allocation path > > - the enforcement work NOT be done synchronously by the workload > (something I'd argue we want for *all* memory limits) > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > cgroup's memory allocations causing the work (again something I'd > argue we want in general) For this point I think we want more flexibility to control the resources we want to dedicate for proactive reclaim. One particular example from our production is the batch jobs with high memory footprint. These jobs don't have enough CPU quota but we do want to proactively reclaim from them. We would prefer to dedicate some amount of CPU to proactively reclaim from them independent of their own CPU quota. > > - a delegatable knob that is independent of setting the maximum size > of a container, as that expresses a different type of policy > > - if size target, self-limiting (ha) enforcement on a pressure > threshold or stop enforcement when the userspace component dies > > Thoughts?
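Of the two signals mentioned above, the refault rate is the one that can be derived from standard cgroup v2 counters; the sketch below shows one way, with an invented cgroup path and sampling interval (the idle-age histogram side is omitted, since it relies on page idle tracking changes not described here).

    import time

    CG = "/sys/fs/cgroup/job"               # hypothetical cgroup path
    INTERVAL = 60                           # sampling window in seconds, chosen arbitrarily

    def refaults():
        # Sum the workingset_refault* counters from memory.stat (a single counter
        # on kernels of this era, split into anon/file variants later).
        total = 0
        with open(f"{CG}/memory.stat") as f:
            for line in f:
                key, value = line.split()
                if key.startswith("workingset_refault"):
                    total += int(value)
        return total

    prev = refaults()
    while True:
        time.sleep(INTERVAL)
        cur = refaults()
        rate = (cur - prev) / INTERVAL      # refaults per second over the window
        prev = cur
        # A high rate means the last reclaim step cut into the workingset;
        # a rate near zero suggests the next step can be more aggressive.
        print(f"refault rate: {rate:.1f}/s")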
On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote: > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote: > > [...] > > > My take is that a proactive reclaim feature, whose goal is never to > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > would ideally have: > > > > > > - a pressure or size target specified by userspace but with > > > enforcement driven inside the kernel from the allocation path > > > > > > - the enforcement work NOT be done synchronously by the workload > > > (something I'd argue we want for *all* memory limits) > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > cgroup's memory allocations causing the work (again something I'd > > > argue we want in general) > > > > > > - a delegatable knob that is independent of setting the maximum size > > > of a container, as that expresses a different type of policy > > > > > > - if size target, self-limiting (ha) enforcement on a pressure > > > threshold or stop enforcement when the userspace component dies > > > > > > Thoughts? > > > > Agreed with above points. What do you think about > > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. > > I definitely agree with what you wrote in this email for background > reclaim. Indeed, your description sounds like what I proposed in > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ > - what's missing from that patch is proper work attribution. > > > I assume that you do not want to override memory.high to implement > > this because that tends to be tricky from the configuration POV as > > you mentioned above. But a new limit (memory.middle for a lack of a > > better name) to define the background reclaim sounds like a good fit > > with above points. > > I can see that with a new memory.middle you could kind of sort of do > both - background reclaim and proactive reclaim. > > That said, I do see advantages in keeping them separate: > > 1. Background reclaim is essentially an allocation optimization that > we may want to provide per default, just like kswapd. > > Kswapd is tweakable of course, but I think actually few users do, > and it works pretty well out of the box. It would be nice to > provide the same thing on a per-cgroup basis per default and not > ask users to make decisions that we are generally better at making. > > 2. Proactive reclaim may actually be better configured through a > pressure threshold rather than a size target. > > As per above, the goal is not to be punitive or containing. The > goal is to keep the LRUs warm and move the colder pages to disk. > > But how aggressively do you run reclaim for this purpose? What > target value should a user write to such a memory.middle file? > > For one, it depends on the job. A batch job, or a less important > background job, may tolerate higher paging overhead than an > interactive job. That means more of its pages could be trimmed from > RAM and reloaded on-demand from disk. > > But also, it depends on the storage device. If you move a workload > from a machine with a slow disk to a machine with a fast disk, you > can page more data in the same amount of time. That means while > your workload tolerances stays the same, the faster the disk, the > more aggressively you can do reclaim and offload memory. > > So again, what should a user write to such a control file? 
> > Of course, you can approximate an optimal target size for the > workload. You can run a manual workingset analysis with page_idle, > damon, or similar, determine a hot/cold cutoff based on what you > know about the storage characteristics, then echo a number of pages > or a size target into a cgroup file and let kernel do the reclaim > accordingly. The drawbacks are that the kernel LRU may do a > different hot/cold classification than you did and evict the wrong > pages, the storage device latencies may vary based on overall IO > pattern, and two equally warm pages may have very different paging > overhead depending on whether readahead can avert a major fault or > not. So it's easy to overshoot the tolerance target and disrupt the > workload, or undershoot and have stale LRU data, waste memory etc. > > You can also do a feedback loop, where you guess an optimal size, > then adjust based on how much paging overhead the workload is > experiencing, i.e. memory pressure. The drawbacks are that you have > to monitor pressure closely and react quickly when the workload is > expanding, as it can be potentially sensitive to latencies in the > usec range. This can be tricky to do from userspace. > This is actually what we do in our production i.e. feedback loop to adjust the next iteration of proactive reclaim. We eliminated the IO or slow disk issues you mentioned by only focusing on anon memory and doing zswap. > So instead of asking users for a target size whose suitability > heavily depends on the kernel's LRU implementation, the readahead > code, the IO device's capability and general load, why not directly > ask the user for a pressure level that the workload is comfortable > with and which captures all of the above factors implicitly? Then > let the kernel do this feedback loop from a per-cgroup worker. I am assuming here by pressure level you are referring to the PSI like interface e.g. allowing the users to tell about their jobs that X amount of stalls in a fixed time window is tolerable. Seems promising though I would like flexibility for giving the resource to the per-cgroup worker. Are you planning to work on this or should I give it a try?
On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote: > On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote: > > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote: > > > [...] > > > > My take is that a proactive reclaim feature, whose goal is never to > > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > > would ideally have: > > > > > > > > - a pressure or size target specified by userspace but with > > > > enforcement driven inside the kernel from the allocation path > > > > > > > > - the enforcement work NOT be done synchronously by the workload > > > > (something I'd argue we want for *all* memory limits) > > > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > > cgroup's memory allocations causing the work (again something I'd > > > > argue we want in general) > > > > > > > > - a delegatable knob that is independent of setting the maximum size > > > > of a container, as that expresses a different type of policy > > > > > > > > - if size target, self-limiting (ha) enforcement on a pressure > > > > threshold or stop enforcement when the userspace component dies > > > > > > > > Thoughts? > > > > > > Agreed with above points. What do you think about > > > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. > > > > I definitely agree with what you wrote in this email for background > > reclaim. Indeed, your description sounds like what I proposed in > > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ > > - what's missing from that patch is proper work attribution. > > > > > I assume that you do not want to override memory.high to implement > > > this because that tends to be tricky from the configuration POV as > > > you mentioned above. But a new limit (memory.middle for a lack of a > > > better name) to define the background reclaim sounds like a good fit > > > with above points. > > > > I can see that with a new memory.middle you could kind of sort of do > > both - background reclaim and proactive reclaim. > > > > That said, I do see advantages in keeping them separate: > > > > 1. Background reclaim is essentially an allocation optimization that > > we may want to provide per default, just like kswapd. > > > > Kswapd is tweakable of course, but I think actually few users do, > > and it works pretty well out of the box. It would be nice to > > provide the same thing on a per-cgroup basis per default and not > > ask users to make decisions that we are generally better at making. > > > > 2. Proactive reclaim may actually be better configured through a > > pressure threshold rather than a size target. > > > > As per above, the goal is not to be punitive or containing. The > > goal is to keep the LRUs warm and move the colder pages to disk. > > > > But how aggressively do you run reclaim for this purpose? What > > target value should a user write to such a memory.middle file? > > > > For one, it depends on the job. A batch job, or a less important > > background job, may tolerate higher paging overhead than an > > interactive job. That means more of its pages could be trimmed from > > RAM and reloaded on-demand from disk. > > > > But also, it depends on the storage device. If you move a workload > > from a machine with a slow disk to a machine with a fast disk, you > > can page more data in the same amount of time. 
That means while > > your workload tolerances stays the same, the faster the disk, the > > more aggressively you can do reclaim and offload memory. > > > > So again, what should a user write to such a control file? > > > > Of course, you can approximate an optimal target size for the > > workload. You can run a manual workingset analysis with page_idle, > > damon, or similar, determine a hot/cold cutoff based on what you > > know about the storage characteristics, then echo a number of pages > > or a size target into a cgroup file and let kernel do the reclaim > > accordingly. The drawbacks are that the kernel LRU may do a > > different hot/cold classification than you did and evict the wrong > > pages, the storage device latencies may vary based on overall IO > > pattern, and two equally warm pages may have very different paging > > overhead depending on whether readahead can avert a major fault or > > not. So it's easy to overshoot the tolerance target and disrupt the > > workload, or undershoot and have stale LRU data, waste memory etc. > > > > You can also do a feedback loop, where you guess an optimal size, > > then adjust based on how much paging overhead the workload is > > experiencing, i.e. memory pressure. The drawbacks are that you have > > to monitor pressure closely and react quickly when the workload is > > expanding, as it can be potentially sensitive to latencies in the > > usec range. This can be tricky to do from userspace. > > > > This is actually what we do in our production i.e. feedback loop to > adjust the next iteration of proactive reclaim. That's what we do also right now. It works reasonably well, the only two pain points are/have been the reaction time under quick workload expansion and inadvertently forcing the workload into direct reclaim. > We eliminated the IO or slow disk issues you mentioned by only > focusing on anon memory and doing zswap. Interesting, may I ask how the file cache is managed in this setup? > > So instead of asking users for a target size whose suitability > > heavily depends on the kernel's LRU implementation, the readahead > > code, the IO device's capability and general load, why not directly > > ask the user for a pressure level that the workload is comfortable > > with and which captures all of the above factors implicitly? Then > > let the kernel do this feedback loop from a per-cgroup worker. > > I am assuming here by pressure level you are referring to the PSI like > interface e.g. allowing the users to tell about their jobs that X > amount of stalls in a fixed time window is tolerable. Right, essentially the same parameters that psi poll() would take.
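For reference, those parameters are what the existing PSI trigger interface already takes: a stall threshold and a time window, both in microseconds, written to the cgroup's memory.pressure file and then waited on with poll(). A minimal sketch, with a hypothetical cgroup path and made-up numbers:

    import select

    CG = "/sys/fs/cgroup/job"               # hypothetical cgroup path

    # "X amount of stall in a fixed time window": here, 150ms of "some"
    # memory stall time within any 1s window, both given in microseconds.
    f = open(f"{CG}/memory.pressure", "r+")
    f.write("some 150000 1000000")
    f.flush()

    p = select.poll()
    p.register(f, select.POLLPRI)
    while True:
        for fd, event in p.poll():          # blocks until the threshold is crossed
            if event & select.POLLERR:
                raise SystemExit("cgroup was removed")
            # The job is stalling more than it said it can tolerate, so a
            # proactive reclaimer driven by this signal should back off here.
            print("memory pressure threshold breached")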
Hello Shakeel, On Wed, Sep 30, 2020 at 08:26:26AM -0700, Shakeel Butt wrote: > On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Workloads may not > > allocate anything for hours, and then suddenly allocate gigabytes > > within seconds. A sudden onset of streaming reads through the > > filesystem could destroy the workingset measurements, whereas a limit > > would catch it and do drop-behind (and thus workingset sampling) at > > the exact rate of allocations. > > > > Again I believe something that may be doable as a hyperscale operator, > > but likely too fragile to get wider applications beyond that. > > > > My take is that a proactive reclaim feature, whose goal is never to > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > would ideally have: > > > > - a pressure or size target specified by userspace but with > > enforcement driven inside the kernel from the allocation path > > > > - the enforcement work NOT be done synchronously by the workload > > (something I'd argue we want for *all* memory limits) > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > cgroup's memory allocations causing the work (again something I'd > > argue we want in general) > > For this point I think we want more flexibility to control the > resources we want to dedicate for proactive reclaim. One particular > example from our production is the batch jobs with high memory > footprint. These jobs don't have enough CPU quota but we do want to > proactively reclaim from them. We would prefer to dedicate some amount > of CPU to proactively reclaim from them independent of their own CPU > quota. Would it not work to add headroom for this reclaim overhead to the CPU quota of the job? The reason I'm asking is because reclaim is only one side of the proactive reclaim medal. The other side is taking faults and having to do IO and/or decompression (zswap, compressed btrfs) on the workload side. And that part is unavoidably consuming CPU and IO quota of the workload. So I wonder how much this can generally be separated out. It's certainly something we've been thinking about as well. Currently, because we use memory.high, we have all the reclaim work being done by a privileged daemon outside the cgroup, and the workload pressure only stems from the refault side. But that means a workload is consuming privileged CPU cycles, and the amount varies depending on the memory access patterns - how many rotations the reclaim scanner is doing etc. So I do wonder whether this "cost of business" of running a workload with a certain memory footprint should be accounted to the workload itself. Because at the end of the day, the CPU you have available will dictate how much memory you need, and both of these axes affect how you can schedule this job in a shared compute pool. Do neighboring jobs on the same host leave you either the memory for your colder pages, or the CPU (and IO) to trim them off? For illustration, compare extreme examples of this. A) A workload that has its executable/libraries and a fixed set of hot heap pages. Proactive reclaim will be relatively slow and cheap - a couple of deactivations/rotations. B) A workload that does high-speed streaming IO and generates a lot of drop-behind cache; or a workload that has a huge virtual anon set with lots of allocations and MADV_FREEing going on. Proactive reclaim will be fast and expensive. 
Even at the same memory target size, these two types of jobs have very different requirements for the host environment they can run on. It seems to me that this is a cost that should be captured in the job's overall resource footprint.
Hi Johannes, On Thu, Oct 1, 2020 at 8:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Hello Shakeel, > > On Wed, Sep 30, 2020 at 08:26:26AM -0700, Shakeel Butt wrote: > > On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > Workloads may not > > > allocate anything for hours, and then suddenly allocate gigabytes > > > within seconds. A sudden onset of streaming reads through the > > > filesystem could destroy the workingset measurements, whereas a limit > > > would catch it and do drop-behind (and thus workingset sampling) at > > > the exact rate of allocations. > > > > > > Again I believe something that may be doable as a hyperscale operator, > > > but likely too fragile to get wider applications beyond that. > > > > > > My take is that a proactive reclaim feature, whose goal is never to > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > would ideally have: > > > > > > - a pressure or size target specified by userspace but with > > > enforcement driven inside the kernel from the allocation path > > > > > > - the enforcement work NOT be done synchronously by the workload > > > (something I'd argue we want for *all* memory limits) > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > cgroup's memory allocations causing the work (again something I'd > > > argue we want in general) > > > > For this point I think we want more flexibility to control the > > resources we want to dedicate for proactive reclaim. One particular > > example from our production is the batch jobs with high memory > > footprint. These jobs don't have enough CPU quota but we do want to > > proactively reclaim from them. We would prefer to dedicate some amount > > of CPU to proactively reclaim from them independent of their own CPU > > quota. > > Would it not work to add headroom for this reclaim overhead to the CPU > quota of the job? > > The reason I'm asking is because reclaim is only one side of the > proactive reclaim medal. The other side is taking faults and having to > do IO and/or decompression (zswap, compressed btrfs) on the workload > side. And that part is unavoidably consuming CPU and IO quota of the > workload. So I wonder how much this can generally be separated out. > > It's certainly something we've been thinking about as well. Currently, > because we use memory.high, we have all the reclaim work being done by > a privileged daemon outside the cgroup, and the workload pressure only > stems from the refault side. > > But that means a workload is consuming privileged CPU cycles, and the > amount varies depending on the memory access patterns - how many > rotations the reclaim scanner is doing etc. > > So I do wonder whether this "cost of business" of running a workload > with a certain memory footprint should be accounted to the workload > itself. Because at the end of the day, the CPU you have available will > dictate how much memory you need, and both of these axes affect how > you can schedule this job in a shared compute pool. Do neighboring > jobs on the same host leave you either the memory for your colder > pages, or the CPU (and IO) to trim them off? > > For illustration, compare extreme examples of this. > > A) A workload that has its executable/libraries and a fixed > set of hot heap pages. Proactive reclaim will be relatively > slow and cheap - a couple of deactivations/rotations. 
> > B) A workload that does high-speed streaming IO and generates > a lot of drop-behind cache; or a workload that has a huge > virtual anon set with lots of allocations and MADV_FREEing > going on. Proactive reclaim will be fast and expensive. > > Even at the same memory target size, these two types of jobs have very > different requirements toward the host environment they can run on. > > It seems to me that this is cost that should be captured in the job's > overall resource footprint. I understand your point but from the usability perspective, I am finding it hard to deploy/use. As you said, the proactive reclaim cost will be different for different types of workload but I do not expect the job owners telling me how much headroom their jobs need. I would have to start with a fixed headroom for a job, have to monitor the resource usage of the proactive reclaim for it and dynamically adjust the headroom to not steal the CPU from the job (I am assuming there is no isolation between job and proactive reclaim). This seems very hard to use as compared to setting aside a fixed amount of CPU for proactive reclaim system wide. Please correct me if I am misunderstanding something.
On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > [snip] > > > So instead of asking users for a target size whose suitability > > > heavily depends on the kernel's LRU implementation, the readahead > > > code, the IO device's capability and general load, why not directly > > > ask the user for a pressure level that the workload is comfortable > > > with and which captures all of the above factors implicitly? Then > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > I am assuming here by pressure level you are referring to the PSI like > > interface e.g. allowing the users to tell about their jobs that X > > amount of stalls in a fixed time window is tolerable. > > Right, essentially the same parameters that psi poll() would take. I thought a bit more about the semantics of PSI usage for proactive reclaim. Suppose I have a top level cgroup A on which I want to enable proactive reclaim. Which memory PSI events should the proactive reclaim consider? The simplest would be the memory.psi at 'A'. However memory.psi is hierarchical and I would not really want the pressure due to limits in children of 'A' to impact the proactive reclaim. PSI due to refaults and slow IO should be included, or maybe only the ones caused by the proactive reclaim itself. I am undecided on the PSI due to compaction. PSI due to global reclaim for 'A' is even more complicated. That is a stall due to reclaiming from the system, including self, and it might not really cause more refaults and IOs for 'A'. Should proactive reclaim ignore the pressure due to global reclaim when tuning its aggressiveness? Am I overthinking here?
On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote: > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > [snip] > > > > So instead of asking users for a target size whose suitability > > > > heavily depends on the kernel's LRU implementation, the readahead > > > > code, the IO device's capability and general load, why not directly > > > > ask the user for a pressure level that the workload is comfortable > > > > with and which captures all of the above factors implicitly? Then > > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > > > I am assuming here by pressure level you are referring to the PSI like > > > interface e.g. allowing the users to tell about their jobs that X > > > amount of stalls in a fixed time window is tolerable. > > > > Right, essentially the same parameters that psi poll() would take. > > I thought a bit more on the semantics of the psi usage for the > proactive reclaim. > > Suppose I have a top level cgroup A on which I want to enable > proactive reclaim. Which memory psi events should the proactive > reclaim should consider? > > The simplest would be the memory.psi at 'A'. However memory.psi is > hierarchical and I would not really want the pressure due limits in > children of 'A' to impact the proactive reclaim. I don't think pressure from limits down the tree can be separated out, generally. All events are accounted recursively as well. Of course, we remember the reclaim level for evicted entries - but if there is reclaim triggered at A and A/B concurrently, the distribution of who ends up reclaiming the physical pages in A/B is pretty arbitrary/racy. If A/B decides to do its own proactive reclaim with the sublimit, and ends up consuming the pressure budget assigned to proactive reclaim in A, there isn't much that can be done. It's also possible that proactive reclaim in A keeps A/B from hitting its limit in the first place. I have to say, the configuration doesn't really strike me as sensible, though. Limits make sense for doing fixed partitioning: A gets 4G, A/B gets 2G out of that. But if you do proactive reclaim on A you're essentially saying A as a whole is auto-sizing dynamically based on its memory access pattern. I'm not sure what it means to then start doing fixed partitions in the sublevel. > PSI due to refaults and slow IO should be included or maybe only > those which are caused by the proactive reclaim itself. I am > undecided on the PSI due to compaction. PSI due to global reclaim > for 'A' is even more complicated. This is a stall due to reclaiming > from the system including self. It might not really cause more > refaults and IOs for 'A'. Should proactive reclaim ignore the > pressure due to global pressure when tuning its aggressiveness. Yeah, I think they should all be included, because ultimately what matters is what the workload can tolerate without sacrificing performance. Proactive reclaim can destroy THPs, so the cost of recreating them should be reflected. Otherwise you can easily overpressurize. For global reclaim, if you say you want a workload pressurized to X percent in order to drive the LRUs and chop off all cold pages the workload can live without, it doesn't matter who does the work. If there is an abundance of physical memory, it's going to be proactive reclaim. If physical memory is already tight enough that global reclaim does it for you, there is nothing to be done in addition, and proactive reclaim should hang back. 
Otherwise you can again easily overpressurize the workload.
On Mon, Oct 05, 2020 at 02:59:10PM -0700, Shakeel Butt wrote: > Hi Johannes, > > On Thu, Oct 1, 2020 at 8:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Hello Shakeel, > > > > On Wed, Sep 30, 2020 at 08:26:26AM -0700, Shakeel Butt wrote: > > > On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Workloads may not > > > > allocate anything for hours, and then suddenly allocate gigabytes > > > > within seconds. A sudden onset of streaming reads through the > > > > filesystem could destroy the workingset measurements, whereas a limit > > > > would catch it and do drop-behind (and thus workingset sampling) at > > > > the exact rate of allocations. > > > > > > > > Again I believe something that may be doable as a hyperscale operator, > > > > but likely too fragile to get wider applications beyond that. > > > > > > > > My take is that a proactive reclaim feature, whose goal is never to > > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > > would ideally have: > > > > > > > > - a pressure or size target specified by userspace but with > > > > enforcement driven inside the kernel from the allocation path > > > > > > > > - the enforcement work NOT be done synchronously by the workload > > > > (something I'd argue we want for *all* memory limits) > > > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > > cgroup's memory allocations causing the work (again something I'd > > > > argue we want in general) > > > > > > For this point I think we want more flexibility to control the > > > resources we want to dedicate for proactive reclaim. One particular > > > example from our production is the batch jobs with high memory > > > footprint. These jobs don't have enough CPU quota but we do want to > > > proactively reclaim from them. We would prefer to dedicate some amount > > > of CPU to proactively reclaim from them independent of their own CPU > > > quota. > > > > Would it not work to add headroom for this reclaim overhead to the CPU > > quota of the job? > > > > The reason I'm asking is because reclaim is only one side of the > > proactive reclaim medal. The other side is taking faults and having to > > do IO and/or decompression (zswap, compressed btrfs) on the workload > > side. And that part is unavoidably consuming CPU and IO quota of the > > workload. So I wonder how much this can generally be separated out. > > > > It's certainly something we've been thinking about as well. Currently, > > because we use memory.high, we have all the reclaim work being done by > > a privileged daemon outside the cgroup, and the workload pressure only > > stems from the refault side. > > > > But that means a workload is consuming privileged CPU cycles, and the > > amount varies depending on the memory access patterns - how many > > rotations the reclaim scanner is doing etc. > > > > So I do wonder whether this "cost of business" of running a workload > > with a certain memory footprint should be accounted to the workload > > itself. Because at the end of the day, the CPU you have available will > > dictate how much memory you need, and both of these axes affect how > > you can schedule this job in a shared compute pool. Do neighboring > > jobs on the same host leave you either the memory for your colder > > pages, or the CPU (and IO) to trim them off? > > > > For illustration, compare extreme examples of this. > > > > A) A workload that has its executable/libraries and a fixed > > set of hot heap pages. 
Proactive reclaim will be relatively > > slow and cheap - a couple of deactivations/rotations. > > > > B) A workload that does high-speed streaming IO and generates > > a lot of drop-behind cache; or a workload that has a huge > > virtual anon set with lots of allocations and MADV_FREEing > > going on. Proactive reclaim will be fast and expensive. > > > > Even at the same memory target size, these two types of jobs have very > > different requirements toward the host environment they can run on. > > > > It seems to me that this is cost that should be captured in the job's > > overall resource footprint. > > I understand your point but from the usability perspective, I am > finding it hard to deploy/use. > > As you said, the proactive reclaim cost will be different for > different types of workload but I do not expect the job owners telling > me how much headroom their jobs need. Isn't that the same for all work performed by the kernel? Instead of proactive reclaim, it could just be regular reclaim due to a limit, whose required headroom depends on the workload's allocation rate. We wouldn't question whether direct reclaim cycles should be charged to the cgroup. I'm not quite sure why proactive reclaim is different - it's the same work done earlier. > I would have to start with a fixed headroom for a job, have to monitor > the resource usage of the proactive reclaim for it and dynamically > adjust the headroom to not steal the CPU from the job (I am assuming > there is no isolation between job and proactive reclaim). > > This seems very hard to use as compared to setting aside a fixed > amount of CPU for proactive reclaim system wide. Please correct me if > I am misunderstanding something. I see your point, but I don't know how a fixed system-wide pool is easier to configure if you don't know the constituent consumers. How much would you set aside? A shared resource outside the natural cgroup hierarchy also triggers my priority inversion alarm bells. How do you prevent a lower priority job from consuming a disproportionate share of this pool? And as a result cause the reclaim in higher priority groups to slow down, which causes their memory footprint to expand and their LRUs to go stale. It also still leaves the question around IO budget. Even if you manage to not eat into the CPU budget of the job, you'd still eat into the IO budget of the job, and that's harder to separate out.
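For illustration, the status-quo approach discussed above - a privileged daemon outside the cgroup driving reclaim through memory.high - might look roughly like the sketch below. The cgroup path, the 4G target and the pacing are made-up assumptions, not part of this patch.

/*
 * Hedged sketch of the memory.high-driven proactive reclaim described
 * above: a privileged daemon outside the cgroup temporarily clamps
 * memory.high so the kernel reclaims the excess (in the daemon's
 * context), then lifts the clamp again. Path and values are assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	const char *high = "/sys/fs/cgroup/job/memory.high";	/* hypothetical path */

	/* Lowering memory.high makes the kernel reclaim down toward the new
	 * value in the context of this write; whatever remains above the
	 * limit throttles the workload's own allocations until usage drops. */
	if (write_file(high, "4G"))
		perror("clamp memory.high");

	sleep(1);	/* let reclaim and refaults play out */

	/* Remove the clamp so normal allocations are no longer throttled. */
	if (write_file(high, "max"))
		perror("restore memory.high");
	return 0;
}

This is the arrangement where the reclaim CPU cycles land on the daemon rather than on the workload, which is exactly the accounting question being debated above.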
On Thu, Oct 8, 2020 at 7:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote: > > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > [snip] > > > > > So instead of asking users for a target size whose suitability > > > > > heavily depends on the kernel's LRU implementation, the readahead > > > > > code, the IO device's capability and general load, why not directly > > > > > ask the user for a pressure level that the workload is comfortable > > > > > with and which captures all of the above factors implicitly? Then > > > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > > > > > I am assuming here by pressure level you are referring to the PSI like > > > > interface e.g. allowing the users to tell about their jobs that X > > > > amount of stalls in a fixed time window is tolerable. > > > > > > Right, essentially the same parameters that psi poll() would take. > > > > I thought a bit more on the semantics of the psi usage for the > > proactive reclaim. > > > > Suppose I have a top level cgroup A on which I want to enable > > proactive reclaim. Which memory psi events should the proactive > > reclaim should consider? > > > > The simplest would be the memory.psi at 'A'. However memory.psi is > > hierarchical and I would not really want the pressure due limits in > > children of 'A' to impact the proactive reclaim. > > I don't think pressure from limits down the tree can be separated out, > generally. All events are accounted recursively as well. Of course, we > remember the reclaim level for evicted entries - but if there is > reclaim triggered at A and A/B concurrently, the distribution of who > ends up reclaiming the physical pages in A/B is pretty arbitrary/racy. > > If A/B decides to do its own proactive reclaim with the sublimit, and > ends up consuming the pressure budget assigned to proactive reclaim in > A, there isn't much that can be done. > > It's also possible that proactive reclaim in A keeps A/B from hitting > its limit in the first place. > > I have to say, the configuration doesn't really strike me as sensible, > though. Limits make sense for doing fixed partitioning: A gets 4G, A/B > gets 2G out of that. But if you do proactive reclaim on A you're > essentially saying A as a whole is auto-sizing dynamically based on > its memory access pattern. I'm not sure what it means to then start > doing fixed partitions in the sublevel. > Think of the scenario where there is an infrastructure owner and the large number of job owners. The aim of the infra owner is to reduce cost by stuffing as many jobs as possible on the same machine while job owners want consistent performance. The job owners usually have meta jobs i.e. a set of small jobs that run on the same machines and they manage these sub-jobs themselves. The infra owner wants to do proactive reclaim to trim the current jobs without impacting their performance and more importantly to have enough memory to land new jobs (We have learned the hard way that depending on global reclaim for memory overcommit is really bad for isolation). In the above scenario the configuration you mentioned might not be sensible is really possible. This is exactly what we have in prod. You can also get the idea why I am asking for flexibility for the cost of proactive reclaim.
On Thu, Oct 08, 2020 at 08:55:57AM -0700, Shakeel Butt wrote: > On Thu, Oct 8, 2020 at 7:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote: > > > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > [snip] > > > > > > So instead of asking users for a target size whose suitability > > > > > > heavily depends on the kernel's LRU implementation, the readahead > > > > > > code, the IO device's capability and general load, why not directly > > > > > > ask the user for a pressure level that the workload is comfortable > > > > > > with and which captures all of the above factors implicitly? Then > > > > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > > > > > > > I am assuming here by pressure level you are referring to the PSI like > > > > > interface e.g. allowing the users to tell about their jobs that X > > > > > amount of stalls in a fixed time window is tolerable. > > > > > > > > Right, essentially the same parameters that psi poll() would take. > > > > > > I thought a bit more on the semantics of the psi usage for the > > > proactive reclaim. > > > > > > Suppose I have a top level cgroup A on which I want to enable > > > proactive reclaim. Which memory psi events should the proactive > > > reclaim should consider? > > > > > > The simplest would be the memory.psi at 'A'. However memory.psi is > > > hierarchical and I would not really want the pressure due limits in > > > children of 'A' to impact the proactive reclaim. > > > > I don't think pressure from limits down the tree can be separated out, > > generally. All events are accounted recursively as well. Of course, we > > remember the reclaim level for evicted entries - but if there is > > reclaim triggered at A and A/B concurrently, the distribution of who > > ends up reclaiming the physical pages in A/B is pretty arbitrary/racy. > > > > If A/B decides to do its own proactive reclaim with the sublimit, and > > ends up consuming the pressure budget assigned to proactive reclaim in > > A, there isn't much that can be done. > > > > It's also possible that proactive reclaim in A keeps A/B from hitting > > its limit in the first place. > > > > I have to say, the configuration doesn't really strike me as sensible, > > though. Limits make sense for doing fixed partitioning: A gets 4G, A/B > > gets 2G out of that. But if you do proactive reclaim on A you're > > essentially saying A as a whole is auto-sizing dynamically based on > > its memory access pattern. I'm not sure what it means to then start > > doing fixed partitions in the sublevel. > > > > Think of the scenario where there is an infrastructure owner and the > large number of job owners. The aim of the infra owner is to reduce > cost by stuffing as many jobs as possible on the same machine while > job owners want consistent performance. > > The job owners usually have meta jobs i.e. a set of small jobs that > run on the same machines and they manage these sub-jobs themselves. > > The infra owner wants to do proactive reclaim to trim the current jobs > without impacting their performance and more importantly to have > enough memory to land new jobs (We have learned the hard way that > depending on global reclaim for memory overcommit is really bad for > isolation). > > In the above scenario the configuration you mentioned might not be > sensible is really possible. This is exactly what we have in prod. I apologize if my statement was worded too broadly. 
I fully understand your motivation and understand the sub job structure. It's more about at which level to run proactive reclaim when there are sub-domains.

You said you're already using a feedback loop to adjust proactive reclaim based on refault rates. How do you deal with this issue today of one subgroup potentially having higher refaults due to a limit?

It appears that as soon as the subgroups can age independently, you also need to treat them independently for proactive reclaim. Because one group hitting its pressure limit says nothing about its sibling.

If you apply equal reclaim on them both based on the independently pressured subjob, you'll under-reclaim the siblings. If you apply equal reclaim on them both based on the unpressured siblings alone, you'll over-pressurize the one with its own limit.

This seems independent of the exact metric you're using, and more about at which level you apply pressure, and whether reclaim subdomains created through a hard limit can be treated as part of a larger shared pool or not.
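To make the pressure-guided idea from this exchange concrete, a userspace loop could combine a PSI trigger on the cgroup's memory.pressure with small writes to the memory.reclaim file proposed by this patch: keep trimming while the workload stays under its stall budget, back off when the trigger fires. The cgroup path, the 16M step and the "some 100ms per 1s" threshold below are illustrative assumptions, not values from this thread.

/*
 * Hedged sketch of a pressure-guided proactive reclaim loop: trim the
 * cgroup via memory.reclaim (proposed in this patch) while its PSI
 * memory pressure stays below a tolerance, and back off once a PSI
 * trigger fires. Paths and numbers are assumptions.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_once(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	const char *trigger = "some 100000 1000000";	/* 100ms stall per 1s window */
	const char *psi_path = "/sys/fs/cgroup/A/memory.pressure";	/* hypothetical */
	const char *reclaim_path = "/sys/fs/cgroup/A/memory.reclaim";	/* hypothetical */
	struct pollfd pfd;
	int psi_fd;

	psi_fd = open(psi_path, O_RDWR | O_NONBLOCK);
	if (psi_fd < 0 || write(psi_fd, trigger, strlen(trigger)) < 0) {
		perror("psi trigger");
		return 1;
	}
	pfd.fd = psi_fd;
	pfd.events = POLLPRI;

	for (;;) {
		/* Block up to 1s waiting for the cgroup to exceed its stall budget. */
		int ret = poll(&pfd, 1, 1000);

		if (ret > 0 && (pfd.revents & POLLPRI)) {
			sleep(10);	/* pressured: back off and let refaults settle */
			continue;
		}
		/* No pressure event in the last window: trim a small slice. */
		if (write_once(reclaim_path, "16M"))
			perror("memory.reclaim");
	}
}

Note that this sketch runs against a single cgroup; as the discussion above points out, subtrees with their own limits would need their own loops rather than being treated as part of one shared pool.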
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6be43781ec7f..58d70b5989d7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.reclaim
+	A write-only file which exists on non-root cgroups.
+
+	This is a simple interface to trigger memory reclaim in the
+	target cgroup. Write the number of bytes to reclaim to this
+	file and the kernel will try to reclaim that much memory.
+	Please note that the kernel can over or under reclaim from
+	the target cgroup.
+
   memory.oom.group
 	A read-write single value file which exists on non-root
 	cgroups. The default value is "0".
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75cd1a1e66c8..2d006c36d7f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+
+		reclaimed = try_to_free_mem_cgroup_pages(memcg,
+						nr_to_reclaim - nr_reclaimed,
+						GFP_KERNEL, true);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6508,6 +6540,11 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "reclaim",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.write = memory_reclaim,
+	},
 	{ }	/* terminate */
 };
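Because the new file is registered with CFTYPE_NS_DELEGATABLE in the hunk above, a latency-tolerant thread of the workload itself can drive reclaim on its own memcg (the per-memcg uswapd use case) without an extra cgroup layer. A minimal sketch follows; the delegated cgroup path, the 64M request and the trigger policy are assumptions for illustration only, and detecting "near the limit" is out of scope here.

/*
 * Minimal uswapd-style sketch: a latency-tolerant thread writes to its
 * own cgroup's memory.reclaim (proposed in this patch) when the
 * application decides it is near its limit. Path and size are assumed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Ask the kernel to reclaim roughly `bytes_str` (e.g. "64M") from our memcg. */
static int memcg_reclaim(const char *bytes_str)
{
	/* hypothetical delegated cgroup of the application */
	int fd = open("/sys/fs/cgroup/job/memory.reclaim", O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, bytes_str, strlen(bytes_str));
	close(fd);
	/* The kernel may reclaim more or less than requested (see the doc hunk). */
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	if (memcg_reclaim("64M"))
		perror("memory.reclaim");
	return 0;
}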
Interface options:
------------------

This patch introduces a very simple memcg interface, 'echo 10M > memory.reclaim', to trigger reclaim in the target memory cgroup. In the future we might want to reclaim a specific type of memory from a memcg, so this interface can be extended to allow that, e.g.:

$ echo 10M [all|anon|file|kmem] > memory.reclaim

However, that should wait until we have concrete use cases for such functionality. Keep things simple for now.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
 mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
 2 files changed, 46 insertions(+)