Message ID | 20200909215752.1725525-1-shakeelb@google.com |
---|---|
State | New, archived |
Series | memcg: introduce per-memcg reclaim interface |
On 2020-09-09T14:57:52-07:00 Shakeel Butt <shakeelb@google.com> wrote: > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use cases: > ---------- > > 1) Per-memcg uswapd: > > Usually applications consists of combination of latency sensitive and > latency tolerant tasks. For example, tasks serving user requests vs > tasks doing data backup for a database application. At the moment the > kernel does not differentiate between such tasks when the application > hits the memcg limits. So, potentially a latency sensitive user facing > task can get stuck in high reclaim and be throttled by the kernel. > > Similarly there are cases of single process applications having two set > of thread pools where threads from one pool have high scheduling > priority and low latency requirement. One concrete example from our > production is the VMM which have high priority low latency thread pool > for the VCPUs while separate thread pool for stats reporting, I/O > emulation, health checks and other managerial operations. The kernel > memory reclaim does not differentiate between VCPU thread or a > non-latency sensitive thread and a VCPU thread can get stuck in high > reclaim. > > One way to resolve this issue is to preemptively trigger the memory > reclaim from a latency tolerant task (uswapd) when the application is > near the limits. Finding 'near the limits' situation is an orthogonal > problem. > > 2) Proactive reclaim: > > This is a similar to the previous use-case, the difference is instead of > waiting for the application to be near its limit to trigger memory > reclaim, continuously pressuring the memcg to reclaim a small amount of > memory. This gives more accurate and uptodate workingset estimation as > the LRUs are continuously sorted and can potentially provide more > deterministic memory overcommit behavior. The memory overcommit > controller can provide more proactive response to the changing behavior > of the running applications instead of being reactive. > > Benefit of user space solution: > ------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to centralized the > overhead while for uswapd, it makes more sense for the application to > pay for the cpu of the memory reclaim. > > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. > > Questions: > ---------- > > 1) Why memory.high is not enough? > > memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim as well as uswapd use cases. > However there is a big negative in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. > > Another issue with memory.high is that it is not delegatable. To > actually use this interface for uswapd, the application has to introduce > another layer of cgroup on whose memory.high it has write access. > > 2) Why uswapd safe from self induced reclaim? > > This is very similar to the scenario of oomd under global memory > pressure. 
We can use the similar mechanisms to protect uswapd from self > induced reclaim i.e. memory.min and mlock. > > Interface options: > ------------------ > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. > > In future we might want to reclaim specific type of memory from a memcg, > so, this interface can be extended to allow that. e.g. > > $ echo 10M [all|anon|file|kmem] > memory.reclaim > > However that should be when we have concrete use-cases for such > functionality. Keep things simple for now. > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > 2 files changed, 46 insertions(+) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index 6be43781ec7f..58d70b5989d7 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back. > high limit is used and monitored properly, this limit's > utility is limited to providing the final safety net. > > + memory.reclaim > + A write-only file which exists on non-root cgroups. > + > + This is a simple interface to trigger memory reclaim in the > + target cgroup. Write the number of bytes to reclaim to this > + file and the kernel will try to reclaim that much memory. > + Please note that the kernel can over or under reclaim from > + the target cgroup. > + > memory.oom.group > A read-write single value file which exists on non-root > cgroups. The default value is "0". > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 75cd1a1e66c8..2d006c36d7f3 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > return nbytes; > } > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > + size_t nbytes, loff_t off) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > + int err; > + > + buf = strstrip(buf); > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > + if (err) > + return err; > + > + while (nr_reclaimed < nr_to_reclaim) { > + unsigned long reclaimed; > + > + if (signal_pending(current)) > + break; > + > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > + nr_to_reclaim - nr_reclaimed, > + GFP_KERNEL, true); > + > + if (!reclaimed && !nr_retries--) > + break; Shouldn't the if condition use '||' instead of '&&'? I think it could be easier to read if we put the 'nr_retires' condition in the while condition as below (just my personal preference, though). while (nr_reclaimed < nr_to_reclaim && nr_retires--) Thanks, SeongJae Park
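As a concrete illustration of the per-memcg uswapd use case described in the cover letter above, a latency-tolerant userspace thread could preemptively write to memory.reclaim when usage approaches the limit. This is only a minimal sketch; the cgroup path, polling interval, slack threshold, and reclaim step size are assumptions for illustration, not part of the patch:

```c
/* uswapd sketch: preemptively reclaim when the memcg nears memory.high. */
#include <stdio.h>
#include <unistd.h>

#define MEMCG        "/sys/fs/cgroup/job"   /* assumed cgroup path */
#define SLACK        (64UL << 20)           /* start reclaiming 64 MiB before the limit */
#define RECLAIM_STEP (16UL << 20)           /* ask for 16 MiB per iteration */

static unsigned long read_bytes(const char *file)
{
	char path[256];
	unsigned long val = 0;
	FILE *f;

	snprintf(path, sizeof(path), MEMCG "/%s", file);
	f = fopen(path, "r");
	if (f) {
		fscanf(f, "%lu", &val);  /* note: memory.high may read back as "max" */
		fclose(f);
	}
	return val;
}

static void reclaim(unsigned long bytes)
{
	FILE *f = fopen(MEMCG "/memory.reclaim", "w");

	if (f) {
		fprintf(f, "%lu", bytes);  /* kernel may over- or under-reclaim */
		fclose(f);
	}
}

int main(void)
{
	for (;;) {
		unsigned long high = read_bytes("memory.high");
		unsigned long cur = read_bytes("memory.current");

		/* "Near the limit" detection is an orthogonal problem; this is a crude stand-in. */
		if (high && cur + SLACK > high)
			reclaim(RECLAIM_STEP);
		sleep(1);
	}
	return 0;
}
```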
On Wed, Sep 9, 2020 at 11:37 PM SeongJae Park <sjpark@amazon.com> wrote: > > On 2020-09-09T14:57:52-07:00 Shakeel Butt <shakeelb@google.com> wrote: > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use cases: > > ---------- > > > > 1) Per-memcg uswapd: > > > > Usually applications consists of combination of latency sensitive and > > latency tolerant tasks. For example, tasks serving user requests vs > > tasks doing data backup for a database application. At the moment the > > kernel does not differentiate between such tasks when the application > > hits the memcg limits. So, potentially a latency sensitive user facing > > task can get stuck in high reclaim and be throttled by the kernel. > > > > Similarly there are cases of single process applications having two set > > of thread pools where threads from one pool have high scheduling > > priority and low latency requirement. One concrete example from our > > production is the VMM which have high priority low latency thread pool > > for the VCPUs while separate thread pool for stats reporting, I/O > > emulation, health checks and other managerial operations. The kernel > > memory reclaim does not differentiate between VCPU thread or a > > non-latency sensitive thread and a VCPU thread can get stuck in high > > reclaim. > > > > One way to resolve this issue is to preemptively trigger the memory > > reclaim from a latency tolerant task (uswapd) when the application is > > near the limits. Finding 'near the limits' situation is an orthogonal > > problem. > > > > 2) Proactive reclaim: > > > > This is a similar to the previous use-case, the difference is instead of > > waiting for the application to be near its limit to trigger memory > > reclaim, continuously pressuring the memcg to reclaim a small amount of > > memory. This gives more accurate and uptodate workingset estimation as > > the LRUs are continuously sorted and can potentially provide more > > deterministic memory overcommit behavior. The memory overcommit > > controller can provide more proactive response to the changing behavior > > of the running applications instead of being reactive. > > > > Benefit of user space solution: > > ------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to centralized the > > overhead while for uswapd, it makes more sense for the application to > > pay for the cpu of the memory reclaim. > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. > > > > Questions: > > ---------- > > > > 1) Why memory.high is not enough? > > > > memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim as well as uswapd use cases. > > However there is a big negative in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > > > Another issue with memory.high is that it is not delegatable. 
To > > actually use this interface for uswapd, the application has to introduce > > another layer of cgroup on whose memory.high it has write access. > > > > 2) Why uswapd safe from self induced reclaim? > > > > This is very similar to the scenario of oomd under global memory > > pressure. We can use the similar mechanisms to protect uswapd from self > > induced reclaim i.e. memory.min and mlock. > > > > Interface options: > > ------------------ > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > > > In future we might want to reclaim specific type of memory from a memcg, > > so, this interface can be extended to allow that. e.g. > > > > $ echo 10M [all|anon|file|kmem] > memory.reclaim > > > > However that should be when we have concrete use-cases for such > > functionality. Keep things simple for now. > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > --- > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > 2 files changed, 46 insertions(+) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 6be43781ec7f..58d70b5989d7 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back. > > high limit is used and monitored properly, this limit's > > utility is limited to providing the final safety net. > > > > + memory.reclaim > > + A write-only file which exists on non-root cgroups. > > + > > + This is a simple interface to trigger memory reclaim in the > > + target cgroup. Write the number of bytes to reclaim to this > > + file and the kernel will try to reclaim that much memory. > > + Please note that the kernel can over or under reclaim from > > + the target cgroup. > > + > > memory.oom.group > > A read-write single value file which exists on non-root > > cgroups. The default value is "0". > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 75cd1a1e66c8..2d006c36d7f3 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > return nbytes; > > } > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > + size_t nbytes, loff_t off) > > +{ > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > + int err; > > + > > + buf = strstrip(buf); > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > + if (err) > > + return err; > > + > > + while (nr_reclaimed < nr_to_reclaim) { > > + unsigned long reclaimed; > > + > > + if (signal_pending(current)) > > + break; > > + > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > + nr_to_reclaim - nr_reclaimed, > > + GFP_KERNEL, true); > > + > > + if (!reclaimed && !nr_retries--) > > + break; > > Shouldn't the if condition use '||' instead of '&&'? I copied the pattern from memory_high_write(). > I think it could be > easier to read if we put the 'nr_retires' condition in the while condition as > below (just my personal preference, though). > > while (nr_reclaimed < nr_to_reclaim && nr_retires--) > The semantics will be different. In my version, it means tolerate MAX_RECLAIM_RETRIES reclaim failures and your suggestion means total MAX_RECLAIM_RETRIES tries. 
Please note that try_to_free_mem_cgroup_pages() internally does 'nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX)', so, we might need more than MAX_RECLAIM_RETRIES successful tries to actually reclaim the amount of memory the user has requested. > > Thanks, > SeongJae Park
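To make the semantic difference concrete, here is a small standalone sketch with a stub in place of try_to_free_mem_cgroup_pages() and an assumed retry count. The patch's loop spends a retry only on iterations that reclaim nothing, while the suggested form caps the total number of calls:

```c
#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16	/* assumed value, for illustration only */

/* Stub standing in for try_to_free_mem_cgroup_pages(): every third call reclaims nothing. */
static unsigned long reclaim_stub(int attempt)
{
	return (attempt % 3 == 0) ? 0 : 32;
}

int main(void)
{
	unsigned long nr_to_reclaim = 512, nr_reclaimed = 0;
	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
	int attempts = 0;

	/* Patch's form: only iterations that reclaim nothing consume a retry. */
	while (nr_reclaimed < nr_to_reclaim) {
		unsigned long reclaimed = reclaim_stub(attempts++);

		if (!reclaimed && !nr_retries--)
			break;
		nr_reclaimed += reclaimed;
	}
	printf("failure budget: %lu pages in %d calls\n", nr_reclaimed, attempts);

	/* Suggested form: every iteration consumes a retry, so at most 16 calls total. */
	nr_reclaimed = 0;
	nr_retries = MAX_RECLAIM_RETRIES;
	attempts = 0;
	while (nr_reclaimed < nr_to_reclaim && nr_retries--)
		nr_reclaimed += reclaim_stub(attempts++);
	printf("total attempts: %lu pages in %d calls\n", nr_reclaimed, attempts);

	return 0;
}
```

With this stub, the first loop keeps going until the full 512 pages are reclaimed (24 calls, 8 of them unproductive), while the second stops after 16 calls having reclaimed only 320 pages — which matches the point that more than MAX_RECLAIM_RETRIES successful calls may be needed when each call reclaims less than requested.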
> On Wed, Sep 9, 2020 at 11:37 PM SeongJae Park <sjpark@amazon.com> wrote: > > > > On 2020-09-09T14:57:52-07:00 Shakeel Butt <shakeelb@google.com> wrote: > > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > Use cases: > > > ---------- > > > > > > 1) Per-memcg uswapd: > > > > > > Usually applications consists of combination of latency sensitive and > > > latency tolerant tasks. For example, tasks serving user requests vs > > > tasks doing data backup for a database application. At the moment the > > > kernel does not differentiate between such tasks when the application > > > hits the memcg limits. So, potentially a latency sensitive user facing > > > task can get stuck in high reclaim and be throttled by the kernel. > > > > > > Similarly there are cases of single process applications having two set > > > of thread pools where threads from one pool have high scheduling > > > priority and low latency requirement. One concrete example from our > > > production is the VMM which have high priority low latency thread pool > > > for the VCPUs while separate thread pool for stats reporting, I/O > > > emulation, health checks and other managerial operations. The kernel > > > memory reclaim does not differentiate between VCPU thread or a > > > non-latency sensitive thread and a VCPU thread can get stuck in high > > > reclaim. > > > > > > One way to resolve this issue is to preemptively trigger the memory > > > reclaim from a latency tolerant task (uswapd) when the application is > > > near the limits. Finding 'near the limits' situation is an orthogonal > > > problem. > > > > > > 2) Proactive reclaim: > > > > > > This is a similar to the previous use-case, the difference is instead of > > > waiting for the application to be near its limit to trigger memory > > > reclaim, continuously pressuring the memcg to reclaim a small amount of > > > memory. This gives more accurate and uptodate workingset estimation as > > > the LRUs are continuously sorted and can potentially provide more > > > deterministic memory overcommit behavior. The memory overcommit > > > controller can provide more proactive response to the changing behavior > > > of the running applications instead of being reactive. > > > > > > Benefit of user space solution: > > > ------------------------------- > > > > > > 1) More flexible on who should be charged for the cpu of the memory > > > reclaim. For proactive reclaim, it makes more sense to centralized the > > > overhead while for uswapd, it makes more sense for the application to > > > pay for the cpu of the memory reclaim. > > > > > > 2) More flexible on dedicating the resources (like cpu). The memory > > > overcommit controller can balance the cost between the cpu usage and > > > the memory reclaimed. > > > > > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > > under memory pressure better reclaim candidates are selected. This also > > > gives more accurate and uptodate notion of working set for an > > > application. > > > > > > Questions: > > > ---------- > > > > > > 1) Why memory.high is not enough? > > > > > > memory.high can be used to trigger reclaim in a memcg and can > > > potentially be used for proactive reclaim as well as uswapd use cases. > > > However there is a big negative in using memory.high. It can potentially > > > introduce high reclaim stalls in the target application as the > > > allocations from the processes or the threads of the application can hit > > > the temporary memory.high limit. 
> > > > > > Another issue with memory.high is that it is not delegatable. To > > > actually use this interface for uswapd, the application has to introduce > > > another layer of cgroup on whose memory.high it has write access. > > > > > > 2) Why uswapd safe from self induced reclaim? > > > > > > This is very similar to the scenario of oomd under global memory > > > pressure. We can use the similar mechanisms to protect uswapd from self > > > induced reclaim i.e. memory.min and mlock. > > > > > > Interface options: > > > ------------------ > > > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > > trigger reclaim in the target memory cgroup. > > > > > > In future we might want to reclaim specific type of memory from a memcg, > > > so, this interface can be extended to allow that. e.g. > > > > > > $ echo 10M [all|anon|file|kmem] > memory.reclaim > > > > > > However that should be when we have concrete use-cases for such > > > functionality. Keep things simple for now. > > > > > > Signed-off-by: Shakeel Butt <shakeelb@google.com> > > > --- > > > Documentation/admin-guide/cgroup-v2.rst | 9 ++++++ > > > mm/memcontrol.c | 37 +++++++++++++++++++++++++ > > > 2 files changed, 46 insertions(+) > > > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > > index 6be43781ec7f..58d70b5989d7 100644 > > > --- a/Documentation/admin-guide/cgroup-v2.rst > > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > > @@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back. > > > high limit is used and monitored properly, this limit's > > > utility is limited to providing the final safety net. > > > > > > + memory.reclaim > > > + A write-only file which exists on non-root cgroups. > > > + > > > + This is a simple interface to trigger memory reclaim in the > > > + target cgroup. Write the number of bytes to reclaim to this > > > + file and the kernel will try to reclaim that much memory. > > > + Please note that the kernel can over or under reclaim from > > > + the target cgroup. > > > + > > > memory.oom.group > > > A read-write single value file which exists on non-root > > > cgroups. The default value is "0". > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > index 75cd1a1e66c8..2d006c36d7f3 100644 > > > --- a/mm/memcontrol.c > > > +++ b/mm/memcontrol.c > > > @@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of, > > > return nbytes; > > > } > > > > > > +static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, > > > + size_t nbytes, loff_t off) > > > +{ > > > + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); > > > + unsigned int nr_retries = MAX_RECLAIM_RETRIES; > > > + unsigned long nr_to_reclaim, nr_reclaimed = 0; > > > + int err; > > > + > > > + buf = strstrip(buf); > > > + err = page_counter_memparse(buf, "", &nr_to_reclaim); > > > + if (err) > > > + return err; > > > + > > > + while (nr_reclaimed < nr_to_reclaim) { > > > + unsigned long reclaimed; > > > + > > > + if (signal_pending(current)) > > > + break; > > > + > > > + reclaimed = try_to_free_mem_cgroup_pages(memcg, > > > + nr_to_reclaim - nr_reclaimed, > > > + GFP_KERNEL, true); > > > + > > > + if (!reclaimed && !nr_retries--) > > > + break; > > > > Shouldn't the if condition use '||' instead of '&&'? > > I copied the pattern from memory_high_write(). > > > I think it could be > > easier to read if we put the 'nr_retires' condition in the while condition as > > below (just my personal preference, though). 
> > > > while (nr_reclaimed < nr_to_reclaim && nr_retires--) > > > > The semantics will be different. In my version, it means tolerate > MAX_RECLAIM_RETRIES reclaim failures and your suggestion means total > MAX_RECLAIM_RETRIES tries. > > Please note that try_to_free_mem_cgroup_pages() internally does > 'nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX)', so, we might need > more than MAX_RECLAIM_RETRIES successful tries to actually reclaim the > amount of memory the user has requested. Thanks, understood your intention and agreed on the point. Reviewed-by: SeongJae Park <sjpark@amazon.de> Thanks, SeongJae Park
On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use cases: > ---------- > > 1) Per-memcg uswapd: > > Usually applications consists of combination of latency sensitive and > latency tolerant tasks. For example, tasks serving user requests vs > tasks doing data backup for a database application. At the moment the > kernel does not differentiate between such tasks when the application > hits the memcg limits. So, potentially a latency sensitive user facing > task can get stuck in high reclaim and be throttled by the kernel. > > Similarly there are cases of single process applications having two set > of thread pools where threads from one pool have high scheduling > priority and low latency requirement. One concrete example from our > production is the VMM which have high priority low latency thread pool > for the VCPUs while separate thread pool for stats reporting, I/O > emulation, health checks and other managerial operations. The kernel > memory reclaim does not differentiate between VCPU thread or a > non-latency sensitive thread and a VCPU thread can get stuck in high > reclaim. As those are presumably in the same cgroup what does prevent them to get stuck behind shared resources with taken during the reclaim performed by somebody else? I mean, memory reclaim might drop memory used by the high priority task. Or they might simply stumble over same locks. I am also more interested in actual numbers here. The high limit reclaim is normally swift and should be mostly unnoticeable. If the reclaim gets more expensive then it can get really noticeable for sure. But for the later the same can happen with the external pro-activee reclaimer as well, right? So there is no real "guarantee". Do you have any numbers from your workloads where you can demonstrate that the external reclaim has saved you this amount of effective cpu time of the sensitive workload? (Essentially measure how much time it has to consume in the high limit reclaim) To the feature itself, I am not yet convinced we want to have a feature like that. It surely sounds easy to use and attractive for a better user space control. It is also much well defined than drop_caches/force_empty because it is not all or nothing. But it also sounds like something too easy to use incorrectly (remember drop_caches). I am also a bit worried about corner cases wich would be easier to hit - e.g. fill up the swap limit and turn anonymous memory into unreclaimable and who knows what else.
On Mon, Sep 21, 2020 at 9:30 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use cases: > > ---------- > > > > 1) Per-memcg uswapd: > > > > Usually applications consists of combination of latency sensitive and > > latency tolerant tasks. For example, tasks serving user requests vs > > tasks doing data backup for a database application. At the moment the > > kernel does not differentiate between such tasks when the application > > hits the memcg limits. So, potentially a latency sensitive user facing > > task can get stuck in high reclaim and be throttled by the kernel. > > > > Similarly there are cases of single process applications having two set > > of thread pools where threads from one pool have high scheduling > > priority and low latency requirement. One concrete example from our > > production is the VMM which have high priority low latency thread pool > > for the VCPUs while separate thread pool for stats reporting, I/O > > emulation, health checks and other managerial operations. The kernel > > memory reclaim does not differentiate between VCPU thread or a > > non-latency sensitive thread and a VCPU thread can get stuck in high > > reclaim. > > As those are presumably in the same cgroup what does prevent them to get > stuck behind shared resources with taken during the reclaim performed by > somebody else? I mean, memory reclaim might drop memory used by the high > priority task. Or they might simply stumble over same locks. > Yes there are a lot of challenges in providing isolation between latency sensitive and latency tolerant jobs/threads. This proposal aims to solve one specific challenge memcg limit reclaim. > I am also more interested in actual numbers here. The high limit reclaim > is normally swift and should be mostly unnoticeable. If the reclaim gets > more expensive then it can get really noticeable for sure. But for the > later the same can happen with the external pro-activee reclaimer as I think you meant 'uswapd' here instead of pro-active reclaimer. > well, right? So there is no real "guarantee". Do you have any numbers > from your workloads where you can demonstrate that the external reclaim > has saved you this amount of effective cpu time of the sensitive > workload? (Essentially measure how much time it has to consume in the > high limit reclaim) > What we actually use in our production is the 'proactive reclaim' which I have explained in the original message but I will add a couple more sentences below. For the uswapd use-case, let me point to the previous discussions and feature requests by others [1, 2]. One of the limiting factors of these previous proposals was the lack of CPU accounting of the background reclaimer which the current proposal solves by enabling the user space solution. [1] https://lwn.net/Articles/753162/ [2] http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org Let me add one more point. Even if the high limit reclaim is swift, it can still take 100s of usecs. Most of our jobs are anon-only and we use zswap. Compressing a page can take a couple usec, so 100s of usecs in limit reclaim is normal. For latency sensitive jobs, this amount of hiccups do matters. For the proactive reclaim, based on the refault medium, we define tolerable refault rate of the applications. Then we proactively reclaim memory from the applications and monitor the refault rates. 
Based on the refault rates, the memory overcommit manager controls the aggressiveness of the proactive reclaim. This is exactly what we do in the production. Please let me know if you want to know why we do proactive reclaim in the first place. > To the feature itself, I am not yet convinced we want to have a feature > like that. It surely sounds easy to use and attractive for a better user > space control. It is also much well defined than drop_caches/force_empty > because it is not all or nothing. But it also sounds like something too > easy to use incorrectly (remember drop_caches). I am also a bit worried > about corner cases wich would be easier to hit - e.g. fill up the swap > limit and turn anonymous memory into unreclaimable and who knows what > else. The corner cases you are worried about are already possible with the existing interfaces. We can already do all such things with memory.high interface but with some limitations. This new interface resolves that limitation as explained in the original email. Please let me know if you have more questions. thanks, Shakeel
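For reference, a minimal userspace proactive reclaimer along the lines described in this message could look like the sketch below. The cgroup path, the use of the workingset_refault counter from memory.stat, the interval, the refault budget, and the step size are all assumptions made for illustration, not a description of the production controller:

```c
#include <stdio.h>
#include <unistd.h>

#define MEMCG          "/sys/fs/cgroup/job"   /* assumed cgroup path */
#define RECLAIM_STEP   (16UL << 20)           /* pressure: 16 MiB per interval */
#define REFAULT_BUDGET 100UL                  /* tolerable refaults per interval */

/* Read the workingset_refault counter from memory.stat (field name assumed). */
static unsigned long read_refaults(void)
{
	char line[256];
	unsigned long val = 0;
	FILE *f = fopen(MEMCG "/memory.stat", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "workingset_refault %lu", &val) == 1)
			break;
	fclose(f);
	return val;
}

static void reclaim(unsigned long bytes)
{
	FILE *f = fopen(MEMCG "/memory.reclaim", "w");

	if (f) {
		fprintf(f, "%lu", bytes);  /* kernel may over- or under-reclaim */
		fclose(f);
	}
}

int main(void)
{
	unsigned long prev = read_refaults();

	for (;;) {
		sleep(10);

		unsigned long cur = read_refaults();

		/* Keep pressuring while refaults stay under budget; back off otherwise. */
		if (cur - prev <= REFAULT_BUDGET)
			reclaim(RECLAIM_STEP);
		prev = cur;
	}
	return 0;
}
```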
On Mon 21-09-20 10:50:14, Shakeel Butt wrote: > On Mon, Sep 21, 2020 at 9:30 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > Use cases: > > > ---------- > > > > > > 1) Per-memcg uswapd: > > > > > > Usually applications consists of combination of latency sensitive and > > > latency tolerant tasks. For example, tasks serving user requests vs > > > tasks doing data backup for a database application. At the moment the > > > kernel does not differentiate between such tasks when the application > > > hits the memcg limits. So, potentially a latency sensitive user facing > > > task can get stuck in high reclaim and be throttled by the kernel. > > > > > > Similarly there are cases of single process applications having two set > > > of thread pools where threads from one pool have high scheduling > > > priority and low latency requirement. One concrete example from our > > > production is the VMM which have high priority low latency thread pool > > > for the VCPUs while separate thread pool for stats reporting, I/O > > > emulation, health checks and other managerial operations. The kernel > > > memory reclaim does not differentiate between VCPU thread or a > > > non-latency sensitive thread and a VCPU thread can get stuck in high > > > reclaim. > > > > As those are presumably in the same cgroup what does prevent them to get > > stuck behind shared resources with taken during the reclaim performed by > > somebody else? I mean, memory reclaim might drop memory used by the high > > priority task. Or they might simply stumble over same locks. > > > > Yes there are a lot of challenges in providing isolation between > latency sensitive and latency tolerant jobs/threads. This proposal > aims to solve one specific challenge memcg limit reclaim. I am fully aware that a complete isolation is hard to achieve. I am just trying evaluate how is this specific usecase worth a new interface that we will have to maintain for ever. Especially when I suspect that the interface will likely only paper over immediate problems rather than offer a long term maintainable solution for it. > > I am also more interested in actual numbers here. The high limit reclaim > > is normally swift and should be mostly unnoticeable. If the reclaim gets > > more expensive then it can get really noticeable for sure. But for the > > later the same can happen with the external pro-activee reclaimer as > > I think you meant 'uswapd' here instead of pro-active reclaimer. > > > well, right? So there is no real "guarantee". Do you have any numbers > > from your workloads where you can demonstrate that the external reclaim > > has saved you this amount of effective cpu time of the sensitive > > workload? (Essentially measure how much time it has to consume in the > > high limit reclaim) > > > > What we actually use in our production is the 'proactive reclaim' > which I have explained in the original message but I will add a couple > more sentences below. > > For the uswapd use-case, let me point to the previous discussions and > feature requests by others [1, 2]. One of the limiting factors of > these previous proposals was the lack of CPU accounting of the > background reclaimer which the current proposal solves by enabling the > user space solution. > > [1] https://lwn.net/Articles/753162/ > [2] http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org I remember those. 
My understanding was that the only problem is to properly account for CPU on behalf of the reclaimed cgroup and that has been work in progress for that. Outsourcing all that to userspace surely sounds like an attractive option but it comes with usual user API price. More on that later. > Let me add one more point. Even if the high limit reclaim is swift, it > can still take 100s of usecs. Most of our jobs are anon-only and we > use zswap. Compressing a page can take a couple usec, so 100s of usecs > in limit reclaim is normal. For latency sensitive jobs, this amount of > hiccups do matters. Understood. But isn't this an implementation detail of zswap? Can it offload some of the heavy lifting to a different context and reduce the general overhead? > For the proactive reclaim, based on the refault medium, we define > tolerable refault rate of the applications. Then we proactively > reclaim memory from the applications and monitor the refault rates. > Based on the refault rates, the memory overcommit manager controls the > aggressiveness of the proactive reclaim. > > This is exactly what we do in the production. Please let me know if > you want to know why we do proactive reclaim in the first place. This information is definitely useful and having it in the changelog would be useful. IIUC the only reason why you cannot use high limit to control this pro-active reclaim is the potential throttling due to expensive reclaim, correct? > > To the feature itself, I am not yet convinced we want to have a feature > > like that. It surely sounds easy to use and attractive for a better user > > space control. It is also much well defined than drop_caches/force_empty > > because it is not all or nothing. But it also sounds like something too > > easy to use incorrectly (remember drop_caches). I am also a bit worried > > about corner cases wich would be easier to hit - e.g. fill up the swap > > limit and turn anonymous memory into unreclaimable and who knows what > > else. > > The corner cases you are worried about are already possible with the > existing interfaces. We can already do all such things with > memory.high interface but with some limitations. This new interface > resolves that limitation as explained in the original email. You are right that misconfigured limits can result in problems. But such a configuration should be quite easy to spot which is not the case for targetted reclaim calls which do not leave any footprints behind. Existing interfaces are trying to not expose internal implementation details as much as well. You are proposing a very targeted interface to fine control the memory reclaim. There is a risk that userspace will start depending on a specific reclaim implementation/behavior and future changes would be prone to regressions in workloads relying on that. So effectively, any user space memory reclaimer would need to be tuned to a specific implementation of the memory reclaim. My past experience tells me that this is not a great thing for maintainability of neither kernel nor the userspace part. All that being said, we really should consider whether the proposed interface is trying to work around existing limitations in the reclaim or the interface. If this is the former then I do not think we should be adding it. If the later then we should discuss on how to improve our existing interfaces (or their implementations) to be better usable and allow your usecase to work better. What is your take on that Johannes?
On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote: > > On Mon, Sep 21, 2020 at 9:30 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Wed 09-09-20 14:57:52, Shakeel Butt wrote: > > > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > > > > > Use cases: > > > > ---------- > > > > > > > > 1) Per-memcg uswapd: > > > > > > > > Usually applications consists of combination of latency sensitive and > > > > latency tolerant tasks. For example, tasks serving user requests vs > > > > tasks doing data backup for a database application. At the moment the > > > > kernel does not differentiate between such tasks when the application > > > > hits the memcg limits. So, potentially a latency sensitive user facing > > > > task can get stuck in high reclaim and be throttled by the kernel. > > > > > > > > Similarly there are cases of single process applications having two set > > > > of thread pools where threads from one pool have high scheduling > > > > priority and low latency requirement. One concrete example from our > > > > production is the VMM which have high priority low latency thread pool > > > > for the VCPUs while separate thread pool for stats reporting, I/O > > > > emulation, health checks and other managerial operations. The kernel > > > > memory reclaim does not differentiate between VCPU thread or a > > > > non-latency sensitive thread and a VCPU thread can get stuck in high > > > > reclaim. > > > > > > As those are presumably in the same cgroup what does prevent them to get > > > stuck behind shared resources with taken during the reclaim performed by > > > somebody else? I mean, memory reclaim might drop memory used by the high > > > priority task. Or they might simply stumble over same locks. > > > > > > > Yes there are a lot of challenges in providing isolation between > > latency sensitive and latency tolerant jobs/threads. This proposal > > aims to solve one specific challenge memcg limit reclaim. > > I am fully aware that a complete isolation is hard to achieve. I am just > trying evaluate how is this specific usecase worth a new interface that > we will have to maintain for ever. Especially when I suspect that the > interface will likely only paper over immediate problems rather than > offer a long term maintainable solution for it. > I think you are getting too focused on the uswapd use-case only. The proposed interface enables the proactive reclaim as well which we actually use in production. > > > I am also more interested in actual numbers here. The high limit reclaim > > > is normally swift and should be mostly unnoticeable. If the reclaim gets > > > more expensive then it can get really noticeable for sure. But for the > > > later the same can happen with the external pro-activee reclaimer as > > > > I think you meant 'uswapd' here instead of pro-active reclaimer. > > > > > well, right? So there is no real "guarantee". Do you have any numbers > > > from your workloads where you can demonstrate that the external reclaim > > > has saved you this amount of effective cpu time of the sensitive > > > workload? (Essentially measure how much time it has to consume in the > > > high limit reclaim) > > > > > > > What we actually use in our production is the 'proactive reclaim' > > which I have explained in the original message but I will add a couple > > more sentences below. 
> > > > For the uswapd use-case, let me point to the previous discussions and > > feature requests by others [1, 2]. One of the limiting factors of > > these previous proposals was the lack of CPU accounting of the > > background reclaimer which the current proposal solves by enabling the > > user space solution. > > > > [1] https://lwn.net/Articles/753162/ > > [2] http://lkml.kernel.org/r/20200219181219.54356-1-hannes@cmpxchg.org > > I remember those. My understanding was that the only problem is to > properly account for CPU on behalf of the reclaimed cgroup and that has > been work in progress for that. > > Outsourcing all that to userspace surely sounds like an attractive > option but it comes with usual user API price. More on that later. > > > Let me add one more point. Even if the high limit reclaim is swift, it > > can still take 100s of usecs. Most of our jobs are anon-only and we > > use zswap. Compressing a page can take a couple usec, so 100s of usecs > > in limit reclaim is normal. For latency sensitive jobs, this amount of > > hiccups do matters. > > Understood. But isn't this an implementation detail of zswap? Can it > offload some of the heavy lifting to a different context and reduce the > general overhead? > Are you saying doing the compression asynchronously? Similar to how the disk-based swap triggers the writeback and puts the page back to LRU, so the next time reclaim sees it, it will be instantly reclaimed? Or send the batch of pages to be compressed to a different CPU and wait for the completion? BTW the proactive reclaim naturally offloads that to a different context. > > For the proactive reclaim, based on the refault medium, we define > > tolerable refault rate of the applications. Then we proactively > > reclaim memory from the applications and monitor the refault rates. > > Based on the refault rates, the memory overcommit manager controls the > > aggressiveness of the proactive reclaim. > > > > This is exactly what we do in the production. Please let me know if > > you want to know why we do proactive reclaim in the first place. > > This information is definitely useful and having it in the changelog > would be useful. IIUC the only reason why you cannot use high limit > to control this pro-active reclaim is the potential throttling due to > expensive reclaim, correct? > Yes. > > > To the feature itself, I am not yet convinced we want to have a feature > > > like that. It surely sounds easy to use and attractive for a better user > > > space control. It is also much well defined than drop_caches/force_empty > > > because it is not all or nothing. But it also sounds like something too > > > easy to use incorrectly (remember drop_caches). I am also a bit worried > > > about corner cases wich would be easier to hit - e.g. fill up the swap > > > limit and turn anonymous memory into unreclaimable and who knows what > > > else. > > > > The corner cases you are worried about are already possible with the > > existing interfaces. We can already do all such things with > > memory.high interface but with some limitations. This new interface > > resolves that limitation as explained in the original email. > > You are right that misconfigured limits can result in problems. But such > a configuration should be quite easy to spot which is not the case for > targetted reclaim calls which do not leave any footprints behind. > Existing interfaces are trying to not expose internal implementation > details as much as well. 
You are proposing a very targeted interface to > fine control the memory reclaim. There is a risk that userspace will > start depending on a specific reclaim implementation/behavior and future > changes would be prone to regressions in workloads relying on that. So > effectively, any user space memory reclaimer would need to be tuned to a > specific implementation of the memory reclaim. I don't see the exposure of internal memory reclaim implementation. The interface is very simple. Reclaim a given amount of memory. Either the kernel will reclaim less memory or it will over reclaim. In case of reclaiming less memory, the user space can retry given there is enough reclaimable memory. For the over reclaim case, the user space will backoff for a longer time. How are the internal reclaim implementation details exposed? > My past experience tells > me that this is not a great thing for maintainability of neither kernel > nor the userspace part. > > All that being said, we really should consider whether the proposed > interface is trying to work around existing limitations in the reclaim > or the interface. If this is the former then I do not think we should be > adding it. If the later then we should discuss on how to improve our > existing interfaces (or their implementations) to be better usable and > allow your usecase to work better. It is the limitation of the interface. My concern is in fixing the interface we might convolute the memory.high interface making it more burden to maintain than simply adding a new interface. > > What is your take on that Johannes? > -- > Michal Hocko > SUSE Labs
On Tue 22-09-20 08:54:25, Shakeel Butt wrote: > On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote: [...] > > > Let me add one more point. Even if the high limit reclaim is swift, it > > > can still take 100s of usecs. Most of our jobs are anon-only and we > > > use zswap. Compressing a page can take a couple usec, so 100s of usecs > > > in limit reclaim is normal. For latency sensitive jobs, this amount of > > > hiccups do matters. > > > > Understood. But isn't this an implementation detail of zswap? Can it > > offload some of the heavy lifting to a different context and reduce the > > general overhead? > > > > Are you saying doing the compression asynchronously? Similar to how > the disk-based swap triggers the writeback and puts the page back to > LRU, so the next time reclaim sees it, it will be instantly reclaimed? > Or send the batch of pages to be compressed to a different CPU and > wait for the completion? Yes. [...] > > You are right that misconfigured limits can result in problems. But such > > a configuration should be quite easy to spot which is not the case for > > targetted reclaim calls which do not leave any footprints behind. > > Existing interfaces are trying to not expose internal implementation > > details as much as well. You are proposing a very targeted interface to > > fine control the memory reclaim. There is a risk that userspace will > > start depending on a specific reclaim implementation/behavior and future > > changes would be prone to regressions in workloads relying on that. So > > effectively, any user space memory reclaimer would need to be tuned to a > > specific implementation of the memory reclaim. > > I don't see the exposure of internal memory reclaim implementation. > The interface is very simple. Reclaim a given amount of memory. Either > the kernel will reclaim less memory or it will over reclaim. In case > of reclaiming less memory, the user space can retry given there is > enough reclaimable memory. For the over reclaim case, the user space > will backoff for a longer time. How are the internal reclaim > implementation details exposed? In an ideal world yes. A feedback mechanism will be independent on the particular implementation. But the reality tends to disagree quite often. Once we provide a tool there will be users using it to the best of their knowlege. Very often as a hammer. This is what the history of kernel regressions and "we have to revert an obvious fix because userspace depends on an undocumented behavior which happened to work for some time" has thought us in a hard way. I really do not want to deal with reports where a new heuristic in the memory reclaim will break something just because the reclaim takes slightly longer or over/under reclaims differently so the existing assumptions break and the overall balancing from userspace breaks. This might be a shiny exception of course. And please note that I am not saying that the interface is completely wrong or unacceptable. I just want to be absolutely sure we cannot move forward with the existing API space that we have. So far I have learned that you are primarily working around an implementation detail in the zswap which is doing the swapout path directly in the pageout path. That sounds like a very bad reason to add a new interface. 
You are right that there are likely other use cases that would like this new interface - mostly to emulate drop_caches - but I believe those are quite misguided as well and we should work harder to help them use the existing APIs. Last but not least, the memcg background reclaim is something that should be possible without a new interface.
On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 08:54:25, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 21-09-20 10:50:14, Shakeel Butt wrote: > [...] > > > > Let me add one more point. Even if the high limit reclaim is swift, it > > > > can still take 100s of usecs. Most of our jobs are anon-only and we > > > > use zswap. Compressing a page can take a couple usec, so 100s of usecs > > > > in limit reclaim is normal. For latency sensitive jobs, this amount of > > > > hiccups do matters. > > > > > > Understood. But isn't this an implementation detail of zswap? Can it > > > offload some of the heavy lifting to a different context and reduce the > > > general overhead? > > > > > > > Are you saying doing the compression asynchronously? Similar to how > > the disk-based swap triggers the writeback and puts the page back to > > LRU, so the next time reclaim sees it, it will be instantly reclaimed? > > Or send the batch of pages to be compressed to a different CPU and > > wait for the completion? > > Yes. > Adding Minchan, if he has more experience/opinion on async swap on zram/zswap. > [...] > > > > You are right that misconfigured limits can result in problems. But such > > > a configuration should be quite easy to spot which is not the case for > > > targetted reclaim calls which do not leave any footprints behind. > > > Existing interfaces are trying to not expose internal implementation > > > details as much as well. You are proposing a very targeted interface to > > > fine control the memory reclaim. There is a risk that userspace will > > > start depending on a specific reclaim implementation/behavior and future > > > changes would be prone to regressions in workloads relying on that. So > > > effectively, any user space memory reclaimer would need to be tuned to a > > > specific implementation of the memory reclaim. > > > > I don't see the exposure of internal memory reclaim implementation. > > The interface is very simple. Reclaim a given amount of memory. Either > > the kernel will reclaim less memory or it will over reclaim. In case > > of reclaiming less memory, the user space can retry given there is > > enough reclaimable memory. For the over reclaim case, the user space > > will backoff for a longer time. How are the internal reclaim > > implementation details exposed? > > In an ideal world yes. A feedback mechanism will be independent on the > particular implementation. But the reality tends to disagree quite > often. Once we provide a tool there will be users using it to the best > of their knowlege. Very often as a hammer. This is what the history of > kernel regressions and "we have to revert an obvious fix because > userspace depends on an undocumented behavior which happened to work for > some time" has thought us in a hard way. > > I really do not want to deal with reports where a new heuristic in the > memory reclaim will break something just because the reclaim takes > slightly longer or over/under reclaims differently so the existing > assumptions break and the overall balancing from userspace breaks. > > This might be a shiny exception of course. And please note that I am not > saying that the interface is completely wrong or unacceptable. I just > want to be absolutely sure we cannot move forward with the existing API > space that we have. 
> > So far I have learned that you are primarily working around an > implementation detail in the zswap which is doing the swapout path > directly in the pageout path. Wait how did you reach this conclusion? I have explicitly said that we are not using uswapd like functionality in production. We are using this interface for proactive reclaim and proactive reclaim is not a workaround for implementation detail in the zswap. > That sounds like a very bad reason to add > a new interface. You are right that there are likely other usecases to > like this new interface - mostly to emulate drop_caches - but I believe > those are quite misguided as well and we should work harder to help > them out to use the existing APIs. I am not really understanding your concern specific for the new API. All of your concerns (user expectation of reclaim time or over/under reclaim) are still possible with the existing API i.e. memory.high. > Last but not least the memcg > background reclaim is something that should be possible without a new > interface. So, it comes down to adding more functionality/semantics to memory.high or introducing a new simple interface. I am fine with either of one but IMO convoluted memory.high might have a higher maintenance cost. I can send the patch to add the functionality in the memory.high but I would like to get Johannes's opinion first. Shakeel
On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > So far I have learned that you are primarily working around an > > implementation detail in the zswap which is doing the swapout path > > directly in the pageout path. > > Wait how did you reach this conclusion? I have explicitly said that we > are not using uswapd like functionality in production. We are using > this interface for proactive reclaim and proactive reclaim is not a > workaround for implementation detail in the zswap. Hmm, I must have missed the distinction between the two you have mentioned. Correct me if I am wrong but "latency sensitive" workload is the one that cannot use the high limit, right. For some reason I thought that your pro-active reclaim usecase is also not compatible with the throttling imposed by the high limit. Hence my conclusion above.
On Tue, Sep 22, 2020 at 11:31 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > So far I have learned that you are primarily working around an > > > implementation detail in the zswap which is doing the swapout path > > > directly in the pageout path. > > > > Wait how did you reach this conclusion? I have explicitly said that we > > are not using uswapd like functionality in production. We are using > > this interface for proactive reclaim and proactive reclaim is not a > > workaround for implementation detail in the zswap. > > Hmm, I must have missed the distinction between the two you have > mentioned. Correct me if I am wrong but "latency sensitive" workload is > the one that cannot use the high limit, right. Yes. > For some reason I thought > that your pro-active reclaim usecase is also not compatible with the > throttling imposed by the high limit. Hence my conclusion above. > For proactive reclaim use-case, it is more about the weirdness of using memory.high interface for proactive reclaim. Let's suppose I want to reclaim 20 MiB from a job. To use memory.high, I have to read memory.current and subtract 20MiB from it and then write that to memory.high and once that is done, I have to set memory.high to the previous value (job's original high limit). There is a time window where the allocation of the target job can hit the temporary memory.high which will cause uninteresting MEMCG_HIGH event, PSI pressure and can potentially over reclaim. Also there is a race between reading memory.current and setting the temporary memory.high. There are many non-deterministic variables added to the request of reclaiming 20MiB from a job. Shakeel
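Spelled out in code, the memory.high sequence just described looks roughly like the following sketch (the path and the 20 MiB figure are illustrative, and parsing a literal "max" from memory.high is ignored for brevity). The point is the read-modify-restore window compared to the single write at the end:

```c
#include <stdio.h>

#define MEMCG "/sys/fs/cgroup/job"	/* assumed cgroup path */

static unsigned long read_ul(const char *path)
{
	unsigned long v = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		fscanf(f, "%lu", &v);	/* a literal "max" is not handled here */
		fclose(f);
	}
	return v;
}

static void write_ul(const char *path, unsigned long v)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fprintf(f, "%lu", v);
		fclose(f);
	}
}

int main(void)
{
	unsigned long want = 20UL << 20;			/* reclaim 20 MiB */
	unsigned long cur  = read_ul(MEMCG "/memory.current");	/* racy snapshot */
	unsigned long orig = read_ul(MEMCG "/memory.high");

	/*
	 * Temporary limit: allocations in this window can hit memory.high,
	 * causing MEMCG_HIGH events, PSI pressure and possible over-reclaim.
	 */
	write_ul(MEMCG "/memory.high", cur - want);
	write_ul(MEMCG "/memory.high", orig);			/* restore */

	/* With the proposed interface the same request is a single write. */
	write_ul(MEMCG "/memory.reclaim", want);
	return 0;
}
```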
On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > Last but not least the memcg > > background reclaim is something that should be possible without a new > > interface. > > So, it comes down to adding more functionality/semantics to > memory.high or introducing a new simple interface. I am fine with > either of one but IMO convoluted memory.high might have a higher > maintenance cost. One idea would be to schedule a background worker (which work on behalf on the memcg) to do the high limit reclaim with high limit target as soon as the high limit is reached. There would be one work item for each memcg. Userspace would recheck the high limit on return to the userspace and do the reclaim if the excess is larger than a threshold, and sleep as the fallback. Excessive consumers would get throttled if the background work cannot keep up with the charge pace and most of them would return without doing any reclaim because there is somebody working on their behalf - and is accounted for that. The semantic of high limit would be preserved IMHO because high limit is actively throttled. Where that work is done shouldn't matter as long as it is accounted properly and memcg cannot outsource all the work to the rest of the system. Would something like that (with many details to be sorted out of course) be feasible? If we do not want to change the existing semantic of high and want a new api then I think having another limit for the background reclaim then that would make more sense to me. It would resemble the global reclaim and kswapd model and something that would be easier to reason about. Comparing to echo $N > reclaim which might mean to reclaim any number pages around N.
On Tue, Sep 22, 2020 at 12:09 PM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > Last but not least the memcg > > > background reclaim is something that should be possible without a new > > > interface. > > > > So, it comes down to adding more functionality/semantics to > > memory.high or introducing a new simple interface. I am fine with > > either of one but IMO convoluted memory.high might have a higher > > maintenance cost. > > One idea would be to schedule a background worker (which work on behalf > on the memcg) to do the high limit reclaim with high limit target as > soon as the high limit is reached. There would be one work item for each > memcg. Userspace would recheck the high limit on return to the userspace > and do the reclaim if the excess is larger than a threshold, and sleep > as the fallback. > > Excessive consumers would get throttled if the background work cannot > keep up with the charge pace and most of them would return without doing > any reclaim because there is somebody working on their behalf - and is > accounted for that. > > The semantic of high limit would be preserved IMHO because high limit is > actively throttled. Where that work is done shouldn't matter as long as > it is accounted properly and memcg cannot outsource all the work to the > rest of the system. > > Would something like that (with many details to be sorted out of course) > be feasible? This is exactly how our "per-memcg kswapd" works. The missing piece is how to account the background worker (it is a kernel work thread) properly as what we discussed before. You mentioned such work is WIP in earlier email of this thread, I think once this is done the per-memcg background worker could be supported easily. > > If we do not want to change the existing semantic of high and want a new > api then I think having another limit for the background reclaim then > that would make more sense to me. It would resemble the global reclaim > and kswapd model and something that would be easier to reason about. > Comparing to echo $N > reclaim which might mean to reclaim any number > pages around N. > -- > Michal Hocko > SUSE Labs
On Tue, Sep 22, 2020 at 12:09 PM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 22-09-20 11:10:17, Shakeel Butt wrote: > > On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko <mhocko@suse.com> wrote: > [...] > > > Last but not least the memcg > > > background reclaim is something that should be possible without a new > > > interface. > > > > So, it comes down to adding more functionality/semantics to > > memory.high or introducing a new simple interface. I am fine with > > either of one but IMO convoluted memory.high might have a higher > > maintenance cost. > > One idea would be to schedule a background worker (which work on behalf > on the memcg) to do the high limit reclaim with high limit target as > soon as the high limit is reached. There would be one work item for each > memcg. Userspace would recheck the high limit on return to the userspace > and do the reclaim if the excess is larger than a threshold, and sleep > as the fallback. > > Excessive consumers would get throttled if the background work cannot > keep up with the charge pace and most of them would return without doing > any reclaim because there is somebody working on their behalf - and is > accounted for that. > > The semantic of high limit would be preserved IMHO because high limit is > actively throttled. Where that work is done shouldn't matter as long as > it is accounted properly and memcg cannot outsource all the work to the > rest of the system. > > Would something like that (with many details to be sorted out of course) > be feasible? > Well what about the proactive reclaim use-case? You are targeting only uswapd/background-reclaim use-case. > If we do not want to change the existing semantic of high and want a new > api then I think having another limit for the background reclaim then > that would make more sense to me. It would resemble the global reclaim > and kswapd model and something that would be easier to reason about. > Comparing to echo $N > reclaim which might mean to reclaim any number > pages around N. > -- I am not really against the approach you are proposing but "echo $N > reclaim" allows more flexibility and enables more use-cases.
Hello, I apologize for the late reply. The proposed interface has been an ongoing topic and area of experimentation within Facebook as well, which makes it a bit difficult to respond with certainty here. I agree with both your usecases. They apply to us as well. We currently make two small changes to our kernel to solve them. They work okay-ish in our production environment, but they aren't quite there yet, and not ready for upstream. Some thoughts and comments below. On Wed, Sep 09, 2020 at 02:57:52PM -0700, Shakeel Butt wrote: > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > Use cases: > ---------- > > 1) Per-memcg uswapd: > > Usually applications consists of combination of latency sensitive and > latency tolerant tasks. For example, tasks serving user requests vs > tasks doing data backup for a database application. At the moment the > kernel does not differentiate between such tasks when the application > hits the memcg limits. So, potentially a latency sensitive user facing > task can get stuck in high reclaim and be throttled by the kernel. > > Similarly there are cases of single process applications having two set > of thread pools where threads from one pool have high scheduling > priority and low latency requirement. One concrete example from our > production is the VMM which have high priority low latency thread pool > for the VCPUs while separate thread pool for stats reporting, I/O > emulation, health checks and other managerial operations. The kernel > memory reclaim does not differentiate between VCPU thread or a > non-latency sensitive thread and a VCPU thread can get stuck in high > reclaim. > > One way to resolve this issue is to preemptively trigger the memory > reclaim from a latency tolerant task (uswapd) when the application is > near the limits. Finding 'near the limits' situation is an orthogonal > problem. I don't think a userspace implementation is suitable for this purpose. Kswapd-style background reclaim is beneficial to probably 99% of all workloads. Because doing reclaim inside the execution stream of the workload itself is so unnecessary in a multi-CPU environment, whether the workload is particularly latency sensitive or only cares about overall throughput. In most cases, spare cores are available to do this work concurrently, and the buffer memory required between the workload and the async reclaimer tends to be negligible. Requiring non-trivial userspace participation for such a basic optimization does not seem like a good idea to me. We'd probably end up with four or five hyperscalers having four or five different implementations, and not much user coverage beyond that. I floated this patch before: https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ It's blocked on infrastructure work in the CPU controller that allows accounting CPU cycles spent on behalf of other cgroups. But we need this functionality in other places as well - network, async filesystem encryption, various other stuff bounced to workers. > 2) Proactive reclaim: > > This is a similar to the previous use-case, the difference is instead of > waiting for the application to be near its limit to trigger memory > reclaim, continuously pressuring the memcg to reclaim a small amount of > memory. This gives more accurate and uptodate workingset estimation as > the LRUs are continuously sorted and can potentially provide more > deterministic memory overcommit behavior. 
The memory overcommit > controller can provide more proactive response to the changing behavior > of the running applications instead of being reactive. This is an important usecase for us as well. And we use it not just to keep the LRUs warm, but to actively sample the workingset size - the true amount of memory required, trimmed of all its unused cache and cold pages that can be read back from disk on demand. For this purpose, we're essentially using memory.high right now. The only modification we make here is adding a memory.high.tmp variant that takes a timeout argument in addition to the limit. This ensures we don't leave an unsafe limit behind if the userspace daemon crashes. We have experienced some of the problems you describe below with it. > Benefit of user space solution: > ------------------------------- > > 1) More flexible on who should be charged for the cpu of the memory > reclaim. For proactive reclaim, it makes more sense to centralized the > overhead while for uswapd, it makes more sense for the application to > pay for the cpu of the memory reclaim. Agreed on both counts. > 2) More flexible on dedicating the resources (like cpu). The memory > overcommit controller can balance the cost between the cpu usage and > the memory reclaimed. This could use some elaboration, I think. > 3) Provides a way to the applications to keep their LRUs sorted, so, > under memory pressure better reclaim candidates are selected. This also > gives more accurate and uptodate notion of working set for an > application. That's a valid argument for proactive reclaim, and I agree with it. But not necessarily an argument for which part of the proactive reclaim logic should be in-kernel and which should be in userspace. > Questions: > ---------- > > 1) Why memory.high is not enough? > > memory.high can be used to trigger reclaim in a memcg and can > potentially be used for proactive reclaim as well as uswapd use cases. > However there is a big negative in using memory.high. It can potentially > introduce high reclaim stalls in the target application as the > allocations from the processes or the threads of the application can hit > the temporary memory.high limit. That's something we have run into as well. Async memory.high reclaim helps, but when proactive reclaim does bigger steps and lowers the limit below the async reclaim buffer, the workload can still enter direct reclaim. This is undesirable. > Another issue with memory.high is that it is not delegatable. To > actually use this interface for uswapd, the application has to introduce > another layer of cgroup on whose memory.high it has write access. Fair enough. I would generalize that and say that limiting the maximum container size and driving proactive reclaim are separate jobs, with separate goals, happening at different layers of the system. Sharing a single control knob for that can be a coordination nightmare. > 2) Why uswapd safe from self induced reclaim? > > This is very similar to the scenario of oomd under global memory > pressure. We can use the similar mechanisms to protect uswapd from self > induced reclaim i.e. memory.min and mlock. Agreed. > Interface options: > ------------------ > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > trigger reclaim in the target memory cgroup. This gets the assigning and attribution of targeted reclaim work right (although it doesn't solve kswapd cycle attribution yet). However, it also ditches the limit semantics, which themselves aren't actually a problem. 
And indeed I would argue have some upsides. As per above, the kernel knows best when and how much to reclaim to match the allocation rate, since it's in control of the allocation path. To do proactive reclaim with the memory.reclaim interface, you would need to monitor memory consumption closely. Workloads may not allocate anything for hours, and then suddenly allocate gigabytes within seconds. A sudden onset of streaming reads through the filesystem could destroy the workingset measurements, whereas a limit would catch it and do drop-behind (and thus workingset sampling) at the exact rate of allocations. Again I believe something that may be doable as a hyperscale operator, but likely too fragile to get wider applications beyond that. My take is that a proactive reclaim feature, whose goal is never to thrash or punish but to keep the LRUs warm and the workingset trimmed, would ideally have: - a pressure or size target specified by userspace but with enforcement driven inside the kernel from the allocation path - the enforcement work NOT be done synchronously by the workload (something I'd argue we want for *all* memory limits) - the enforcement work ACCOUNTED to the cgroup, though, since it's the cgroup's memory allocations causing the work (again something I'd argue we want in general) - a delegatable knob that is independent of setting the maximum size of a container, as that expresses a different type of policy - if size target, self-limiting (ha) enforcement on a pressure threshold or stop enforcement when the userspace component dies Thoughts?
On Mon 28-09-20 17:02:16, Johannes Weiner wrote: [...] > My take is that a proactive reclaim feature, whose goal is never to > thrash or punish but to keep the LRUs warm and the workingset trimmed, > would ideally have: > > - a pressure or size target specified by userspace but with > enforcement driven inside the kernel from the allocation path > > - the enforcement work NOT be done synchronously by the workload > (something I'd argue we want for *all* memory limits) > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > cgroup's memory allocations causing the work (again something I'd > argue we want in general) > > - a delegatable knob that is independent of setting the maximum size > of a container, as that expresses a different type of policy > > - if size target, self-limiting (ha) enforcement on a pressure > threshold or stop enforcement when the userspace component dies > > Thoughts? Agreed with above points. What do you think about http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. I assume that you do not want to override memory.high to implement this because that tends to be tricky from the configuration POV as you mentioned above. But a new limit (memory.middle for a lack of a better name) to define the background reclaim sounds like a good fit with above points.
On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote: > On Mon 28-09-20 17:02:16, Johannes Weiner wrote: > [...] > > My take is that a proactive reclaim feature, whose goal is never to > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > would ideally have: > > > > - a pressure or size target specified by userspace but with > > enforcement driven inside the kernel from the allocation path > > > > - the enforcement work NOT be done synchronously by the workload > > (something I'd argue we want for *all* memory limits) > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > cgroup's memory allocations causing the work (again something I'd > > argue we want in general) > > > > - a delegatable knob that is independent of setting the maximum size > > of a container, as that expresses a different type of policy > > > > - if size target, self-limiting (ha) enforcement on a pressure > > threshold or stop enforcement when the userspace component dies > > > > Thoughts? > > Agreed with above points. What do you think about > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. I definitely agree with what you wrote in this email for background reclaim. Indeed, your description sounds like what I proposed in https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ - what's missing from that patch is proper work attribution. > I assume that you do not want to override memory.high to implement > this because that tends to be tricky from the configuration POV as > you mentioned above. But a new limit (memory.middle for a lack of a > better name) to define the background reclaim sounds like a good fit > with above points. I can see that with a new memory.middle you could kind of sort of do both - background reclaim and proactive reclaim. That said, I do see advantages in keeping them separate: 1. Background reclaim is essentially an allocation optimization that we may want to provide per default, just like kswapd. Kswapd is tweakable of course, but I think actually few users do, and it works pretty well out of the box. It would be nice to provide the same thing on a per-cgroup basis per default and not ask users to make decisions that we are generally better at making. 2. Proactive reclaim may actually be better configured through a pressure threshold rather than a size target. As per above, the goal is not to be punitive or containing. The goal is to keep the LRUs warm and move the colder pages to disk. But how aggressively do you run reclaim for this purpose? What target value should a user write to such a memory.middle file? For one, it depends on the job. A batch job, or a less important background job, may tolerate higher paging overhead than an interactive job. That means more of its pages could be trimmed from RAM and reloaded on-demand from disk. But also, it depends on the storage device. If you move a workload from a machine with a slow disk to a machine with a fast disk, you can page more data in the same amount of time. That means while your workload tolerances stays the same, the faster the disk, the more aggressively you can do reclaim and offload memory. So again, what should a user write to such a control file? Of course, you can approximate an optimal target size for the workload. 
You can run a manual workingset analysis with page_idle, damon, or similar, determine a hot/cold cutoff based on what you know about the storage characteristics, then echo a number of pages or a size target into a cgroup file and let kernel do the reclaim accordingly. The drawbacks are that the kernel LRU may do a different hot/cold classification than you did and evict the wrong pages, the storage device latencies may vary based on overall IO pattern, and two equally warm pages may have very different paging overhead depending on whether readahead can avert a major fault or not. So it's easy to overshoot the tolerance target and disrupt the workload, or undershoot and have stale LRU data, waste memory etc. You can also do a feedback loop, where you guess an optimal size, then adjust based on how much paging overhead the workload is experiencing, i.e. memory pressure. The drawbacks are that you have to monitor pressure closely and react quickly when the workload is expanding, as it can be potentially sensitive to latencies in the usec range. This can be tricky to do from userspace. So instead of asking users for a target size whose suitability heavily depends on the kernel's LRU implementation, the readahead code, the IO device's capability and general load, why not directly ask the user for a pressure level that the workload is comfortable with and which captures all of the above factors implicitly? Then let the kernel do this feedback loop from a per-cgroup worker.
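The userspace feedback loop mentioned above, in its simplest form, looks something like the sketch below. It is only an illustration: the cgroup path, tolerance value and step sizes are invented, and the reclaim request assumes the memory.reclaim interface proposed in this series (driving a temporary memory.high instead would follow the same shape). The coarse polling interval is precisely the reaction-time problem described above, which is the argument for running this loop in the kernel.

    import time

    CG = "/sys/fs/cgroup/job"               # hypothetical cgroup path
    TOLERANCE = 1.0                         # hypothetical target: 1% "some" memory stall (avg10)

    def psi_some_avg10():
        # First line of memory.pressure: some avg10=0.12 avg60=0.08 avg300=0.02 total=...
        with open(f"{CG}/memory.pressure") as f:
            fields = f.readline().split()[1:]
        return float(dict(kv.split("=") for kv in fields)["avg10"])

    step = 16 << 20                         # initial guess: trim 16M per iteration
    while True:
        if psi_some_avg10() < TOLERANCE:
            step = min(step * 2, 256 << 20)     # workload is comfortable: trim harder
            with open(f"{CG}/memory.reclaim", "w") as f:
                f.write(str(step))
        else:
            step = max(step // 2, 4 << 20)      # paging overhead too high: back off
        time.sleep(10)                      # too slow to catch a rapidly expanding workload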
Hi Johannes, On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Hello, > > I apologize for the late reply. The proposed interface has been an > ongoing topic and area of experimentation within Facebook as well, > which makes it a bit difficult to respond with certainty here. > > I agree with both your usecases. They apply to us as well. We > currently make two small changes to our kernel to solve them. They > work okay-ish in our production environment, but they aren't quite > there yet, and not ready for upstream. > > Some thoughts and comments below. > > On Wed, Sep 09, 2020 at 02:57:52PM -0700, Shakeel Butt wrote: > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup. > > > > Use cases: > > ---------- > > > > 1) Per-memcg uswapd: > > > > Usually applications consists of combination of latency sensitive and > > latency tolerant tasks. For example, tasks serving user requests vs > > tasks doing data backup for a database application. At the moment the > > kernel does not differentiate between such tasks when the application > > hits the memcg limits. So, potentially a latency sensitive user facing > > task can get stuck in high reclaim and be throttled by the kernel. > > > > Similarly there are cases of single process applications having two set > > of thread pools where threads from one pool have high scheduling > > priority and low latency requirement. One concrete example from our > > production is the VMM which have high priority low latency thread pool > > for the VCPUs while separate thread pool for stats reporting, I/O > > emulation, health checks and other managerial operations. The kernel > > memory reclaim does not differentiate between VCPU thread or a > > non-latency sensitive thread and a VCPU thread can get stuck in high > > reclaim. > > > > One way to resolve this issue is to preemptively trigger the memory > > reclaim from a latency tolerant task (uswapd) when the application is > > near the limits. Finding 'near the limits' situation is an orthogonal > > problem. > > I don't think a userspace implementation is suitable for this purpose. > > Kswapd-style background reclaim is beneficial to probably 99% of all > workloads. Because doing reclaim inside the execution stream of the > workload itself is so unnecessary in a multi-CPU environment, whether > the workload is particularly latency sensitive or only cares about > overall throughput. In most cases, spare cores are available to do > this work concurrently, and the buffer memory required between the > workload and the async reclaimer tends to be negligible. > > Requiring non-trivial userspace participation for such a basic > optimization does not seem like a good idea to me. We'd probably end > up with four or five hyperscalers having four or five different > implementations, and not much user coverage beyond that. > I understand your point and having an out of the box kernel-based solution would be more helpful for the users. > I floated this patch before: > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ > > It's blocked on infrastructure work in the CPU controller that allows > accounting CPU cycles spent on behalf of other cgroups. But we need > this functionality in other places as well - network, async filesystem > encryption, various other stuff bounced to workers. 
> > > 2) Proactive reclaim: > > > > This is a similar to the previous use-case, the difference is instead of > > waiting for the application to be near its limit to trigger memory > > reclaim, continuously pressuring the memcg to reclaim a small amount of > > memory. This gives more accurate and uptodate workingset estimation as > > the LRUs are continuously sorted and can potentially provide more > > deterministic memory overcommit behavior. The memory overcommit > > controller can provide more proactive response to the changing behavior > > of the running applications instead of being reactive. > > This is an important usecase for us as well. And we use it not just to > keep the LRUs warm, but to actively sample the workingset size - the > true amount of memory required, trimmed of all its unused cache and > cold pages that can be read back from disk on demand. > > For this purpose, we're essentially using memory.high right now. > > The only modification we make here is adding a memory.high.tmp variant > that takes a timeout argument in addition to the limit. This ensures > we don't leave an unsafe limit behind if the userspace daemon crashes. > > We have experienced some of the problems you describe below with it. > > > Benefit of user space solution: > > ------------------------------- > > > > 1) More flexible on who should be charged for the cpu of the memory > > reclaim. For proactive reclaim, it makes more sense to centralized the > > overhead while for uswapd, it makes more sense for the application to > > pay for the cpu of the memory reclaim. > > Agreed on both counts. > > > 2) More flexible on dedicating the resources (like cpu). The memory > > overcommit controller can balance the cost between the cpu usage and > > the memory reclaimed. > > This could use some elaboration, I think. > My point was from the resource planning perspective i.e. have flexibility on the amount of resources (CPU) to dedicate for proactive reclaim. > > 3) Provides a way to the applications to keep their LRUs sorted, so, > > under memory pressure better reclaim candidates are selected. This also > > gives more accurate and uptodate notion of working set for an > > application. > > That's a valid argument for proactive reclaim, and I agree with > it. But not necessarily an argument for which part of the proactive > reclaim logic should be in-kernel and which should be in userspace. > > > Questions: > > ---------- > > > > 1) Why memory.high is not enough? > > > > memory.high can be used to trigger reclaim in a memcg and can > > potentially be used for proactive reclaim as well as uswapd use cases. > > However there is a big negative in using memory.high. It can potentially > > introduce high reclaim stalls in the target application as the > > allocations from the processes or the threads of the application can hit > > the temporary memory.high limit. > > That's something we have run into as well. Async memory.high reclaim > helps, but when proactive reclaim does bigger steps and lowers the > limit below the async reclaim buffer, the workload can still enter > direct reclaim. This is undesirable. > > > Another issue with memory.high is that it is not delegatable. To > > actually use this interface for uswapd, the application has to introduce > > another layer of cgroup on whose memory.high it has write access. > > Fair enough. 
> > I would generalize that and say that limiting the maximum container > size and driving proactive reclaim are separate jobs, with separate > goals, happening at different layers of the system. Sharing a single > control knob for that can be a coordination nightmare. > Agreed. > > 2) Why uswapd safe from self induced reclaim? > > > > This is very similar to the scenario of oomd under global memory > > pressure. We can use the similar mechanisms to protect uswapd from self > > induced reclaim i.e. memory.min and mlock. > > Agreed. > > > Interface options: > > ------------------ > > > > Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to > > trigger reclaim in the target memory cgroup. > > This gets the assigning and attribution of targeted reclaim work right > (although it doesn't solve kswapd cycle attribution yet). > > However, it also ditches the limit semantics, which themselves aren't > actually a problem. And indeed I would argue have some upsides. > > As per above, the kernel knows best when and how much to reclaim to > match the allocation rate, since it's in control of the allocation > path. To do proactive reclaim with the memory.reclaim interface, you > would need to monitor memory consumption closely. To calculate the amount to reclaim with the memory.reclaim interface in production, we actually use two sources of information, refault rate and idle age histogram (extracted from a more efficient version of Page Idle Tracking). > Workloads may not > allocate anything for hours, and then suddenly allocate gigabytes > within seconds. A sudden onset of streaming reads through the > filesystem could destroy the workingset measurements, whereas a limit > would catch it and do drop-behind (and thus workingset sampling) at > the exact rate of allocations. > > Again I believe something that may be doable as a hyperscale operator, > but likely too fragile to get wider applications beyond that. > > My take is that a proactive reclaim feature, whose goal is never to > thrash or punish but to keep the LRUs warm and the workingset trimmed, > would ideally have: > > - a pressure or size target specified by userspace but with > enforcement driven inside the kernel from the allocation path > > - the enforcement work NOT be done synchronously by the workload > (something I'd argue we want for *all* memory limits) > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > cgroup's memory allocations causing the work (again something I'd > argue we want in general) For this point I think we want more flexibility to control the resources we want to dedicate for proactive reclaim. One particular example from our production is the batch jobs with high memory footprint. These jobs don't have enough CPU quota but we do want to proactively reclaim from them. We would prefer to dedicate some amount of CPU to proactively reclaim from them independent of their own CPU quota. > > - a delegatable knob that is independent of setting the maximum size > of a container, as that expresses a different type of policy > > - if size target, self-limiting (ha) enforcement on a pressure > threshold or stop enforcement when the userspace component dies > > Thoughts?
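Of the two signals mentioned above, the refault rate is the one that can be derived from standard cgroup v2 counters; the sketch below shows one way, with an invented cgroup path and sampling interval (the idle-age histogram side is omitted, since it relies on page idle tracking changes not described here).

    import time

    CG = "/sys/fs/cgroup/job"               # hypothetical cgroup path
    INTERVAL = 60                           # sampling window in seconds, chosen arbitrarily

    def refaults():
        # Sum the workingset_refault* counters from memory.stat (a single counter
        # on kernels of this era, split into anon/file variants later).
        total = 0
        with open(f"{CG}/memory.stat") as f:
            for line in f:
                key, value = line.split()
                if key.startswith("workingset_refault"):
                    total += int(value)
        return total

    prev = refaults()
    while True:
        time.sleep(INTERVAL)
        cur = refaults()
        rate = (cur - prev) / INTERVAL      # refaults per second over the window
        prev = cur
        # A high rate means the last reclaim step cut into the workingset;
        # a rate near zero suggests the next step can be more aggressive.
        print(f"refault rate: {rate:.1f}/s")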
On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote: > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote: > > [...] > > > My take is that a proactive reclaim feature, whose goal is never to > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > would ideally have: > > > > > > - a pressure or size target specified by userspace but with > > > enforcement driven inside the kernel from the allocation path > > > > > > - the enforcement work NOT be done synchronously by the workload > > > (something I'd argue we want for *all* memory limits) > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > cgroup's memory allocations causing the work (again something I'd > > > argue we want in general) > > > > > > - a delegatable knob that is independent of setting the maximum size > > > of a container, as that expresses a different type of policy > > > > > > - if size target, self-limiting (ha) enforcement on a pressure > > > threshold or stop enforcement when the userspace component dies > > > > > > Thoughts? > > > > Agreed with above points. What do you think about > > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. > > I definitely agree with what you wrote in this email for background > reclaim. Indeed, your description sounds like what I proposed in > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ > - what's missing from that patch is proper work attribution. > > > I assume that you do not want to override memory.high to implement > > this because that tends to be tricky from the configuration POV as > > you mentioned above. But a new limit (memory.middle for a lack of a > > better name) to define the background reclaim sounds like a good fit > > with above points. > > I can see that with a new memory.middle you could kind of sort of do > both - background reclaim and proactive reclaim. > > That said, I do see advantages in keeping them separate: > > 1. Background reclaim is essentially an allocation optimization that > we may want to provide per default, just like kswapd. > > Kswapd is tweakable of course, but I think actually few users do, > and it works pretty well out of the box. It would be nice to > provide the same thing on a per-cgroup basis per default and not > ask users to make decisions that we are generally better at making. > > 2. Proactive reclaim may actually be better configured through a > pressure threshold rather than a size target. > > As per above, the goal is not to be punitive or containing. The > goal is to keep the LRUs warm and move the colder pages to disk. > > But how aggressively do you run reclaim for this purpose? What > target value should a user write to such a memory.middle file? > > For one, it depends on the job. A batch job, or a less important > background job, may tolerate higher paging overhead than an > interactive job. That means more of its pages could be trimmed from > RAM and reloaded on-demand from disk. > > But also, it depends on the storage device. If you move a workload > from a machine with a slow disk to a machine with a fast disk, you > can page more data in the same amount of time. That means while > your workload tolerances stays the same, the faster the disk, the > more aggressively you can do reclaim and offload memory. > > So again, what should a user write to such a control file? 
> > Of course, you can approximate an optimal target size for the > workload. You can run a manual workingset analysis with page_idle, > damon, or similar, determine a hot/cold cutoff based on what you > know about the storage characteristics, then echo a number of pages > or a size target into a cgroup file and let kernel do the reclaim > accordingly. The drawbacks are that the kernel LRU may do a > different hot/cold classification than you did and evict the wrong > pages, the storage device latencies may vary based on overall IO > pattern, and two equally warm pages may have very different paging > overhead depending on whether readahead can avert a major fault or > not. So it's easy to overshoot the tolerance target and disrupt the > workload, or undershoot and have stale LRU data, waste memory etc. > > You can also do a feedback loop, where you guess an optimal size, > then adjust based on how much paging overhead the workload is > experiencing, i.e. memory pressure. The drawbacks are that you have > to monitor pressure closely and react quickly when the workload is > expanding, as it can be potentially sensitive to latencies in the > usec range. This can be tricky to do from userspace. > This is actually what we do in our production i.e. feedback loop to adjust the next iteration of proactive reclaim. We eliminated the IO or slow disk issues you mentioned by only focusing on anon memory and doing zswap. > So instead of asking users for a target size whose suitability > heavily depends on the kernel's LRU implementation, the readahead > code, the IO device's capability and general load, why not directly > ask the user for a pressure level that the workload is comfortable > with and which captures all of the above factors implicitly? Then > let the kernel do this feedback loop from a per-cgroup worker. I am assuming here by pressure level you are referring to the PSI like interface e.g. allowing the users to tell about their jobs that X amount of stalls in a fixed time window is tolerable. Seems promising though I would like flexibility for giving the resource to the per-cgroup worker. Are you planning to work on this or should I give it a try?
On Wed, Sep 30, 2020 at 08:45:17AM -0700, Shakeel Butt wrote: > On Tue, Sep 29, 2020 at 2:55 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Sep 29, 2020 at 05:04:44PM +0200, Michal Hocko wrote: > > > On Mon 28-09-20 17:02:16, Johannes Weiner wrote: > > > [...] > > > > My take is that a proactive reclaim feature, whose goal is never to > > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > > would ideally have: > > > > > > > > - a pressure or size target specified by userspace but with > > > > enforcement driven inside the kernel from the allocation path > > > > > > > > - the enforcement work NOT be done synchronously by the workload > > > > (something I'd argue we want for *all* memory limits) > > > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > > cgroup's memory allocations causing the work (again something I'd > > > > argue we want in general) > > > > > > > > - a delegatable knob that is independent of setting the maximum size > > > > of a container, as that expresses a different type of policy > > > > > > > > - if size target, self-limiting (ha) enforcement on a pressure > > > > threshold or stop enforcement when the userspace component dies > > > > > > > > Thoughts? > > > > > > Agreed with above points. What do you think about > > > http://lkml.kernel.org/r/20200922190859.GH12990@dhcp22.suse.cz. > > > > I definitely agree with what you wrote in this email for background > > reclaim. Indeed, your description sounds like what I proposed in > > https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/ > > - what's missing from that patch is proper work attribution. > > > > > I assume that you do not want to override memory.high to implement > > > this because that tends to be tricky from the configuration POV as > > > you mentioned above. But a new limit (memory.middle for a lack of a > > > better name) to define the background reclaim sounds like a good fit > > > with above points. > > > > I can see that with a new memory.middle you could kind of sort of do > > both - background reclaim and proactive reclaim. > > > > That said, I do see advantages in keeping them separate: > > > > 1. Background reclaim is essentially an allocation optimization that > > we may want to provide per default, just like kswapd. > > > > Kswapd is tweakable of course, but I think actually few users do, > > and it works pretty well out of the box. It would be nice to > > provide the same thing on a per-cgroup basis per default and not > > ask users to make decisions that we are generally better at making. > > > > 2. Proactive reclaim may actually be better configured through a > > pressure threshold rather than a size target. > > > > As per above, the goal is not to be punitive or containing. The > > goal is to keep the LRUs warm and move the colder pages to disk. > > > > But how aggressively do you run reclaim for this purpose? What > > target value should a user write to such a memory.middle file? > > > > For one, it depends on the job. A batch job, or a less important > > background job, may tolerate higher paging overhead than an > > interactive job. That means more of its pages could be trimmed from > > RAM and reloaded on-demand from disk. > > > > But also, it depends on the storage device. If you move a workload > > from a machine with a slow disk to a machine with a fast disk, you > > can page more data in the same amount of time. 
That means while > > your workload tolerances stays the same, the faster the disk, the > > more aggressively you can do reclaim and offload memory. > > > > So again, what should a user write to such a control file? > > > > Of course, you can approximate an optimal target size for the > > workload. You can run a manual workingset analysis with page_idle, > > damon, or similar, determine a hot/cold cutoff based on what you > > know about the storage characteristics, then echo a number of pages > > or a size target into a cgroup file and let kernel do the reclaim > > accordingly. The drawbacks are that the kernel LRU may do a > > different hot/cold classification than you did and evict the wrong > > pages, the storage device latencies may vary based on overall IO > > pattern, and two equally warm pages may have very different paging > > overhead depending on whether readahead can avert a major fault or > > not. So it's easy to overshoot the tolerance target and disrupt the > > workload, or undershoot and have stale LRU data, waste memory etc. > > > > You can also do a feedback loop, where you guess an optimal size, > > then adjust based on how much paging overhead the workload is > > experiencing, i.e. memory pressure. The drawbacks are that you have > > to monitor pressure closely and react quickly when the workload is > > expanding, as it can be potentially sensitive to latencies in the > > usec range. This can be tricky to do from userspace. > > > > This is actually what we do in our production i.e. feedback loop to > adjust the next iteration of proactive reclaim. That's what we do also right now. It works reasonably well, the only two pain points are/have been the reaction time under quick workload expansion and inadvertently forcing the workload into direct reclaim. > We eliminated the IO or slow disk issues you mentioned by only > focusing on anon memory and doing zswap. Interesting, may I ask how the file cache is managed in this setup? > > So instead of asking users for a target size whose suitability > > heavily depends on the kernel's LRU implementation, the readahead > > code, the IO device's capability and general load, why not directly > > ask the user for a pressure level that the workload is comfortable > > with and which captures all of the above factors implicitly? Then > > let the kernel do this feedback loop from a per-cgroup worker. > > I am assuming here by pressure level you are referring to the PSI like > interface e.g. allowing the users to tell about their jobs that X > amount of stalls in a fixed time window is tolerable. Right, essentially the same parameters that psi poll() would take.
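For reference, those parameters are what the existing PSI trigger interface already takes: a stall threshold and a time window, both in microseconds, written to the cgroup's memory.pressure file and then waited on with poll(). A minimal sketch, with a hypothetical cgroup path and made-up numbers:

    import select

    CG = "/sys/fs/cgroup/job"               # hypothetical cgroup path

    # "X amount of stall in a fixed time window": here, 150ms of "some"
    # memory stall time within any 1s window, both given in microseconds.
    f = open(f"{CG}/memory.pressure", "r+")
    f.write("some 150000 1000000")
    f.flush()

    p = select.poll()
    p.register(f, select.POLLPRI)
    while True:
        for fd, event in p.poll():          # blocks until the threshold is crossed
            if event & select.POLLERR:
                raise SystemExit("cgroup was removed")
            # The job is stalling more than it said it can tolerate, so a
            # proactive reclaimer driven by this signal should back off here.
            print("memory pressure threshold breached")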
Hello Shakeel, On Wed, Sep 30, 2020 at 08:26:26AM -0700, Shakeel Butt wrote: > On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Workloads may not > > allocate anything for hours, and then suddenly allocate gigabytes > > within seconds. A sudden onset of streaming reads through the > > filesystem could destroy the workingset measurements, whereas a limit > > would catch it and do drop-behind (and thus workingset sampling) at > > the exact rate of allocations. > > > > Again I believe something that may be doable as a hyperscale operator, > > but likely too fragile to get wider applications beyond that. > > > > My take is that a proactive reclaim feature, whose goal is never to > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > would ideally have: > > > > - a pressure or size target specified by userspace but with > > enforcement driven inside the kernel from the allocation path > > > > - the enforcement work NOT be done synchronously by the workload > > (something I'd argue we want for *all* memory limits) > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > cgroup's memory allocations causing the work (again something I'd > > argue we want in general) > > For this point I think we want more flexibility to control the > resources we want to dedicate for proactive reclaim. One particular > example from our production is the batch jobs with high memory > footprint. These jobs don't have enough CPU quota but we do want to > proactively reclaim from them. We would prefer to dedicate some amount > of CPU to proactively reclaim from them independent of their own CPU > quota. Would it not work to add headroom for this reclaim overhead to the CPU quota of the job? The reason I'm asking is because reclaim is only one side of the proactive reclaim medal. The other side is taking faults and having to do IO and/or decompression (zswap, compressed btrfs) on the workload side. And that part is unavoidably consuming CPU and IO quota of the workload. So I wonder how much this can generally be separated out. It's certainly something we've been thinking about as well. Currently, because we use memory.high, we have all the reclaim work being done by a privileged daemon outside the cgroup, and the workload pressure only stems from the refault side. But that means a workload is consuming privileged CPU cycles, and the amount varies depending on the memory access patterns - how many rotations the reclaim scanner is doing etc. So I do wonder whether this "cost of business" of running a workload with a certain memory footprint should be accounted to the workload itself. Because at the end of the day, the CPU you have available will dictate how much memory you need, and both of these axes affect how you can schedule this job in a shared compute pool. Do neighboring jobs on the same host leave you either the memory for your colder pages, or the CPU (and IO) to trim them off? For illustration, compare extreme examples of this. A) A workload that has its executable/libraries and a fixed set of hot heap pages. Proactive reclaim will be relatively slow and cheap - a couple of deactivations/rotations. B) A workload that does high-speed streaming IO and generates a lot of drop-behind cache; or a workload that has a huge virtual anon set with lots of allocations and MADV_FREEing going on. Proactive reclaim will be fast and expensive. 
Even at the same memory target size, these two types of jobs have very different requirements for the host environment they can run on. It seems to me that this is a cost that should be captured in the job's overall resource footprint.
Hi Johannes, On Thu, Oct 1, 2020 at 8:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Hello Shakeel, > > On Wed, Sep 30, 2020 at 08:26:26AM -0700, Shakeel Butt wrote: > > On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > Workloads may not > > > allocate anything for hours, and then suddenly allocate gigabytes > > > within seconds. A sudden onset of streaming reads through the > > > filesystem could destroy the workingset measurements, whereas a limit > > > would catch it and do drop-behind (and thus workingset sampling) at > > > the exact rate of allocations. > > > > > > Again I believe something that may be doable as a hyperscale operator, > > > but likely too fragile to get wider applications beyond that. > > > > > > My take is that a proactive reclaim feature, whose goal is never to > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > would ideally have: > > > > > > - a pressure or size target specified by userspace but with > > > enforcement driven inside the kernel from the allocation path > > > > > > - the enforcement work NOT be done synchronously by the workload > > > (something I'd argue we want for *all* memory limits) > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > cgroup's memory allocations causing the work (again something I'd > > > argue we want in general) > > > > For this point I think we want more flexibility to control the > > resources we want to dedicate for proactive reclaim. One particular > > example from our production is the batch jobs with high memory > > footprint. These jobs don't have enough CPU quota but we do want to > > proactively reclaim from them. We would prefer to dedicate some amount > > of CPU to proactively reclaim from them independent of their own CPU > > quota. > > Would it not work to add headroom for this reclaim overhead to the CPU > quota of the job? > > The reason I'm asking is because reclaim is only one side of the > proactive reclaim medal. The other side is taking faults and having to > do IO and/or decompression (zswap, compressed btrfs) on the workload > side. And that part is unavoidably consuming CPU and IO quota of the > workload. So I wonder how much this can generally be separated out. > > It's certainly something we've been thinking about as well. Currently, > because we use memory.high, we have all the reclaim work being done by > a privileged daemon outside the cgroup, and the workload pressure only > stems from the refault side. > > But that means a workload is consuming privileged CPU cycles, and the > amount varies depending on the memory access patterns - how many > rotations the reclaim scanner is doing etc. > > So I do wonder whether this "cost of business" of running a workload > with a certain memory footprint should be accounted to the workload > itself. Because at the end of the day, the CPU you have available will > dictate how much memory you need, and both of these axes affect how > you can schedule this job in a shared compute pool. Do neighboring > jobs on the same host leave you either the memory for your colder > pages, or the CPU (and IO) to trim them off? > > For illustration, compare extreme examples of this. > > A) A workload that has its executable/libraries and a fixed > set of hot heap pages. Proactive reclaim will be relatively > slow and cheap - a couple of deactivations/rotations. 
> > B) A workload that does high-speed streaming IO and generates > a lot of drop-behind cache; or a workload that has a huge > virtual anon set with lots of allocations and MADV_FREEing > going on. Proactive reclaim will be fast and expensive. > > Even at the same memory target size, these two types of jobs have very > different requirements toward the host environment they can run on. > > It seems to me that this is cost that should be captured in the job's > overall resource footprint. I understand your point but from the usability perspective, I am finding it hard to deploy/use. As you said, the proactive reclaim cost will be different for different types of workload but I do not expect the job owners telling me how much headroom their jobs need. I would have to start with a fixed headroom for a job, have to monitor the resource usage of the proactive reclaim for it and dynamically adjust the headroom to not steal the CPU from the job (I am assuming there is no isolation between job and proactive reclaim). This seems very hard to use as compared to setting aside a fixed amount of CPU for proactive reclaim system wide. Please correct me if I am misunderstanding something.
On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > [snip] > > > So instead of asking users for a target size whose suitability > > > heavily depends on the kernel's LRU implementation, the readahead > > > code, the IO device's capability and general load, why not directly > > > ask the user for a pressure level that the workload is comfortable > > > with and which captures all of the above factors implicitly? Then > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > I am assuming here by pressure level you are referring to the PSI like > > interface e.g. allowing the users to tell about their jobs that X > > amount of stalls in a fixed time window is tolerable. > > Right, essentially the same parameters that psi poll() would take. I thought a bit more about the semantics of PSI usage for proactive reclaim. Suppose I have a top level cgroup A on which I want to enable proactive reclaim. Which memory PSI events should the proactive reclaim consider? The simplest would be the memory.psi at 'A'. However memory.psi is hierarchical and I would not really want the pressure due to limits in children of 'A' to impact the proactive reclaim. PSI due to refaults and slow IO should be included, or maybe only the ones caused by the proactive reclaim itself. I am undecided on the PSI due to compaction. PSI due to global reclaim for 'A' is even more complicated. That is a stall due to reclaiming from the system, including self, and it might not really cause more refaults and IOs for 'A'. Should proactive reclaim ignore the pressure due to global reclaim when tuning its aggressiveness? Am I overthinking here?
On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote: > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > [snip] > > > > So instead of asking users for a target size whose suitability > > > > heavily depends on the kernel's LRU implementation, the readahead > > > > code, the IO device's capability and general load, why not directly > > > > ask the user for a pressure level that the workload is comfortable > > > > with and which captures all of the above factors implicitly? Then > > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > > > I am assuming here by pressure level you are referring to the PSI like > > > interface e.g. allowing the users to tell about their jobs that X > > > amount of stalls in a fixed time window is tolerable. > > > > Right, essentially the same parameters that psi poll() would take. > > I thought a bit more on the semantics of the psi usage for the > proactive reclaim. > > Suppose I have a top level cgroup A on which I want to enable > proactive reclaim. Which memory psi events should the proactive > reclaim should consider? > > The simplest would be the memory.psi at 'A'. However memory.psi is > hierarchical and I would not really want the pressure due limits in > children of 'A' to impact the proactive reclaim. I don't think pressure from limits down the tree can be separated out, generally. All events are accounted recursively as well. Of course, we remember the reclaim level for evicted entries - but if there is reclaim triggered at A and A/B concurrently, the distribution of who ends up reclaiming the physical pages in A/B is pretty arbitrary/racy. If A/B decides to do its own proactive reclaim with the sublimit, and ends up consuming the pressure budget assigned to proactive reclaim in A, there isn't much that can be done. It's also possible that proactive reclaim in A keeps A/B from hitting its limit in the first place. I have to say, the configuration doesn't really strike me as sensible, though. Limits make sense for doing fixed partitioning: A gets 4G, A/B gets 2G out of that. But if you do proactive reclaim on A you're essentially saying A as a whole is auto-sizing dynamically based on its memory access pattern. I'm not sure what it means to then start doing fixed partitions in the sublevel. > PSI due to refaults and slow IO should be included or maybe only > those which are caused by the proactive reclaim itself. I am > undecided on the PSI due to compaction. PSI due to global reclaim > for 'A' is even more complicated. This is a stall due to reclaiming > from the system including self. It might not really cause more > refaults and IOs for 'A'. Should proactive reclaim ignore the > pressure due to global pressure when tuning its aggressiveness. Yeah, I think they should all be included, because ultimately what matters is what the workload can tolerate without sacrificing performance. Proactive reclaim can destroy THPs, so the cost of recreating them should be reflected. Otherwise you can easily overpressurize. For global reclaim, if you say you want a workload pressurized to X percent in order to drive the LRUs and chop off all cold pages the workload can live without, it doesn't matter who does the work. If there is an abundance of physical memory, it's going to be proactive reclaim. If physical memory is already tight enough that global reclaim does it for you, there is nothing to be done in addition, and proactive reclaim should hang back. 
Otherwise you can again easily overpressurize the workload.
On Mon, Oct 05, 2020 at 02:59:10PM -0700, Shakeel Butt wrote: > Hi Johannes, > > On Thu, Oct 1, 2020 at 8:12 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Hello Shakeel, > > > > On Wed, Sep 30, 2020 at 08:26:26AM -0700, Shakeel Butt wrote: > > > On Mon, Sep 28, 2020 at 2:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Workloads may not > > > > allocate anything for hours, and then suddenly allocate gigabytes > > > > within seconds. A sudden onset of streaming reads through the > > > > filesystem could destroy the workingset measurements, whereas a limit > > > > would catch it and do drop-behind (and thus workingset sampling) at > > > > the exact rate of allocations. > > > > > > > > Again I believe something that may be doable as a hyperscale operator, > > > > but likely too fragile to get wider applications beyond that. > > > > > > > > My take is that a proactive reclaim feature, whose goal is never to > > > > thrash or punish but to keep the LRUs warm and the workingset trimmed, > > > > would ideally have: > > > > > > > > - a pressure or size target specified by userspace but with > > > > enforcement driven inside the kernel from the allocation path > > > > > > > > - the enforcement work NOT be done synchronously by the workload > > > > (something I'd argue we want for *all* memory limits) > > > > > > > > - the enforcement work ACCOUNTED to the cgroup, though, since it's the > > > > cgroup's memory allocations causing the work (again something I'd > > > > argue we want in general) > > > > > > For this point I think we want more flexibility to control the > > > resources we want to dedicate for proactive reclaim. One particular > > > example from our production is the batch jobs with high memory > > > footprint. These jobs don't have enough CPU quota but we do want to > > > proactively reclaim from them. We would prefer to dedicate some amount > > > of CPU to proactively reclaim from them independent of their own CPU > > > quota. > > > > Would it not work to add headroom for this reclaim overhead to the CPU > > quota of the job? > > > > The reason I'm asking is because reclaim is only one side of the > > proactive reclaim medal. The other side is taking faults and having to > > do IO and/or decompression (zswap, compressed btrfs) on the workload > > side. And that part is unavoidably consuming CPU and IO quota of the > > workload. So I wonder how much this can generally be separated out. > > > > It's certainly something we've been thinking about as well. Currently, > > because we use memory.high, we have all the reclaim work being done by > > a privileged daemon outside the cgroup, and the workload pressure only > > stems from the refault side. > > > > But that means a workload is consuming privileged CPU cycles, and the > > amount varies depending on the memory access patterns - how many > > rotations the reclaim scanner is doing etc. > > > > So I do wonder whether this "cost of business" of running a workload > > with a certain memory footprint should be accounted to the workload > > itself. Because at the end of the day, the CPU you have available will > > dictate how much memory you need, and both of these axes affect how > > you can schedule this job in a shared compute pool. Do neighboring > > jobs on the same host leave you either the memory for your colder > > pages, or the CPU (and IO) to trim them off? > > > > For illustration, compare extreme examples of this. > > > > A) A workload that has its executable/libraries and a fixed > > set of hot heap pages. 
Proactive reclaim will be relatively > > slow and cheap - a couple of deactivations/rotations. > > > > B) A workload that does high-speed streaming IO and generates > > a lot of drop-behind cache; or a workload that has a huge > > virtual anon set with lots of allocations and MADV_FREEing > > going on. Proactive reclaim will be fast and expensive. > > > > Even at the same memory target size, these two types of jobs have very > > different requirements toward the host environment they can run on. > > > > It seems to me that this is cost that should be captured in the job's > > overall resource footprint. > > I understand your point but from the usability perspective, I am > finding it hard to deploy/use. > > As you said, the proactive reclaim cost will be different for > different types of workload but I do not expect the job owners telling > me how much headroom their jobs need. Isn't that the same for all work performed by the kernel? Instead of proactive reclaim, it could just be regular reclaim due to a limit, whose required headroom depends on the workload's allocation rate. We wouldn't question whether direct reclaim cycles should be charged to the cgroup. I'm not quite sure why proactive reclaim is different - it's the same work done earlier. > I would have to start with a fixed headroom for a job, have to monitor > the resource usage of the proactive reclaim for it and dynamically > adjust the headroom to not steal the CPU from the job (I am assuming > there is no isolation between job and proactive reclaim). > > This seems very hard to use as compared to setting aside a fixed > amount of CPU for proactive reclaim system wide. Please correct me if > I am misunderstanding something. I see your point, but I don't know how a fixed system-wide pool is easier to configure if you don't know the constituent consumers. How much would you set aside? A shared resource outside the natural cgroup hierarchy also triggers my priority inversion alarm bells. How do you prevent a lower priority job from consuming a disproportionate share of this pool? And as a result cause the reclaim in higher priority groups to slow down, which causes their memory footprint to expand and their LRUs to go stale. It also still leaves the question around IO budget. Even if you manage to not eat into the CPU budget of the job, you'd still eat into the IO budget of the job, and that's harder to separate out.
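For illustration, the status-quo approach discussed above - a privileged daemon outside the cgroup driving reclaim through memory.high - might look roughly like the sketch below. The cgroup path, the 4G target and the pacing are made-up assumptions, not part of this patch.

/*
 * Hedged sketch of the memory.high-driven proactive reclaim described
 * above: a privileged daemon outside the cgroup temporarily clamps
 * memory.high so the kernel reclaims the excess (in the daemon's
 * context), then lifts the clamp again. Path and values are assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	const char *high = "/sys/fs/cgroup/job/memory.high";	/* hypothetical path */

	/* Lowering memory.high makes the kernel reclaim down toward the new
	 * value in the context of this write; whatever remains above the
	 * limit throttles the workload's own allocations until usage drops. */
	if (write_file(high, "4G"))
		perror("clamp memory.high");

	sleep(1);	/* let reclaim and refaults play out */

	/* Remove the clamp so normal allocations are no longer throttled. */
	if (write_file(high, "max"))
		perror("restore memory.high");
	return 0;
}

This is the arrangement where the reclaim CPU cycles land on the daemon rather than on the workload, which is exactly the accounting question being debated above.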
On Thu, Oct 8, 2020 at 7:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote: > > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > [snip] > > > > > So instead of asking users for a target size whose suitability > > > > > heavily depends on the kernel's LRU implementation, the readahead > > > > > code, the IO device's capability and general load, why not directly > > > > > ask the user for a pressure level that the workload is comfortable > > > > > with and which captures all of the above factors implicitly? Then > > > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > > > > > I am assuming here by pressure level you are referring to the PSI like > > > > interface e.g. allowing the users to tell about their jobs that X > > > > amount of stalls in a fixed time window is tolerable. > > > > > > Right, essentially the same parameters that psi poll() would take. > > > > I thought a bit more on the semantics of the psi usage for the > > proactive reclaim. > > > > Suppose I have a top level cgroup A on which I want to enable > > proactive reclaim. Which memory psi events should the proactive > > reclaim should consider? > > > > The simplest would be the memory.psi at 'A'. However memory.psi is > > hierarchical and I would not really want the pressure due limits in > > children of 'A' to impact the proactive reclaim. > > I don't think pressure from limits down the tree can be separated out, > generally. All events are accounted recursively as well. Of course, we > remember the reclaim level for evicted entries - but if there is > reclaim triggered at A and A/B concurrently, the distribution of who > ends up reclaiming the physical pages in A/B is pretty arbitrary/racy. > > If A/B decides to do its own proactive reclaim with the sublimit, and > ends up consuming the pressure budget assigned to proactive reclaim in > A, there isn't much that can be done. > > It's also possible that proactive reclaim in A keeps A/B from hitting > its limit in the first place. > > I have to say, the configuration doesn't really strike me as sensible, > though. Limits make sense for doing fixed partitioning: A gets 4G, A/B > gets 2G out of that. But if you do proactive reclaim on A you're > essentially saying A as a whole is auto-sizing dynamically based on > its memory access pattern. I'm not sure what it means to then start > doing fixed partitions in the sublevel. > Think of the scenario where there is an infrastructure owner and the large number of job owners. The aim of the infra owner is to reduce cost by stuffing as many jobs as possible on the same machine while job owners want consistent performance. The job owners usually have meta jobs i.e. a set of small jobs that run on the same machines and they manage these sub-jobs themselves. The infra owner wants to do proactive reclaim to trim the current jobs without impacting their performance and more importantly to have enough memory to land new jobs (We have learned the hard way that depending on global reclaim for memory overcommit is really bad for isolation). In the above scenario the configuration you mentioned might not be sensible is really possible. This is exactly what we have in prod. You can also get the idea why I am asking for flexibility for the cost of proactive reclaim.
On Thu, Oct 08, 2020 at 08:55:57AM -0700, Shakeel Butt wrote: > On Thu, Oct 8, 2020 at 7:55 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Tue, Oct 06, 2020 at 09:55:43AM -0700, Shakeel Butt wrote: > > > On Thu, Oct 1, 2020 at 7:33 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > [snip] > > > > > > So instead of asking users for a target size whose suitability > > > > > > heavily depends on the kernel's LRU implementation, the readahead > > > > > > code, the IO device's capability and general load, why not directly > > > > > > ask the user for a pressure level that the workload is comfortable > > > > > > with and which captures all of the above factors implicitly? Then > > > > > > let the kernel do this feedback loop from a per-cgroup worker. > > > > > > > > > > I am assuming here by pressure level you are referring to the PSI like > > > > > interface e.g. allowing the users to tell about their jobs that X > > > > > amount of stalls in a fixed time window is tolerable. > > > > > > > > Right, essentially the same parameters that psi poll() would take. > > > > > > I thought a bit more on the semantics of the psi usage for the > > > proactive reclaim. > > > > > > Suppose I have a top level cgroup A on which I want to enable > > > proactive reclaim. Which memory psi events should the proactive > > > reclaim should consider? > > > > > > The simplest would be the memory.psi at 'A'. However memory.psi is > > > hierarchical and I would not really want the pressure due limits in > > > children of 'A' to impact the proactive reclaim. > > > > I don't think pressure from limits down the tree can be separated out, > > generally. All events are accounted recursively as well. Of course, we > > remember the reclaim level for evicted entries - but if there is > > reclaim triggered at A and A/B concurrently, the distribution of who > > ends up reclaiming the physical pages in A/B is pretty arbitrary/racy. > > > > If A/B decides to do its own proactive reclaim with the sublimit, and > > ends up consuming the pressure budget assigned to proactive reclaim in > > A, there isn't much that can be done. > > > > It's also possible that proactive reclaim in A keeps A/B from hitting > > its limit in the first place. > > > > I have to say, the configuration doesn't really strike me as sensible, > > though. Limits make sense for doing fixed partitioning: A gets 4G, A/B > > gets 2G out of that. But if you do proactive reclaim on A you're > > essentially saying A as a whole is auto-sizing dynamically based on > > its memory access pattern. I'm not sure what it means to then start > > doing fixed partitions in the sublevel. > > > > Think of the scenario where there is an infrastructure owner and the > large number of job owners. The aim of the infra owner is to reduce > cost by stuffing as many jobs as possible on the same machine while > job owners want consistent performance. > > The job owners usually have meta jobs i.e. a set of small jobs that > run on the same machines and they manage these sub-jobs themselves. > > The infra owner wants to do proactive reclaim to trim the current jobs > without impacting their performance and more importantly to have > enough memory to land new jobs (We have learned the hard way that > depending on global reclaim for memory overcommit is really bad for > isolation). > > In the above scenario the configuration you mentioned might not be > sensible is really possible. This is exactly what we have in prod. I apologize if my statement was worded too broadly. 
I fully understand your motivation and understand the sub job structure. It's more about at which level to run proactive reclaim when there are sub-domains.

You said you're already using a feedback loop to adjust proactive reclaim based on refault rates. How do you deal with this issue today of one subgroup potentially having higher refaults due to a limit?

It appears that as soon as the subgroups can age independently, you also need to treat them independently for proactive reclaim. Because one group hitting its pressure limit says nothing about its sibling.

If you apply equal reclaim on them both based on the independently pressured subjob, you'll under-reclaim the siblings. If you apply equal reclaim on them both based on the unpressured siblings alone, you'll over-pressurize the one with its own limit.

This seems independent of the exact metric you're using, and more about at which level you apply pressure, and whether reclaim subdomains created through a hard limit can be treated as part of a larger shared pool or not.
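To make the pressure-guided idea from this exchange concrete, a userspace loop could combine a PSI trigger on the cgroup's memory.pressure with small writes to the memory.reclaim file proposed by this patch: keep trimming while the workload stays under its stall budget, back off when the trigger fires. The cgroup path, the 16M step and the "some 100ms per 1s" threshold below are illustrative assumptions, not values from this thread.

/*
 * Hedged sketch of a pressure-guided proactive reclaim loop: trim the
 * cgroup via memory.reclaim (proposed in this patch) while its PSI
 * memory pressure stays below a tolerance, and back off once a PSI
 * trigger fires. Paths and numbers are assumptions.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_once(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	const char *trigger = "some 100000 1000000";	/* 100ms stall per 1s window */
	const char *psi_path = "/sys/fs/cgroup/A/memory.pressure";	/* hypothetical */
	const char *reclaim_path = "/sys/fs/cgroup/A/memory.reclaim";	/* hypothetical */
	struct pollfd pfd;
	int psi_fd;

	psi_fd = open(psi_path, O_RDWR | O_NONBLOCK);
	if (psi_fd < 0 || write(psi_fd, trigger, strlen(trigger)) < 0) {
		perror("psi trigger");
		return 1;
	}
	pfd.fd = psi_fd;
	pfd.events = POLLPRI;

	for (;;) {
		/* Block up to 1s waiting for the cgroup to exceed its stall budget. */
		int ret = poll(&pfd, 1, 1000);

		if (ret > 0 && (pfd.revents & POLLPRI)) {
			sleep(10);	/* pressured: back off and let refaults settle */
			continue;
		}
		/* No pressure event in the last window: trim a small slice. */
		if (write_once(reclaim_path, "16M"))
			perror("memory.reclaim");
	}
}

Note that this sketch runs against a single cgroup; as the discussion above points out, subtrees with their own limits would need their own loops rather than being treated as part of one shared pool.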
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6be43781ec7f..58d70b5989d7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.reclaim
+	A write-only file which exists on non-root cgroups.
+
+	This is a simple interface to trigger memory reclaim in the
+	target cgroup. Write the number of bytes to reclaim to this
+	file and the kernel will try to reclaim that much memory.
+	Please note that the kernel can over or under reclaim from
+	the target cgroup.
+
   memory.oom.group
 	A read-write single value file which exists on non-root
 	cgroups. The default value is "0".
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75cd1a1e66c8..2d006c36d7f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+
+		reclaimed = try_to_free_mem_cgroup_pages(memcg,
+						nr_to_reclaim - nr_reclaimed,
+						GFP_KERNEL, true);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6508,6 +6540,11 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "reclaim",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.write = memory_reclaim,
+	},
 	{ }	/* terminate */
 };
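Because the new file is registered with CFTYPE_NS_DELEGATABLE in the hunk above, a latency-tolerant thread of the workload itself can drive reclaim on its own memcg (the per-memcg uswapd use case) without an extra cgroup layer. A minimal sketch follows; the delegated cgroup path, the 64M request and the trigger policy are assumptions for illustration only, and detecting "near the limit" is out of scope here.

/*
 * Minimal uswapd-style sketch: a latency-tolerant thread writes to its
 * own cgroup's memory.reclaim (proposed in this patch) when the
 * application decides it is near its limit. Path and size are assumed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Ask the kernel to reclaim roughly `bytes_str` (e.g. "64M") from our memcg. */
static int memcg_reclaim(const char *bytes_str)
{
	/* hypothetical delegated cgroup of the application */
	int fd = open("/sys/fs/cgroup/job/memory.reclaim", O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, bytes_str, strlen(bytes_str));
	close(fd);
	/* The kernel may reclaim more or less than requested (see the doc hunk). */
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	if (memcg_reclaim("64M"))
		perror("memory.reclaim");
	return 0;
}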
Interface options:
------------------

This patch introduces a very simple memcg interface, 'echo 10M > memory.reclaim', to trigger reclaim in the target memory cgroup. In the future we might want to reclaim a specific type of memory from a memcg, so this interface can be extended to allow that, e.g.:

$ echo 10M [all|anon|file|kmem] > memory.reclaim

However, that should wait until we have concrete use cases for such functionality. Keep things simple for now.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
 mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
 2 files changed, 46 insertions(+)