
[0/3] memcg: Slow down swap allocation as the available space gets depleted

Message ID 20200417010617.927266-1-kuba@kernel.org (mailing list archive)

Message

Jakub Kicinski April 17, 2020, 1:06 a.m. UTC
Tejun describes the problem as follows:

When swap runs out, there's an abrupt change in system behavior -
the anonymous memory suddenly becomes unmanageable which readily
breaks any sort of memory isolation and can bring down the whole
system. To avoid that, oomd [1] monitors free swap space and triggers
kills when it drops below a specific threshold (e.g. 15%).

While this works, it's far from ideal:
 - Depending on IO performance and total swap size, a given
   headroom might be either insufficient or excessive.
 - oomd has to monitor swap depletion in addition to the usual
   pressure metrics and it currently doesn't consider memory.swap.max.

Solve this by adapting the same approach that memory.high uses -
slow down allocation as the resource gets depleted, turning the
depletion behavior from an abrupt cliff into gradual degradation
observable through the memory pressure metric.

[1] https://github.com/facebookincubator/oomd
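
As a rough standalone model of the intended behavior (illustrative only -
the constants and the quadratic ramp below are made up and are not the
code in these patches):

/*
 * Illustrative model (not kernel code): delay grows with the square of how
 * far swap usage is past a "high" threshold, clamped to a maximum, similar
 * in spirit to the memory.high penalty curve.
 */
#include <stdio.h>

#define MAX_DELAY_MS 2000	/* assumed clamp, analogous to memory.high's cap */

static unsigned long swap_penalty_ms(unsigned long usage, unsigned long high)
{
	unsigned long overage, delay;

	if (high == 0 || usage <= high)
		return 0;

	/* overage as a percentage of the threshold */
	overage = (usage - high) * 100 / high;

	/* quadratic ramp: mild at first, steep as depletion continues */
	delay = overage * overage / 10;

	return delay > MAX_DELAY_MS ? MAX_DELAY_MS : delay;
}

int main(void)
{
	unsigned long high = 1000;	/* swap.high in arbitrary units */
	unsigned long usage;

	for (usage = high; usage <= 2 * high; usage += 200)
		printf("usage=%lu -> delay=%lums\n", usage,
		       swap_penalty_ms(usage, high));
	return 0;
}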

Jakub Kicinski (3):
  mm: prepare for swap over-high accounting and penalty calculation
  mm: move penalty delay clamping out of calculate_high_delay()
  mm: automatically penalize tasks with high swap use

 include/linux/memcontrol.h |   4 +
 mm/memcontrol.c            | 166 ++++++++++++++++++++++++++++---------
 2 files changed, 131 insertions(+), 39 deletions(-)

Comments

Shakeel Butt April 17, 2020, 4:11 p.m. UTC | #1
On Thu, Apr 16, 2020 at 6:06 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Tejun describes the problem as follows:
>
> When swap runs out, there's an abrupt change in system behavior -
> the anonymous memory suddenly becomes unmanageable which readily
> breaks any sort of memory isolation and can bring down the whole
> system.

Can you please add more info on this abrupt change in system behavior
and on what you mean by anon memory becoming unmanageable?

Once the system is in global reclaim and swapping, the memory
isolation is already broken. Here I am assuming you are talking about
memcg limit reclaim with overcommitted memcg limits. Shouldn't
running out of swap trigger the OOM earlier, which should be
better than impacting the whole system?

> To avoid that, oomd [1] monitors free swap space and triggers
> kills when it drops below the specific threshold (e.g. 15%).
>
> While this works, it's far from ideal:
>  - Depending on IO performance and total swap size, a given
>    headroom might not be enough or too much.
>  - oomd has to monitor swap depletion in addition to the usual
>    pressure metrics and it currently doesn't consider memory.swap.max.
>
> Solve this by adapting the same approach that memory.high uses -
> slow down allocation as the resource gets depleted turning the
> depletion behavior from abrupt cliff one to gradual degradation
> observable through memory pressure metric.
>
> [1] https://github.com/facebookincubator/oomd
>
> Jakub Kicinski (3):
>   mm: prepare for swap over-high accounting and penalty calculation
>   mm: move penalty delay clamping out of calculate_high_delay()
>   mm: automatically penalize tasks with high swap use
>
>  include/linux/memcontrol.h |   4 +
>  mm/memcontrol.c            | 166 ++++++++++++++++++++++++++++---------
>  2 files changed, 131 insertions(+), 39 deletions(-)
>
> --
> 2.25.2
>
Tejun Heo April 17, 2020, 4:23 p.m. UTC | #2
Hello,

On Fri, Apr 17, 2020 at 09:11:33AM -0700, Shakeel Butt wrote:
> On Thu, Apr 16, 2020 at 6:06 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > Tejun describes the problem as follows:
> >
> > When swap runs out, there's an abrupt change in system behavior -
> > the anonymous memory suddenly becomes unmanageable which readily
> > breaks any sort of memory isolation and can bring down the whole
> > system.
> 
> Can you please add more info on this abrupt change in system behavior
> and what do you mean by anon memory becoming unmanageable?

In the sense that anonymous memory becomes essentially memlocked.

> Once the system is in global reclaim and doing swapping the memory
> isolation is already broken. Here I am assuming you are talking about

There currently are issues with anonymous memory management which make it
different from / worse than page cache, but I don't follow why swapping
necessarily means that isolation is broken. Page refaults don't indicate
that memory isolation is broken, after all.

> memcg limit reclaim and memcg limits are overcommitted. Shouldn't
> running out of swap will trigger the OOM earlier which should be
> better than impacting the whole system.

The primary scenario which was being considered was undercommitted
protections, but I don't think that makes any relevant difference.

This is directly analogous to the delay injection for memory.high. What's desired
is slowing down the workload as the available resource is depleted so that
the resource shortage presents as gradual degradation of performance and
matching increase in resource PSI. This allows the situation to be detected
and handled from userland while avoiding sudden and unpredictable behavior
changes.

Thanks.
Shakeel Butt April 17, 2020, 5:18 p.m. UTC | #3
Hi Tejun,

On Fri, Apr 17, 2020 at 9:23 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Apr 17, 2020 at 09:11:33AM -0700, Shakeel Butt wrote:
> > On Thu, Apr 16, 2020 at 6:06 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > Tejun describes the problem as follows:
> > >
> > > When swap runs out, there's an abrupt change in system behavior -
> > > the anonymous memory suddenly becomes unmanageable which readily
> > > breaks any sort of memory isolation and can bring down the whole
> > > system.
> >
> > Can you please add more info on this abrupt change in system behavior
> > and what do you mean by anon memory becoming unmanageable?
>
> In the sense that anonymous memory becomes essentially memlocked.
>
> > Once the system is in global reclaim and doing swapping the memory
> > isolation is already broken. Here I am assuming you are talking about
>
> There currently are issues with anonymous memory management which makes them
> different / worse than page cache but I don't follow why swapping
> necessarily means that isolation is broken. Page refaults don't indicate
> that memory isolation is broken after all.
>

Sorry, I meant the performance isolation. Direct reclaim does not
really differentiate who to stall and whose CPU to use.

> > memcg limit reclaim and memcg limits are overcommitted. Shouldn't
> > running out of swap will trigger the OOM earlier which should be
> > better than impacting the whole system.
>
> The primary scenario which was being considered was undercommitted
> protections but I don't think that makes any relevant differences.
>

What is undercommitted protections? Does it mean there is still swap
available on the system but the memcg is hitting its swap limit?

> This is exactly similar to delay injection for memory.high. What's desired
> is slowing down the workload as the available resource is depleted so that
> the resource shortage presents as gradual degradation of performance and
> matching increase in resource PSI. This allows the situation to be detected
> and handled from userland while avoiding sudden and unpredictable behavior
> changes.
>

Let me try to understand this with an example. Memcg 'A' has
memory.high = 100 MiB, memory.max = 150 MiB and memory.swap.max = 50
MiB. When A's usage goes over 100 MiB, it will reclaim the anon, file
and kmem. The anon will go to swap and increase its swap usage until
it hits the limit. Now the 'A' reclaim_high has fewer things (file &
kmem) to reclaim but the mem_cgroup_handle_over_high() will keep A's
increase in usage in check.
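
(Purely as an illustration of the setup above - the cgroup path and this
helper are hypothetical, not part of the series:)

/* Hypothetical sketch: set up the memcg 'A' described above via cgroup v2. */
#include <stdio.h>
#include <stdlib.h>

static void cg_write(const char *file, const char *val)
{
	char path[256];
	FILE *f;

	/* assumes the cgroup /sys/fs/cgroup/A already exists */
	snprintf(path, sizeof(path), "/sys/fs/cgroup/A/%s", file);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%s\n", val);
	fclose(f);
}

int main(void)
{
	cg_write("memory.high", "104857600");		/* 100 MiB */
	cg_write("memory.max", "157286400");		/* 150 MiB */
	cg_write("memory.swap.max", "52428800");	/*  50 MiB */
	return 0;
}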

So, my question is: should the slowdown by memory.high depend on the
reclaimable memory? If there is no reclaimable memory and the job hits
memory.high, should the kernel slow it down to a crawl until the PSI
monitor comes and decides what to do? If I understand correctly, the
problem is that the kernel slowdown is not effective when reclaimable
memory is very low. Please correct me if I am wrong.

Shakeel
Tejun Heo April 17, 2020, 5:36 p.m. UTC | #4
Hello,

On Fri, Apr 17, 2020 at 10:18:15AM -0700, Shakeel Butt wrote:
> > There currently are issues with anonymous memory management which makes them
> > different / worse than page cache but I don't follow why swapping
> > necessarily means that isolation is broken. Page refaults don't indicate
> > that memory isolation is broken after all.
> 
> Sorry, I meant the performance isolation. Direct reclaim does not
> really differentiate who to stall and whose CPU to use.

Can you please elaborate on concrete scenarios? I'm having a hard time seeing
differences from page cache.

> > > memcg limit reclaim and memcg limits are overcommitted. Shouldn't
> > > running out of swap will trigger the OOM earlier which should be
> > > better than impacting the whole system.
> >
> > The primary scenario which was being considered was undercommitted
> > protections but I don't think that makes any relevant differences.
> >
> 
> What is undercommitted protections? Does it mean there is still swap
> available on the system but the memcg is hitting its swap limit?

Hahaha, I assumed you were talking about memory.high/max and was saying that
the primary scenario being considered was the usage of memory.low
interacting with swap. Again, can you please give a concrete example so
that we don't misunderstand each other?

> > This is exactly similar to delay injection for memory.high. What's desired
> > is slowing down the workload as the available resource is depleted so that
> > the resource shortage presents as gradual degradation of performance and
> > matching increase in resource PSI. This allows the situation to be detected
> > and handled from userland while avoiding sudden and unpredictable behavior
> > changes.
> >
> 
> Let me try to understand this with an example. Memcg 'A' has

Ah, you already went there. Great.

> memory.high = 100 MiB, memory.max = 150 MiB and memory.swap.max = 50
> MiB. When A's usage goes over 100 MiB, it will reclaim the anon, file
> and kmem. The anon will go to swap and increase its swap usage until
> it hits the limit. Now the 'A' reclaim_high has fewer things (file &
> kmem) to reclaim but the mem_cgroup_handle_over_high() will keep A's
> increase in usage in check.
> 
> So, my question is: should the slowdown by memory.high depends on the
> reclaimable memory? If there is no reclaimable memory and the job hits
> memory.high, should the kernel slow it down to crawl until the PSI
> monitor comes and decides what to do. If I understand correctly, the
> problem is the kernel slow down is not successful when reclaimable
> memory is very low. Please correct me if I am wrong.

In combination with memory.high, swap slowdown may not be necessary because
memory.high's slowdown mechanism is already there to handle the "can't swap"
scenario, whether that's because swap is disabled wholesale, limited or
depleted. However, please consider the following scenario.

cgroup A has memory.low protection and no other restrictions. cgroup B has
no protection and has access to swap. When B's memory starts bloating and
puts the system under memory contention, it'll start consuming swap until it
can't. When swap becomes depleted for B, there's nothing holding it back and
B will start eating into A's protection.

The proposed mechanism just plugs another vector for the same condition
where anonymous memory management breaks down because anonymous pages can
no longer be reclaimed due to swap unavailability.

Thanks.
Shakeel Butt April 17, 2020, 5:51 p.m. UTC | #5
On Fri, Apr 17, 2020 at 10:36 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Apr 17, 2020 at 10:18:15AM -0700, Shakeel Butt wrote:
> > > There currently are issues with anonymous memory management which makes them
> > > different / worse than page cache but I don't follow why swapping
> > > necessarily means that isolation is broken. Page refaults don't indicate
> > > that memory isolation is broken after all.
> >
> > Sorry, I meant the performance isolation. Direct reclaim does not
> > really differentiate who to stall and whose CPU to use.
>
> Can you please elaborate concrete scenarios? I'm having a hard time seeing
> differences from page cache.
>

Oh I was talking about the global reclaim here. In global reclaim, any
task can be throttled (throttle_direct_reclaim()). Memory freed using
the CPU of high-priority, low-latency jobs can be stolen by low-priority
batch jobs.

> > > > memcg limit reclaim and memcg limits are overcommitted. Shouldn't
> > > > running out of swap will trigger the OOM earlier which should be
> > > > better than impacting the whole system.
> > >
> > > The primary scenario which was being considered was undercommitted
> > > protections but I don't think that makes any relevant differences.
> > >
> >
> > What is undercommitted protections? Does it mean there is still swap
> > available on the system but the memcg is hitting its swap limit?
>
> Hahaha, I assumed you were talking about memory.high/max and was saying that
> the primary scenarios that were being considered was usage of memory.low
> interacting with swap. Again, can you please give an concrete example so
> that we don't misunderstand each other?
>
> > > This is exactly similar to delay injection for memory.high. What's desired
> > > is slowing down the workload as the available resource is depleted so that
> > > the resource shortage presents as gradual degradation of performance and
> > > matching increase in resource PSI. This allows the situation to be detected
> > > and handled from userland while avoiding sudden and unpredictable behavior
> > > changes.
> > >
> >
> > Let me try to understand this with an example. Memcg 'A' has
>
> Ah, you already went there. Great.
>
> > memory.high = 100 MiB, memory.max = 150 MiB and memory.swap.max = 50
> > MiB. When A's usage goes over 100 MiB, it will reclaim the anon, file
> > and kmem. The anon will go to swap and increase its swap usage until
> > it hits the limit. Now the 'A' reclaim_high has fewer things (file &
> > kmem) to reclaim but the mem_cgroup_handle_over_high() will keep A's
> > increase in usage in check.
> >
> > So, my question is: should the slowdown by memory.high depends on the
> > reclaimable memory? If there is no reclaimable memory and the job hits
> > memory.high, should the kernel slow it down to crawl until the PSI
> > monitor comes and decides what to do. If I understand correctly, the
> > problem is the kernel slow down is not successful when reclaimable
> > memory is very low. Please correct me if I am wrong.
>
> In combination with memory.high, swap slowdown may not be necessary because
> memory.high's slow down mechanism is already there to handle "can't swap"
> scenario whether that's because swap is disabled wholesale, limited or
> depleted. However, please consider the following scenario.
>
> cgroup A has memory.low protection and no other restrictions. cgroup B has
> no protection and has access to swap. When B's memory starts bloating and
> gets the system under memory contention, it'll start consuming swap until it
> can't. When swap becomes depleted for B, there's nothing holding it back and
> B will start eating into A's protection.
>

In this example, does 'B' have memory.high and memory.max set? And by A
having no other restrictions, I am assuming you mean unlimited high
and max for A? Can 'A' use memory.min?

> The proposed mechanism just plugs another vector for the same condition
> where anonymous memory management breaks down because they can no longer be
> reclaimed due to swap unavailability.
>
> Thanks.
>
> --
> tejun
Tejun Heo April 17, 2020, 7:35 p.m. UTC | #6
Hello,

On Fri, Apr 17, 2020 at 10:51:10AM -0700, Shakeel Butt wrote:
> > Can you please elaborate concrete scenarios? I'm having a hard time seeing
> > differences from page cache.
> 
> Oh I was talking about the global reclaim here. In global reclaim, any
> task can be throttled (throttle_direct_reclaim()). Memory freed by
> using the CPU of high priority low latency jobs can be stolen by low
> priority batch jobs.

I'm still having a hard time following this thread of discussion, most
likely because my knowledge of mm is fleeting at best. Can you please ELI5
why the above is specifically relevant to this discussion?

I'm gonna list two things that come to my mind just in case that'd help
reduce the back and forth.

* With protection based configurations, protected cgroups wouldn't usually
  go into direct reclaim themselves all that much.

* We do have holes in accounting CPU cycles used by reclaim to the origins,
  which, for example, prevents making memory.high reclaim async and lets
  memory pressure contaminate cpu isolation possibly to a significant degree
  on lower core count machines in some scenarios, but that's a separate
  issue we need to address in the future.

> > cgroup A has memory.low protection and no other restrictions. cgroup B has
> > no protection and has access to swap. When B's memory starts bloating and
> > gets the system under memory contention, it'll start consuming swap until it
> > can't. When swap becomes depleted for B, there's nothing holding it back and
> > B will start eating into A's protection.
> >
> 
> In this example does 'B' have memory.high and memory.max set and by A

B doesn't have anything set.

> having no other restrictions, I am assuming you meant unlimited high
> and max for A? Can 'A' use memory.min?

Sure, it can but 1. the purpose of the example is illustrating the
imcompleteness of the existing mechanism 2. there's a big difference between
letting the machine hit the wall and waiting for the kernel OOM to trigger
and being able to monitor the situation as it gradually develops and respond
to it, which is the whole point of the low/high mechanisms.

Thanks.
Shakeel Butt April 17, 2020, 9:51 p.m. UTC | #7
On Fri, Apr 17, 2020 at 12:35 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Apr 17, 2020 at 10:51:10AM -0700, Shakeel Butt wrote:
> > > Can you please elaborate concrete scenarios? I'm having a hard time seeing
> > > differences from page cache.
> >
> > Oh I was talking about the global reclaim here. In global reclaim, any
> > task can be throttled (throttle_direct_reclaim()). Memory freed by
> > using the CPU of high priority low latency jobs can be stolen by low
> > priority batch jobs.
>
> I'm still having a hard time following this thread of discussion, most
> likely because my knoweldge of mm is fleeting at best. Can you please ELI5
> why the above is specifically relevant to this discussion?
>

No, it is not relevant to this discussion "now". The mention of
performance isolation in my first email was mostly due to my lack of
understanding about what problem this patch series is trying to solve.
So, let's skip this topic.

> I'm gonna list two things that come to my mind just in case that'd help
> reducing the back and forth.
>
> * With protection based configurations, protected cgroups wouldn't usually
>   go into direct reclaim themselves all that much.
>
> * We do have holes in accounting CPU cycles used by reclaim to the orgins,
>   which, for example, prevents making memory.high reclaim async and lets
>   memory pressure contaminate cpu isolation possibly to a significant degree
>   on lower core count machines in some scenarios, but that's a separate
>   issue we need to address in the future.
>

I have an opinion on the above but I will refrain, as those are not
relevant to the patch series.

> > > cgroup A has memory.low protection and no other restrictions. cgroup B has
> > > no protection and has access to swap. When B's memory starts bloating and
> > > gets the system under memory contention, it'll start consuming swap until it
> > > can't. When swap becomes depleted for B, there's nothing holding it back and
> > > B will start eating into A's protection.
> > >
> >
> > In this example does 'B' have memory.high and memory.max set and by A
>
> B doesn't have anything set.
>
> > having no other restrictions, I am assuming you meant unlimited high
> > and max for A? Can 'A' use memory.min?
>
> Sure, it can but 1. the purpose of the example is illustrating the
> imcompleteness of the existing mechanism

I understand, but is this a real-world configuration people use, and do
we want to support the scenario where, without setting high/max, the
kernel still guarantees the isolation?

> 2. there's a big difference between
> letting the machine hit the wall and waiting for the kernel OOM to trigger
> and being able to monitor the situation as it gradually develops and respond
> to it, which is the whole point of the low/high mechanisms.
>

I am not really against the proposed solution. What I am trying to see
is if this problem is more general than an anon/swap-full problem and
if a more general solution is possible. To me it seems like, whenever
a large portion of reclaimable memory (anon, file or kmem) becomes
non-reclaimable abruptly, the memory isolation can be broken. You gave
the anon/swap-full example, let me see if I can come up with file and
kmem examples (with similar A & B).

1) B has a lot of page cache which temporarily gets pinned for RDMA or
something and the system gets low on memory. B can attack A's low
protected memory as B's page cache is temporarily not reclaimable.

2) B has a lot of dentries/inodes but someone has taken a write lock
on shrinker_rwsem and got stuck in allocation/reclaim or got CPU
preempted. B can attack A's low protected memory as B's slabs are
temporarily not reclaimable.

I think the aim is to slow down B enough to give the PSI monitor a
chance to act before either B targets A's protected memory or the
kernel triggers oom-kill.

My question is: do we really want to solve the issue without limiting B
through high/max? Also, isn't fine-grained PSI monitoring, along with
limiting B through memory.[high|max], general enough to solve all three
example scenarios?

thanks,
Shakeel
Tejun Heo April 17, 2020, 10:59 p.m. UTC | #8
Hello, Shakeel.

On Fri, Apr 17, 2020 at 02:51:09PM -0700, Shakeel Butt wrote:
> > > In this example does 'B' have memory.high and memory.max set and by A
> >
> > B doesn't have anything set.
> >
> > > having no other restrictions, I am assuming you meant unlimited high
> > > and max for A? Can 'A' use memory.min?
> >
> > Sure, it can but 1. the purpose of the example is illustrating the
> > imcompleteness of the existing mechanism
> 
> I understand but is this a real world configuration people use and do
> we want to support the scenario where without setting high/max, the
> kernel still guarantees the isolation.

Yes, that's the configuration we're deploying fleet-wide and at least the
direction I'm gonna be pushing towards for reasons of generality and ease of
use.

Here's an example to illustrate the point - consider distros or upstream
desktop environments wanting to provide basic resource configuration to
protect user sessions and critical system services needed for user
interaction by default. That is something which is clearly and immediately
useful but also is extremely challenging to achieve with limits.

There are no universally good enough upper limits. Any one number is gonna
be both too high to guarantee protection and too low for use cases which
legitimately need that much memory. That's because the upper limits aren't
work-conserving and have a high chance of doing harm when misconfigured,
which makes figuring out the correct configuration almost impossible with
per-use-case manual tuning.

The whole idea behind memory.low and related efforts is resolving that
problem by making memory control more work-conserving and forgiving, so that
users can say something like "I want the user session to have at least 25%
memory protected if needed and possible" and get most of the benefits of
carefully crafted configuration. We're already deploying such configuration
and it works well enough for a wide variety of workloads.

> > 2. there's a big difference between
> > letting the machine hit the wall and waiting for the kernel OOM to trigger
> > and being able to monitor the situation as it gradually develops and respond
> > to it, which is the whole point of the low/high mechanisms.
> 
> I am not really against the proposed solution. What I am trying to see
> is if this problem is more general than an anon/swap-full problem and
> if a more general solution is possible. To me it seems like, whenever
> a large portion of reclaimable memory (anon, file or kmem) becomes
> non-reclaimable abruptly, the memory isolation can be broken. You gave
> the anon/swap-full example, let me see if I can come up with file and
> kmem examples (with similar A & B).
> 
> 1) B has a lot of page cache but temporarily gets pinned for rdma or
> something and the system gets low on memory. B can attack A's low
> protected memory as B's page cache is not reclaimable temporarily.
> 
> 2) B has a lot of dentries/inodes but someone has taken a write lock
> on shrinker_rwsem and got stuck in allocation/reclaim or CPU
> preempted. B can attack A's low protected memory as B's slabs are not
> reclaimable temporarily.
> 
> I think the aim is to slow down B enough to give the PSI monitor a
> chance to act before either B targets A's protected memory or the
> kernel triggers oom-kill.
> 
> My question is do we really want to solve the issue without limiting B
> through high/max? Also isn't fine grained PSI monitoring along with
> limiting B through memory.[high|max] general enough to solve all three
> example scenarios?

Yes, we definitely want to solve the issue without involving high and max. I
hope that part is clear now. As for whether we want to cover niche cases
such as RDMA pinning a large swath of page cache, I don't know, maybe? But I
don't think that's a problem of comparable importance, especially given
that in both cases you listed the problem is temporary and the workload
wouldn't have the ability to keep expanding undeterred.

Thanks.
Shakeel Butt April 20, 2020, 4:12 p.m. UTC | #9
Hi Tejun,

On Fri, Apr 17, 2020 at 3:59 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Shakeel.
>
> On Fri, Apr 17, 2020 at 02:51:09PM -0700, Shakeel Butt wrote:
> > > > In this example does 'B' have memory.high and memory.max set and by A
> > >
> > > B doesn't have anything set.
> > >
> > > > having no other restrictions, I am assuming you meant unlimited high
> > > > and max for A? Can 'A' use memory.min?
> > >
> > > Sure, it can but 1. the purpose of the example is illustrating the
> > > imcompleteness of the existing mechanism
> >
> > I understand but is this a real world configuration people use and do
> > we want to support the scenario where without setting high/max, the
> > kernel still guarantees the isolation.
>
> Yes, that's the configuration we're deploying fleet-wide and at least the
> direction I'm gonna be pushing towards for reasons of generality and ease of
> use.
>
> Here's an example to illustrate the point - consider distros or upstream
> desktop environments wanting to provide basic resource configuration to
> protect user sessions and critical system services needed for user
> interaction by default. That is something which is clearly and immediately
> useful but also is extremely challenging to achieve with limits.
>
> There are no universally good enough upper limits. Any one number is gonna
> be both too high to guarantee protection and too low for use cases which
> legitimately need that much memory. That's because the upper limits aren't
> work-conserving and have a high chance of doing harm when misconfigured
> making figuring out the correct configuration almost impossible with
> per-use-case manual tuning.
>
> The whole idea behind memory.low and related efforts is resolving that
> problem by making memory control more work-conserving and forgiving, so that
> users can say something like "I want the user session to have at least 25%
> memory protected if needed and possible" and get most of the benefits of
> carefully crafted configuration. We're already deploying such configuration
> and it works well enough for a wide variety of workloads.
>

I got the high-level vision, but I am very skeptical that in terms of
memory and performance isolation this can provide anything better than
best-effort QoS, which might be good enough for desktop users. However,
for a server environment where multiple latency-sensitive interactive
jobs are co-hosted with multiple batch jobs and the machine's memory
may be over-committed, this is a recipe for disaster. The only
scenario where I think it might work is if there is only one job
running on the machine.

I do agree that finding the right upper limit is a challenge. For us,
we have two types of users: the first know exactly how many resources
they want, and the second ask us to set the limits appropriately.
We have an ML/history-based central system to dynamically set and
adjust limits for jobs of the latter kind.

Coming back to this patch series, to me it seems like the series
is contrary to the vision you are presenting. Though the users are not
setting memory.[high|max], they are setting swap.max, and this
series asks them to set one more tunable, i.e. swap.high. The approach
more consistent with the presented vision would be to throttle or slow down
the allocators when the system swap is nearly full, with no need
to set swap.max or swap.high.

thanks,
Shakeel
Tejun Heo April 20, 2020, 4:47 p.m. UTC | #10
Hello,

On Mon, Apr 20, 2020 at 09:12:54AM -0700, Shakeel Butt wrote:
> I got the high level vision but I am very skeptical that in terms of
> memory and performance isolation this can provide anything better than
> best effort QoS which might be good enough for desktop users. However,

I don't see that big a gap between desktop and server use cases. There sure
are some tolerance differences, but for the majority of use cases that is a
permeable boundary. I believe I can see where you're coming from and think
that it'd be difficult to convince you out of the skepticism without
concretely demonstrating the contrary, which we're actively working on.

A directional point I wanna emphasize tho is that siloing these solutions
into special "professional" only use is an easy pitfall which often obscures
bigger possibilities and leads to developmental dead-ends and obsolescence.
I believe it's a tendency which should be actively resisted and fought
against. Servers really aren't all that special.

> for a server environment where multiple latency sensitive interactive
> jobs are co-hosted with multiple batch jobs and the machine's memory
> may be over-committed, this is a recipe for disaster. The only
> scenario where I think it might work is if there is only one job
> running on the machine.

Obviously, you can't overcommit on any resources for critical
latency-sensitive workloads, whether one or multiple, but there also are other types
of workloads which can be flexible with resource availability.

> I do agree that finding the right upper limit is a challenge. For us,
> we have two types of users, first, who knows exactly how much
> resources they want and second ask us to set the limits appropriately.
> We have a ML/history based central system to dynamically set and
> adjust limits for jobs of such users.
> 
> Coming back to this path series, to me, it seems like the patch series
> is contrary to the vision you are presenting. Though the users are not
> setting memory.[high|max] but they are setting swap.max and this
> series is asking to set one more tunable i.e. swap.high. The approach
> more consistent with the presented vision is to throttle or slow down
> the allocators when the system swap is near full and there is no need
> to set swap.max or swap.high.

It's a piece of the puzzle to make memory protection work comprehensively.
You can argue that the fact swap isn't protection based is against the
direction, but I find that argument rather facetious as swap is quite a
different resource from memory and it's not like I'm saying limits shouldn't
be used at all. There sure still are missing pieces - i.e. slowing down on
global depletion - but that doesn't mean swap.high isn't useful.

Thanks.
Michal Hocko April 20, 2020, 5:03 p.m. UTC | #11
On Mon 20-04-20 12:47:40, Tejun Heo wrote:
[...]
> > Coming back to this path series, to me, it seems like the patch series
> > is contrary to the vision you are presenting. Though the users are not
> > setting memory.[high|max] but they are setting swap.max and this
> > series is asking to set one more tunable i.e. swap.high. The approach
> > more consistent with the presented vision is to throttle or slow down
> > the allocators when the system swap is near full and there is no need
> > to set swap.max or swap.high.

I have the same impression as Shakeel here. The overall information we
have here is really scarce.
 
> It's a piece of the puzzle to make memory protection work comprehensively.
> You can argue that the fact swap isn't protection based is against the
> direction but I find that argument rather facetious as swap is quite
> different resource from memory and it's not like I'm saying limits shouldn't
> be used at all. There sure still are missing pieces - ie. slowing down on
> global depletion, but that doesn't mean swap.high isn't useful.

I have asked about the semantics of this knob already and didn't really
get any real answer. So how does swap.high fit into the high limit semantics
when it doesn't act as a limit? Considering that we cannot reclaim swap
space, I find this really hard to grasp.

We definitely need more information here!
Tejun Heo April 20, 2020, 5:06 p.m. UTC | #12
Hello,

On Mon, Apr 20, 2020 at 07:03:18PM +0200, Michal Hocko wrote:
> I have asked about the semantic of this know already and didn't really
> get any real answer. So how does swap.high fit into high limit semantic
> when it doesn't act as a limit. Considering that we cannot reclaim swap
> space I find this really hard to grasp.

The memory.high slowdown is for the case when memory reclaim can't be depended
upon for throttling, right? This is the same. Swap can't be reclaimed, so the
backpressure is applied by slowing down the source, the same way memory.high
does.

It fits together with memory.low in that it prevents runaway anon allocation
when swap can't be allocated anymore. It's addressing the same problem that
memory.high slowdown does. It's just a different vector.

Thanks.
Michal Hocko April 21, 2020, 11:06 a.m. UTC | #13
On Mon 20-04-20 13:06:50, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 20, 2020 at 07:03:18PM +0200, Michal Hocko wrote:
> > I have asked about the semantic of this know already and didn't really
> > get any real answer. So how does swap.high fit into high limit semantic
> > when it doesn't act as a limit. Considering that we cannot reclaim swap
> > space I find this really hard to grasp.
> 
> memory.high slow down is for the case when memory reclaim can't be depended
> upon for throttling, right? This is the same. Swap can't be reclaimed so the
> backpressure is applied by slowing down the source, the same way memory.high
> does.

Hmm, but the two differ quite considerably in that we do not reclaim any
swap, which means that while having no reclaimable memory at all is pretty
much a corner case (essentially OOM), having no reclaimable swap is always
the state here. So whenever you hit the high limit there is no other way
than to rely on userspace to unmap swap-backed memory or increase the limit.
Without that there is always throttling. The question also is: what do
you want to throttle in that case? Any swap-backed allocation or swap-based
reclaim? The patch throttles any allocation unless I am misreading. This
means that any other, !swap-backed allocations also get throttled as soon
as the swap quota is reached. Is this really desirable behavior? I would
find it quite surprising, to say the least.

I am also not sure about the isolation aspect, because external
memory pressure might have pushed memory out to swap, and then the
workload gets throttled based on an external event. Compare that to the
memory.high throttling, which is not directly affected by the external
pressure.

There is also an aspect of non-determinism. There is no control over
the file vs. swap-backed reclaim decision for memcgs. That means that
the behavior is going to be very dependent on the internal implementation
of the reclaim. More swapping is going to fill up the swap quota more quickly.

> It fits together with memory.low in that it prevents runaway anon allocation
> when swap can't be allocated anymore. It's addressing the same problem that
> memory.high slowdown does. It's just a different vector.

I suspect that the problem is more related to the swap being handled as
a separate resource. And it is still not clear to me why it is easier
for you to tune swap.high than memory.high. You have said that you do
not want to set up memory.high because it is harder to tune but I do
not see why swap is easier in this regard. Maybe it is just that
swap is almost never used, so a bad estimate is much easier to tolerate
and you really do care about runaways?
Johannes Weiner April 21, 2020, 2:27 p.m. UTC | #14
On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote:
> On Mon 20-04-20 13:06:50, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Apr 20, 2020 at 07:03:18PM +0200, Michal Hocko wrote:
> > > I have asked about the semantic of this know already and didn't really
> > > get any real answer. So how does swap.high fit into high limit semantic
> > > when it doesn't act as a limit. Considering that we cannot reclaim swap
> > > space I find this really hard to grasp.
> > 
> > memory.high slow down is for the case when memory reclaim can't be depended
> > upon for throttling, right? This is the same. Swap can't be reclaimed so the
> > backpressure is applied by slowing down the source, the same way memory.high
> > does.
> 
> Hmm, but the two differ quite considerably that we do not reclaim any
> swap which means that while no reclaimable memory at all is pretty much
> the corner case (essentially OOM) the no reclaimable swap is always in
> that state. So whenever you hit the high limit there is no other way
> then rely on userspace to unmap swap backed memory or increase the limit.

This is similar to memory.high. The memory.high throttling kicks in
when reclaim is NOT keeping up with the allocation rate. There may be some
form of reclaim going on, but it's not bucking the trend, so you also
rely on userspace to free memory voluntarily or increase the limit -
or, of course, on the throttling sleeps growing until oomd kicks in.

> Without that there is always throttling. The question also is what do
> you want to throttle in that case? Any swap backed allocation or swap
> based reclaim? The patch throttles any allocations unless I am
> misreading. This means that also any other !swap backed allocations get
> throttled as soon as the swap quota is reached. Is this really desirable
> behavior? I would find it quite surprising to say the least.

When cache or slab allocations enter reclaim, they also swap.

We *could* be looking at whether there are actual anon pages on the LRU
lists at this point. But I don't think it matters in practice - please
read on below.

> I am also not sure about the isolation aspect. Because an external
> memory pressure might have pushed out memory to the swap and then the
> workload is throttled based on an external event. Compare that to the
> memory.high throttling which is not directly affected by the external
> pressure.

Neither memory.high nor swap.high isolate from external pressure. They
are put on cgroups so they don't cause pressure on other cgroups. Swap
is required when either your footprint grows or your available space
shrinks. That's why it behaves like that.

That being said, I think we're getting lost in the implementation
details before we have established what the purpose of this all
is. Let's talk about this first.

Just imagine we had a really slow swap device. Some spinning disk that
is terrible at random IO. From a performance point of view, this would
obviously suck. But from a resource management point of view, this is
actually pretty useful in slowing down a workload that is growing
unsustainably. This is so useful, in fact, that Virtuozzo implemented
virtual swap devices that are artificially slow to emulate this type
of "punishment".

A while ago, we didn't have any swap configured. We set memory.high
and things were good: when things would go wrong and the workload
expanded beyond reclaim capabilities, memory.high would inject sleeps
until oomd would take care of the workload.

Remember that the point is to avoid the kernel OOM killer and do OOM
handling in userspace. That's the difference between memory.high and
memory.max as well.

However, in many cases we now want to overcommit more aggressively
than memory.high would allow us. For this purpose, we're switching to
memory.low, to only enforce limits when *physical* memory is
short. And we've added swap to have some buffer zone at the edge of
this aggressive overcommit.

But swap has been a good news, bad news situation. The good news is
that we have really fast swap, so if the workload is only temporarily
a bit over RAM capacity, we can swap a few colder anon pages to tide
the workload over, without the workload even noticing. This is
fantastic from a performance point of view. It effectively increases
our amount of available memory or the workingset sizes we can support.

But the bad news is also that we have really fast swap. If we have a
misbehaving workload that has a malloc() problem, we can *exhaust*
swap space very, very quickly. Where we previously had those nice
gradual slowdowns from memory.high when reclaim was failing, we now
have very powerful reclaim that can swap at hundreds of megabytes per
second - until swap is suddenly full and reclaim abruptly falls apart.

So while fast swap is an enhancement to our memory capacity, it
doesn't reliably act as that overcommit crumble zone that memory.high
or slower swap devices used to give us.

Should we replace those fast SSDs with crappy disks instead to achieve
this effect? Or add a slow disk as a secondary swap device once the
fast one is full? That would give us the desired effect, but obviously
it would be kind of silly.

That's where swap.high comes in. It gives us the performance of a fast
drive during temporary dips into the overcommit buffer, while also
providing that large rubber band kind of slowdown of a slow drive when
the workload is expanding at an unsustainable trend.

> There is also an aspect of non-determinism. There is no control over
> the file vs. swap backed reclaim decision for memcgs. That means that
> behavior is going to be very dependent on the internal implementation of
> the reclaim. More swapping is going to fill up swap quota quicker.

Haha, I mean that implies that reclaim is arbitrary. While it's
certainly not perfect, we're trying to reclaim the pages that are
least likely to be used again in the future. There is noise in this
heuristic, obviously, but it's still going to correlate with reality
and provide some level of determinism.

The same is true for memory.high, btw. Depending on how effective
reclaim is, we're going to throttle more or less. That's also going to
fluctuate somewhat around implementation changes.

> > It fits together with memory.low in that it prevents runaway anon allocation
> > when swap can't be allocated anymore. It's addressing the same problem that
> > memory.high slowdown does. It's just a different vector.
> 
> I suspect that the problem is more related to the swap being handled as
> a separate resource. And it is still not clear to me why it is easier
> for you to tune swap.high than memory.high. You have said that you do
> not want to set up memory.high because it is harder to tune but I do
> not see why swap is easier in this regards. Maybe it is just that the
> swap is almost never used so a bad estimate is much easier to tolerate
> and you really do care about runaways?

You hit the nail on the head.

We don't want memory.high (in most cases) because we want to utilize
memory to the absolute maximum.

Obviously, the same isn't true for swap because there is no DAX and
most workloads can't run when 80% of their workingset is on swap.

They're not interchangeable resources.

So yes, swap only needs to be roughly sized, and we want to catch
runaways. Sometimes they are caught by the IO slowdown, but for some
access patterns the IO is too efficient and we need a bit of help when
we're coming up against that wall. And we don't really care about not
utilizing swap capacity to the absolute max.

[ Hopefully that also answers your implementation questions above a
  bit better. We could be more specific about which allocations to
  slow down, only slow down if there are actual anon pages etc. But
  our goal isn't to emulate a realistic version of a slow swap device,
  we just want the overcommit crumble zone they provide. ]
Tejun Heo April 21, 2020, 3:20 p.m. UTC | #15
Hello,

On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote:
> I suspect that the problem is more related to the swap being handled as
> a separate resource. And it is still not clear to me why it is easier
> for you to tune swap.high than memory.high. You have said that you do
> not want to set up memory.high because it is harder to tune but I do
> not see why swap is easier in this regards. Maybe it is just that the
> swap is almost never used so a bad estimate is much easier to tolerate
> and you really do care about runaways?

Johannes responded a lot better. I'm just gonna add a bit here.

Swap is intertwined with memory but is a very different resource from
memory. You can't seriously equate primary and secondary storage. We never
want to underutilize memory, but we never want to completely fill up
secondary storage. They're exactly the opposite in that sense. It's not that
protection schemes can't apply to swap but that such a level of dynamic
control isn't required because a simple upper limit is useful and easy enough.

Another backing point I want to raise is that the abrupt transition which
happens on swap depletion is a real problem that userspace has been trying
to work around. memory.low based protection and oomd is an obvious example
but not the only one. earlyoom[1] is an independent project which predates
all these things and kills when swap runs low to protect the system from
going down the gutter.

In this respect, both oomd and earlyoom basically do the same thing but
they're racing against the kernel filling up the space. Once the swap space
is gone, the programs themselves might not be able to make reasonable
forward progress. The only measure they can currently employ is polling more
frequently and killing earlier so that swap space never actually runs out,
but it's a silly and losing game as the underlying device gets faster and
faster.

Note that at least Fedora is considering including either earlyoom or oomd
by default. The problem addressed by swap.high is real and immediate.

Thanks.
Michal Hocko April 21, 2020, 4:11 p.m. UTC | #16
On Tue 21-04-20 10:27:46, Johannes Weiner wrote:
> On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote:
[...]
> > I am also not sure about the isolation aspect. Because an external
> > memory pressure might have pushed out memory to the swap and then the
> > workload is throttled based on an external event. Compare that to the
> > memory.high throttling which is not directly affected by the external
> > pressure.
> 
> Neither memory.high nor swap.high isolate from external pressure.

I didn't say they do. What I am saying is that external pressure
might punish a swap.high memcg because the external memory pressure would
eat up the quota and trigger the throttling.

It is fair to say that this externally triggered interference is already
possible with swap.max as well though. It would likely be just more
verbose because of the oom killer intervention rather than a slowdown.

> They
> are put on cgroups so they don't cause pressure on other cgroups. Swap
> is required when either your footprint grows or your available space
> shrinks. That's why it behaves like that.
> 
> That being said, I think we're getting lost in the implementation
> details before we have established what the purpose of this all
> is. Let's talk about this first.

Thanks for describing it at length. I have a better picture of the
intention (this should have been in the changelog, ideally). I can see
how the swap consumption throttling might be useful, but I still dislike the
proposed implementation, mostly because of the throttling of all allocations
regardless of whether they can contribute to the swap consumption or not.
Johannes Weiner April 21, 2020, 4:56 p.m. UTC | #17
On Tue, Apr 21, 2020 at 06:11:38PM +0200, Michal Hocko wrote:
> On Tue 21-04-20 10:27:46, Johannes Weiner wrote:
> > On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote:
> [...]
> > > I am also not sure about the isolation aspect. Because an external
> > > memory pressure might have pushed out memory to the swap and then the
> > > workload is throttled based on an external event. Compare that to the
> > > memory.high throttling which is not directly affected by the external
> > > pressure.
> > 
> > Neither memory.high nor swap.high isolate from external pressure.
> 
> I didn't say they do. What I am saying is that an external pressure
> might punish swap.high memcg because the external memory pressure would
> eat up the quota and trigger the throttling.

External pressure could also push a cgroup into a swap device that
happens to be very slow and cause the cgroup to be throttled that way.

But that effect is actually not undesirable. External pressure means
that something more important runs and needs the memory of something
less important (otherwise, memory.low would deflect this intrusion).

So we're punishing/deprioritizing the right cgroup here. The one that
isn't protected from memory pressure.

> It is fair to say that this externally triggered interference is already
> possible with swap.max as well though. It would likely be just more
> verbose because of the oom killer intervention rather than a slowdown.

Right.

> > They
> > are put on cgroups so they don't cause pressure on other cgroups. Swap
> > is required when either your footprint grows or your available space
> > shrinks. That's why it behaves like that.
> > 
> > That being said, I think we're getting lost in the implementation
> > details before we have established what the purpose of this all
> > is. Let's talk about this first.
> 
> Thanks for describing it in the length. I have a better picture of the
> intention (this should have been in the changelog ideally). I can see
> how the swap consumption throttling might be useful but I still dislike the
> proposed implementation. Mostly because of throttling of all allocations
> regardless whether they can contribute to the swap consumption or not.

I mean, even if they're not swappable, they can still contribute to
swap consumption that wouldn't otherwise have been there. Each new
page that comes in displaces another page at the end of the big LRU
pipeline and pushes it into the mouth of reclaim - which may swap. So
*every* allocation has a certain probability of increasing swap usage.

The fact that we have reached swap.high is a good hint that reclaim
has indeed been swapping quite aggressively to accommodate incoming
allocations, and probably will continue to do so.

We could check whether there are NO anon pages left in a workload, but
that's such an extreme and short-lived case that it probably wouldn't
make a difference in practice.

We could try to come up with a model that calculates the probability of
each new allocation causing swap. Whether that new allocation itself
is swapbacked would of course be a factor, but there are other factors
as well: the millions of existing LRU pages, the reclaim decisions we
will make, swappiness and so forth.

Of course, I agree with you, if all you have coming in is cache
allocations, you'd *eventually* run out of pages to swap.

However, 10G of new active cache allocations can still cause 10G of
already allocated anon pages to get swapped out. For example if a
malloc() leak happened *before* the regular cache workingset is
established. We cannot retro-actively throttle those anon pages, we
can only keep new allocations from pushing old ones into swap.
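
(A purely illustrative toy - nothing like this exists in the kernel or is
being proposed - combining the factors above might look roughly like:)

/*
 * Toy model only: estimate the chance that one more allocation ends up
 * causing a swap-out, from LRU composition, swappiness and allocation type.
 */
#include <stdio.h>

struct lru_state {
	unsigned long anon;		/* reclaimable anon pages on the LRUs */
	unsigned long file;		/* reclaimable file pages on the LRUs */
	unsigned int swappiness;	/* 0..100, as in vm.swappiness */
};

static double p_swap(const struct lru_state *s, int allocation_is_anon)
{
	double total = (double)s->anon + (double)s->file;
	double anon_share, bias;

	if (total == 0.0 || s->anon == 0)
		return 0.0;

	anon_share = (double)s->anon / total;	/* what reclaim has to pick from */
	bias = s->swappiness / 100.0;		/* how much reclaim prefers anon */

	/* an anon allocation also grows the pool that may later be swapped */
	return anon_share * bias * (allocation_is_anon ? 1.0 : 0.9);
}

int main(void)
{
	struct lru_state s = { .anon = 800000, .file = 200000, .swappiness = 60 };

	printf("anon alloc: %.2f, file alloc: %.2f\n",
	       p_swap(&s, 1), p_swap(&s, 0));
	return 0;
}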
Shakeel Butt April 21, 2020, 7:09 p.m. UTC | #18
Hi Johannes,

On Tue, Apr 21, 2020 at 7:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[snip]
>

The following is a very good description and it gave me an idea of how
you (FB) are approaching the memory overcommit problem. The approach
you are taking is very different from ours and I would like to pick
your brain on the why (sorry this might be a bit tangent to the
series).

Please correct me if I am wrong: your memory overcommit strategy is to
let the jobs use as much memory as they want, but when the system is
low on memory, slow everyone down (so as not to let the kernel oom-killer
trigger) and let the userspace oomd take care of releasing the
pressure.

We run multiple latency-sensitive jobs along with multiple batch jobs
on the machine. Overcommitting the memory on such machines, we learned
that the battle is already lost once the system starts doing direct
reclaim. Direct reclaim does not differentiate between the reclaimers.
We could have tried the "slow down" approach, but our latency-sensitive
jobs prefer to die and let the load-balancer hand the request over to
some other instance of the job rather than stall the request for a
non-deterministic time. We could have tried a PSI-like monitor to
trigger oom-kills when latency-sensitive jobs start seeing stalls,
but that would be less work-conserving and non-deterministic
(i.e. sometimes more oom-kills and sometimes more memory
overcommitted). The approach we took was to do proactive reclaim along
with a very low-latency refault medium (in-memory compression).

Now, as you mentioned, you are trying to be a bit more aggressive in
the memory overcommit, and I can see the writing on the wall that you
will be stuffing more jobs of different types on a machine. Why do you
think the "slow down" approach will be able to provide the performance
isolation guarantees?

Couple of questions inlined.

> Just imagine we had a really slow swap device. Some spinning disk that
> is terrible at random IO. From a performance point of view, this would
> obviously suck. But from a resource management point of view, this is
> actually pretty useful in slowing down a workload that is growing
> unsustainably. This is so useful, in fact, that Virtuozzo implemented
> virtual swap devices that are artificially slow to emulate this type
> of "punishment".
>
> A while ago, we didn't have any swap configured. We set memory.high
> and things were good: when things would go wrong and the workload
> expanded beyond reclaim capabilities, memory.high would inject sleeps
> until oomd would take care of the workload.
>
> Remember that the point is to avoid the kernel OOM killer and do OOM
> handling in userspace. That's the difference between memory.high and
> memory.max as well.
>
> However, in many cases we now want to overcommit more aggressively
> than memory.high would allow us. For this purpose, we're switching to
> memory.low, to only enforce limits when *physical* memory is
> short. And we've added swap to have some buffer zone at the edge of
> this aggressive overcommit.
>
> But swap has been a good news, bad news situation. The good news is
> that we have really fast swap, so if the workload is only temporarily
> a bit over RAM capacity, we can swap a few colder anon pages to tide
> the workload over, without the workload even noticing. This is
> fantastic from a performance point of view. It effectively increases
> our amount of available memory or the workingset sizes we can support.
>
> But the bad news is also that we have really fast swap. If we have a
> misbehaving workload that has a malloc() problem, we can *exhaust*
> swap space very, very quickly. Where we previously had those nice
> gradual slowdowns from memory.high when reclaim was failing, we now
> have very powerful reclaim that can swap at hundreds of megabytes per
> second - until swap is suddenly full and reclaim abruptly falls apart.

I think the concern is that the kernel oom-killer will be invoked too
early, without giving oomd a chance. I am wondering if the PSI polling
interface is usable here, as it can deliver events at millisecond
granularity. Would that be too noisy?
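
For concreteness, here is a rough sketch of the kind of PSI trigger I
have in mind, using the documented pressure-file interface; the cgroup
path and the thresholds below are made-up examples, not a
recommendation:

/*
 * Sketch only: arm a PSI trigger on a cgroup's memory.pressure file and
 * wake up when tasks stall on memory for more than 100ms within a 1s
 * window.  Path and thresholds are arbitrary illustrations.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/workload.slice/memory.pressure";
	const char *trig = "some 100000 1000000";   /* 100ms stall per 1s window */
	struct pollfd pfd;
	int fd;

	fd = open(path, O_RDWR | O_NONBLOCK);
	if (fd < 0 || write(fd, trig, strlen(trig) + 1) < 0) {
		perror("psi trigger setup");
		return 1;
	}

	for (;;) {
		pfd.fd = fd;
		pfd.events = POLLPRI;
		if (poll(&pfd, 1, -1) < 0)
			break;
		if (pfd.revents & POLLERR)
			break;                      /* trigger went away */
		if (pfd.revents & POLLPRI)
			printf("memory pressure event\n");  /* e.g. check free swap here */
	}
	close(fd);
	return 0;
}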

>
> So while fast swap is an enhancement to our memory capacity, it
> doesn't reliably act as that overcommit crumble zone that memory.high
> or slower swap devices used to give us.
>
> Should we replace those fast SSDs with crappy disks instead to achieve
> this effect? Or add a slow disk as a secondary swap device once the
> fast one is full? That would give us the desired effect, but obviously
> it would be kind of silly.
>
> That's where swap.high comes in. It gives us the performance of a fast
> drive during temporary dips into the overcommit buffer, while also
> providing that large rubber band kind of slowdown of a slow drive when
> the workload is expanding at an unsustainable trend.
>

BTW can you explain why a system-level low-swap slowdown is not
sufficient and a per-cgroup swap.high is needed? Or maybe you want to
slow down only specific cgroups.

> > There is also an aspect of non-determinism. There is no control over
> > the file vs. swap backed reclaim decision for memcgs. That means that
> > behavior is going to be very dependent on the internal implementation of
> > the reclaim. More swapping is going to fill up swap quota quicker.
>
> Haha, I mean that implies that reclaim is arbitrary. While it's
> certainly not perfect, we're trying to reclaim the pages that are
> least likely to be used again in the future. There is noise in this
> heuristic, obviously, but it's still going to correlate with reality
> and provide some level of determinism.
>
> The same is true for memory.high, btw. Depending on how effective
> reclaim is, we're going to throttle more or less. That's also going to
> fluctuate somewhat around implementation changes.
>
> > > It fits together with memory.low in that it prevents runaway anon allocation
> > > when swap can't be allocated anymore. It's addressing the same problem that
> > > memory.high slowdown does. It's just a different vector.
> >
> > I suspect that the problem is more related to the swap being handled as
> > a separate resource. And it is still not clear to me why it is easier
> > for you to tune swap.high than memory.high. You have said that you do
> > not want to set up memory.high because it is harder to tune but I do
> > not see why swap is easier in this regards. Maybe it is just that the
> > swap is almost never used so a bad estimate is much easier to tolerate
> > and you really do care about runaways?
>
> You hit the nail on the head.
>
> We don't want memory.high (in most cases) because we want to utilize
> memory to the absolute maximum.
>
> Obviously, the same isn't true for swap because there is no DaX and
> most workloads can't run when 80% of their workingset are on swap.
>
> They're not interchangeable resources.
>

What do you mean by not interchangeable? If I keep the hot memory (or
workingset) of a job in DRAM and the cold memory in swap, and control
the rate of refaults by controlling the definition of cold memory, then
I am using DRAM and swap interchangeably and transparently to the job
(that is what we actually do).

I am also wondering if you guys have explored an in-memory compression
based swap medium and if there are any reasons not to follow that
route.

Oh, you mentioned DAX; that brings to mind a very interesting topic.
Are you guys exploring the idea of using PMEM as cheap, slow memory?
It is byte-addressable, so, regarding memcg accounting, will you treat
it as memory or as a separate resource like swap in v2? How does your
memory overcommit model work with such a type of memory?

thanks,
Shakeel
Johannes Weiner April 21, 2020, 9:59 p.m. UTC | #19
On Tue, Apr 21, 2020 at 12:09:27PM -0700, Shakeel Butt wrote:
> Hi Johannes,
> 
> On Tue, Apr 21, 2020 at 7:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> [snip]
> >
> 
> The following is a very good description and it gave me an idea of how
> you (FB) are approaching the memory overcommit problem. The approach
> you are taking is very different from ours and I would like to pick
> your brain on the why (sorry this might be a bit tangent to the
> series).
> 
> Please correct me if I am wrong, your memory overcommit strategy is to
> let the jobs use memory as much as they want but when the system is
> low on memory, slow down everyone (to not let the kernel oom-killer
> trigger) and let the userspace oomd take care of releasing the
> pressure.
> 
> We run multiple latency sensitive jobs along with multiple batch jobs
> on the machine. Overcommitting the memory on such machines, we learn
> that the battle is already lost when the system starts doing direct
> reclaim. Direct reclaim does not differentiate between the reclaimers.
> We could have tried the "slow down" approach but our latency sensitive
> jobs prefer to die and let the load-balancer handover the request to
> some other instance of the job than to stall the request for
> non-deterministic time. We could have tried the PSI-like monitor to
> trigger oom-kills when latency sensitive jobs start seeing the stalls
> but that would be less work-conserving and  non-deterministic behavior
> (i.e. sometimes more oom-kills and sometimes more memory
> overcommitted). The approach we took was to do proactive reclaim along
> with a very low latency refault medium (in-memory compression).
>
> Now as you mentioned, you are trying to be a bit more aggressive in
> the memory overcommit and I can see the writing on the wall that you
> will be stuffing more jobs of different types on a machine, why do you
> think the "slow down" approach will be able to provide the performance
> isolation guarantees?

We do control very aggressive batch jobs to the extent that they have
negligible latency impact on interactive services running on the same
hosts. All the tools to do that are upstream and/or public, but it's
still pretty new stuff (memory.low, io.cost, cpu headroom control,
freezer) and they need to be put together just right.

We're working on a demo application that showcases how it all fits
together and hope to be ready to publish it soon.

> > Just imagine we had a really slow swap device. Some spinning disk that
> > is terrible at random IO. From a performance point of view, this would
> > obviously suck. But from a resource management point of view, this is
> > actually pretty useful in slowing down a workload that is growing
> > unsustainably. This is so useful, in fact, that Virtuozzo implemented
> > virtual swap devices that are artificially slow to emulate this type
> > of "punishment".
> >
> > A while ago, we didn't have any swap configured. We set memory.high
> > and things were good: when things would go wrong and the workload
> > expanded beyond reclaim capabilities, memory.high would inject sleeps
> > until oomd would take care of the workload.
> >
> > Remember that the point is to avoid the kernel OOM killer and do OOM
> > handling in userspace. That's the difference between memory.high and
> > memory.max as well.
> >
> > However, in many cases we now want to overcommit more aggressively
> > than memory.high would allow us. For this purpose, we're switching to
> > memory.low, to only enforce limits when *physical* memory is
> > short. And we've added swap to have some buffer zone at the edge of
> > this aggressive overcommit.
> >
> > But swap has been a good news, bad news situation. The good news is
> > that we have really fast swap, so if the workload is only temporarily
> > a bit over RAM capacity, we can swap a few colder anon pages to tide
> > the workload over, without the workload even noticing. This is
> > fantastic from a performance point of view. It effectively increases
> > our amount of available memory or the workingset sizes we can support.
> >
> > But the bad news is also that we have really fast swap. If we have a
> > misbehaving workload that has a malloc() problem, we can *exhaust*
> > swap space very, very quickly. Where we previously had those nice
> > gradual slowdowns from memory.high when reclaim was failing, we now
> > have very powerful reclaim that can swap at hundreds of megabytes per
> > second - until swap is suddenly full and reclaim abruptly falls apart.
> 
> I think the concern is kernel oom-killer will be invoked too early and
> not giving the chance to oomd.

Yes.

> I am wondering if the PSI polling interface is usable here as it can
> give events in milliseconds. Will that be too noisy?

Yes, it would be hard to sample OOM pressure reliably from CPU bound
reclaim alone. The difference between successful and failing reclaim
isn't all that big in terms of CPU cycles.

> > So while fast swap is an enhancement to our memory capacity, it
> > doesn't reliably act as that overcommit crumble zone that memory.high
> > or slower swap devices used to give us.
> >
> > Should we replace those fast SSDs with crappy disks instead to achieve
> > this effect? Or add a slow disk as a secondary swap device once the
> > fast one is full? That would give us the desired effect, but obviously
> > it would be kind of silly.
> >
> > That's where swap.high comes in. It gives us the performance of a fast
> > drive during temporary dips into the overcommit buffer, while also
> > providing that large rubber band kind of slowdown of a slow drive when
> > the workload is expanding at an unsustainable trend.
> 
> BTW can you explain why is the system level low swap slowdown not
> sufficient and a per-cgroup swap.high is needed? Or maybe you want to
> slow down only specific cgroups.

Yes, that's exactly it. We have a hostcritical.slice cgroup that hosts
oomd and sshd etc., and we want this cgroup to be able to allocate as
quickly as possible. It's more important than the nominal workload.

It needs a headroom of fast swap space after everybody else is already
getting throttled on their swap consumption.

> > > I suspect that the problem is more related to the swap being handled as
> > > a separate resource. And it is still not clear to me why it is easier
> > > for you to tune swap.high than memory.high. You have said that you do
> > > not want to set up memory.high because it is harder to tune but I do
> > > not see why swap is easier in this regards. Maybe it is just that the
> > > swap is almost never used so a bad estimate is much easier to tolerate
> > > and you really do care about runaways?
> >
> > You hit the nail on the head.
> >
> > We don't want memory.high (in most cases) because we want to utilize
> > memory to the absolute maximum.
> >
> > Obviously, the same isn't true for swap because there is no DaX and
> > most workloads can't run when 80% of their workingset are on swap.
> >
> > They're not interchangeable resources.
> >
> 
> What do you mean by not interchangeable? If I keep the hot memory (or
> workingset) of a job in DRAM and cold memory in swap and control the
> rate of refaults by controlling the definition of cold memory then I
> am using the DRAM and swap interchangeably and transparently to the
> job (that is what we actually do).

Right, that's a more precise definition than my randomly chosen "80%"
number above. There are parts of a workload's memory access curve
(where x is distinct data accessed and y is the access frequency) that
don't need to stay in RAM permanently and can be fetched on demand from
secondary storage without violating the workload's throughput/latency
requirements. For that part, RAM, swap, disk can be interchangeable.

I was specifically talking about the other half of that curve, and
meant to imply that that's usually bigger than 20%. Usually ;-)

I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't
matter whether it gets it in ram or in swap. There is a line somewhere
in between, and it'll vary with workload requirements, access patterns
and IO speed. But no workload can actually run with 10G of swap and 0
bytes worth of direct access memory, right?

Since you said before you're using combined memory+swap limits, I'm
assuming that you configure the resource as interchangeable, but still
have some way of determining where that cutoff line is between them -
either by tuning proactive reclaim toward that line or having OOM kill
policies when the line is crossed and latencies are violated?

> I am also wondering if you guys explored the in-memory compression
> based swap medium and if there are any reasons to not follow that
> route.

We played around with it, but I'm ambivalent about it.

You need to identify that perfect "warm" middle section of the
workingset curve that is 1) cold enough to not need permanent direct
access memory, yet 2) warm enough to justify allocating RAM to it.

A lot of our workloads have a distinguishable hot set and various
amounts of fairly cold data during stable states, with not too much
middle ground in between where compressed swap would really shine.

Do you use compressed swap fairly universally, or more specifically
for certain workloads?

> Oh you mentioned DAX, that brings to mind a very interesting topic.
> Are you guys exploring the idea of using PMEM as a cheap slow memory?
> It is byte-addressable, so, regarding memcg accounting, will you treat
> it as a memory or a separate resource like swap in v2? How does your
> memory overcommit model work with such a type of memory?

I think we (the kernel MM community, not we as in FB) are still some
ways away from having dynamic/transparent data placement for pmem the
same way we have for RAM. But I expect the kernel's high-level default
strategy to be similar: order virtual memory (the data) by access
frequency and distribute across physical memory/storage accordingly.

(With pmem being divided into volatile space and filesystem space,
where volatile space holds colder anon pages (and, if there is still a
disk, disk cache), and the sizing decisions between them being similar
to the ones we use for swap and filesystem today).

I expect cgroup policy to be separate, because to users the
performance difference matters. We won't want greedy batch
applications displacing latency sensitive ones from RAM into pmem,
just like we don't want this displacement into secondary storage
today. Other than that, there isn't too much difference to users,
because paging is already transparent - an mmapped() file looks the
same whether it's backed by RAM, by disk or by pmem. The difference is
access latencies and the aggregate throughput loss they add up to. So
I could see pmem cgroup limits and protections (for the volatile space
portion) the same way we have RAM limits and protections.

But yeah, I think this is going a bit off topic ;-)
Shakeel Butt April 21, 2020, 10:39 p.m. UTC | #20
On Tue, Apr 21, 2020 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[snip]
>
> We do control very aggressive batch jobs to the extent where they have
> negligible latency impact on interactive services running on the same
> hosts. All the tools to do that are upstream and/or public, but it's
> still pretty new stuff (memory.low, io.cost, cpu headroom control,
> freezer) and they need to be put together just right.
>
> We're working on a demo application that showcases how it all fits
> together and hope to be ready to publish it soon.
>

That would be awesome.

>
[snip]
> >
> > What do you mean by not interchangeable? If I keep the hot memory (or
> > workingset) of a job in DRAM and cold memory in swap and control the
> > rate of refaults by controlling the definition of cold memory then I
> > am using the DRAM and swap interchangeably and transparently to the
> > job (that is what we actually do).
>
> Right, that's a more precise definition than my randomly chosen "80%"
> number above. There are parts of a workload's memory access curve
> (where x is distinct data accessed and y is the access frequency) that
> don't need to stay in RAM permanently and can be got on-demand from
> secondary storage without violating the workload's throughput/latency
> requirements. For that part, RAM, swap, disk can be interchangeable.
>
> I was specifically talking about the other half of that curve, and
> meant to imply that that's usually bigger than 20%. Usually ;-)
>
> I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't
> matter whether it gets it in ram or in swap. There is a line somewhere
> in between, and it'll vary with workload requirements, access patterns
> and IO speed. But no workload can actually run with 10G of swap and 0
> bytes worth of direct access memory, right?

Yes.

>
> Since you said before you're using combined memory+swap limits, I'm
> assuming that you configure the resource as interchangeable, but still
> have some form of determining where that cutoff line is between them -
> either by tuning proactive reclaim toward that line or having OOM kill
> policies when the line is crossed and latencies are violated?
>

Yes, more specifically by tuning proactive reclaim towards that line.
We define that line in terms of an acceptable refault rate for the job.
The acceptable refault rate is measured through re-use and idle-page
histograms (these histograms are collected through our internal
implementation of Page Idle Tracking). I am planning to upstream and
open-source these.
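
For context, here is a rough sketch against the existing upstream
page_idle interface; our internal implementation differs, the PFN range
below is arbitrary, and this needs root plus CONFIG_IDLE_PAGE_TRACKING:

/*
 * Sketch of the upstream idle-page-tracking interface: each bit in
 * /sys/kernel/mm/page_idle/bitmap corresponds to one page frame, and the
 * file is accessed in 8-byte words.  Mark a PFN range idle, let the
 * workload run, then read the words back - bits that are still set mean
 * the page was not accessed in between.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	uint64_t word = ~0ULL;          /* all 64 pages in this word -> idle */
	off_t off;
	int fd;

	fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Mark the first 64k page frames idle (arbitrary example range). */
	for (off = 0; off < 1024 * 8; off += 8) {
		if (pwrite(fd, &word, sizeof(word), off) != sizeof(word))
			break;          /* holes/invalid PFNs end the sweep */
	}

	/*
	 * ... later, pread() the same words: set bits identify pages that
	 * were not referenced in between, which feeds the idle histograms.
	 */
	close(fd);
	return 0;
}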

> > I am also wondering if you guys explored the in-memory compression
> > based swap medium and if there are any reasons to not follow that
> > route.
>
> We played around with it, but I'm ambivalent about it.
>
> You need to identify that perfect "warm" middle section of the
> workingset curve that is 1) cold enough to not need permanent direct
> access memory, yet 2) warm enough to justify allocating RAM to it.
>
> A lot of our workloads have a distinguishable hot set and various
> amounts of fairly cold data during stable states, with not too much
> middle ground in between where compressed swap would really shine.
>
> Do you use compressed swap fairly universally, or more specifically
> for certain workloads?
>

Yes, we are using it fairly universally. There are a few exceptions,
like user-space networking and storage drivers.

> > Oh you mentioned DAX, that brings to mind a very interesting topic.
> > Are you guys exploring the idea of using PMEM as a cheap slow memory?
> > It is byte-addressable, so, regarding memcg accounting, will you treat
> > it as a memory or a separate resource like swap in v2? How does your
> > memory overcommit model work with such a type of memory?
>
> I think we (the kernel MM community, not we as in FB) are still some
> ways away from having dynamic/transparent data placement for pmem the
> same way we have for RAM. But I expect the kernel's high-level default
> strategy to be similar: order virtual memory (the data) by access
> frequency and distribute across physical memory/storage accordingly.
>
> (With pmem being divided into volatile space and filesystem space,
> where volatile space holds colder anon pages (and, if there is still a
> disk, disk cache), and the sizing decisions between them being similar
> as the ones we use for swap and filesystem today).
>
> I expect cgroup policy to be separate, because to users the
> performance difference matters. We won't want greedy batch
> applications displacing latency sensitive ones from RAM into pmem,
> just like we don't want this displacement into secondary storage
> today. Other than that, there isn't too much difference to users,
> because paging is already transparent - an mmapped() file looks the
> same whether it's backed by RAM, by disk or by pmem. The difference is
> access latencies and the aggregate throughput loss they add up to. So
> I could see pmem cgroup limits and protections (for the volatile space
> portion) the same way we have RAM limits and protections.
>
> But yeah, I think this is going a bit off topic ;-)

That's really interesting. Thanks for satisfying my curiosity.

thanks,
Shakeel
Michal Hocko April 22, 2020, 1:26 p.m. UTC | #21
On Tue 21-04-20 12:56:01, Johannes Weiner wrote:
> On Tue, Apr 21, 2020 at 06:11:38PM +0200, Michal Hocko wrote:
> > On Tue 21-04-20 10:27:46, Johannes Weiner wrote:
> > > On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote:
> > [...]
> > > > I am also not sure about the isolation aspect. Because an external
> > > > memory pressure might have pushed out memory to the swap and then the
> > > > workload is throttled based on an external event. Compare that to the
> > > > memory.high throttling which is not directly affected by the external
> > > > pressure.
> > > 
> > > Neither memory.high nor swap.high isolate from external pressure.
> > 
> > I didn't say they do. What I am saying is that an external pressure
> > might punish swap.high memcg because the external memory pressure would
> > eat up the quota and trigger the throttling.
> 
> External pressure could also push a cgroup into a swap device that
> happens to be very slow and cause the cgroup to be throttled that way.

Yes, but it would get throttled at fault time, when the swapped-out
memory is needed. Unless the anon workload actively doesn't fit into
memory, refaults are not that common. Compare that to continuous
throttling because your memory has been pushed out to swap and you
cannot do much about it without being slowed to a crawl.

> But that effect is actually not undesirable. External pressure means
> that something more important runs and needs the memory of something
> less important (otherwise, memory.low would deflect this intrusion).
> 
> So we're punishing/deprioritizing the right cgroup here. The one that
> isn't protected from memory pressure.
> 
> > It is fair to say that this externally triggered interference is already
> > possible with swap.max as well though. It would likely be just more
> > verbose because of the oom killer intervention rather than a slowdown.
> 
> Right.
> 
> > > They
> > > are put on cgroups so they don't cause pressure on other cgroups. Swap
> > > is required when either your footprint grows or your available space
> > > shrinks. That's why it behaves like that.
> > > 
> > > That being said, I think we're getting lost in the implementation
> > > details before we have established what the purpose of this all
> > > is. Let's talk about this first.
> > 
> > Thanks for describing it in the length. I have a better picture of the
> > intention (this should have been in the changelog ideally). I can see
> > how the swap consumption throttling might be useful but I still dislike the
> > proposed implementation. Mostly because of throttling of all allocations
> > regardless whether they can contribute to the swap consumption or not.
> 
> I mean, even if they're not swappable, they can still contribute to
> swap consumption that wouldn't otherwise have been there. Each new
> page that comes in displaces another page at the end of the big LRU
> pipeline and pushes it into the mouth of reclaim - which may swap. So
> *every* allocation has a certain probability of increasing swap usage.

You are right of course, and this makes a reasonable implementation of
swap.high far from trivial. I would even dare to say that an optimal
implementation is impossible, because the throttling cannot be done in
the reclaim context (at least not in your case, where you rely on
global reclaim).

> The fact that we have reached swap.high is a good hint that reclaim
> has indeed been swapping quite aggressively to accomodate incoming
> allocations, and probably will continue to do so.

You can fill up the swap space even without aggressive reclaim, so I
wouldn't make any assumptions based just on the amount of swapped-out
memory.
 
> We could check whether there are NO anon pages left in a workload, but
> that's such an extreme and short-lived case that it probably wouldn't
> make a difference in practice.
>
> We could try to come up with a model that calculates a probabilty of
> each new allocation to cause swap. Whether that new allocation itself
> is swapbacked would of course be a factor, but there are other factors
> as well: the millions of existing LRU pages, the reclaim decisions we
> will make, swappiness and so forth.

Yeah, an optimal solution likely doesn't exist. Some portion of
get_scan_count() could be used to get at least some clue about whether
swap-out is likely.
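
Something along these lines, purely as an illustration of the idea -
swap_high_overage() and memcg_swappable_anon_pages() below are made-up
helpers, only mem_cgroup_swappiness() exists today:

/*
 * Illustration only, not a proposal: skip the swap.high penalty when the
 * charging cgroup cannot plausibly add to swap consumption anymore.
 */
static bool swap_high_may_throttle(struct mem_cgroup *memcg)
{
	/* Below memory.swap.high - no penalty at all. */
	if (!swap_high_overage(memcg))
		return false;

	/* Swapping is disabled for this group, charges cannot add swap. */
	if (!mem_cgroup_swappiness(memcg))
		return false;

	/* Nothing left on the anon LRUs that reclaim could still swap out. */
	if (!memcg_swappable_anon_pages(memcg))
		return false;

	return true;
}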
 
> Of course, I agree with you, if all you have coming in is cache
> allocations, you'd *eventually* run out of pages to swap.
>
> However, 10G of new active cache allocations can still cause 10G of
> already allocated anon pages to get swapped out. For example if a
> malloc() leak happened *before* the regular cache workingset is
> established. We cannot retro-actively throttle those anon pages, we
> can only keep new allocations from pushing old ones into swap.

Yes, and this is the fundamental problem we have here, as I have
mentioned above as well. Throttling and swapout are simply not bound
together, so we can only guess. And that guessing is a concern, because
opinions on it might differ. For example, I really dislike the huge
hammer of throttling all charges, but I do see how reasonable people
might disagree on this matter.

That being said, I believe our discussion is missing an important part.
There is no description of the swap.high semantics. What can the user
expect when using it?
Johannes Weiner April 22, 2020, 2:15 p.m. UTC | #22
On Wed, Apr 22, 2020 at 03:26:32PM +0200, Michal Hocko wrote:
> That being said I believe our discussion is missing an important part.
> There is no description of the swap.high semantic. What can user expect
> when using it?

Good point, we should include that in cgroup-v2.rst. How about this?

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index bcc80269bb6a..49e8733a9d8a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1370,6 +1370,17 @@ PAGE_SIZE multiple when read back.
 	The total amount of swap currently being used by the cgroup
 	and its descendants.
 
+  memory.swap.high
+	A read-write single value file which exists on non-root
+	cgroups.  The default is "max".
+
+	Swap usage throttle limit.  If a cgroup's swap usage exceeds
+	this limit, allocations inside the cgroup will be throttled.
+
+	This slows down expansion of the group's memory footprint as
+	it runs out of assigned swap space. Compare to memory.swap.max,
+	which stops swapping abruptly and can provoke kernel OOM kills.
+
   memory.swap.max
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
Michal Hocko April 22, 2020, 3:43 p.m. UTC | #23
On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 03:26:32PM +0200, Michal Hocko wrote:
> > That being said I believe our discussion is missing an important part.
> > There is no description of the swap.high semantic. What can user expect
> > when using it?
> 
> Good point, we should include that in cgroup-v2.rst. How about this?
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index bcc80269bb6a..49e8733a9d8a 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1370,6 +1370,17 @@ PAGE_SIZE multiple when read back.
>  	The total amount of swap currently being used by the cgroup
>  	and its descendants.
>  
> +  memory.swap.high
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is "max".
> +
> +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> +	this limit, allocations inside the cgroup will be throttled.

Hm, this doesn't say which allocations are affected. That is good for
potential future changes, but I am not sure it is enough to make an
educated guess about the actual effects. One could expect that only
those allocations which could contribute to future memory.swap usage
are throttled. I fully realize that we do not want to be very specific,
but I believe we want to provide something useful. I am sorry, but I do
not have a good suggestion for how to make this better, mostly because
I still struggle with how this should behave to be sane.

I am also missing some information about what the user can actually do
about this situation, and an explicit call-out that the throttling is
not going away until the swap usage is shrunk, and that the kernel is
not capable of doing that on its own without help from userspace. This
is really different from memory.high, which has the means to deal with
the excess and shrink it down in most cases. The following would
clarify it for me:
	"Once the limit is exceeded it is expected that the userspace
	 is going to act and either free up the swapped out space
	 or tune the limit based on needs. The kernel itself is not
	 able to do that on its own.
	"

> +
> +	This slows down expansion of the group's memory footprint as
> +	it runs out of assigned swap space. Compare to memory.swap.max,
> +	which stops swapping abruptly and can provoke kernel OOM kills.
> +
>    memory.swap.max
>  	A read-write single value file which exists on non-root
>  	cgroups.  The default is "max".
Johannes Weiner April 22, 2020, 5:13 p.m. UTC | #24
On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 03:26:32PM +0200, Michal Hocko wrote:
> > > That being said I believe our discussion is missing an important part.
> > > There is no description of the swap.high semantic. What can user expect
> > > when using it?
> > 
> > Good point, we should include that in cgroup-v2.rst. How about this?
> > 
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index bcc80269bb6a..49e8733a9d8a 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1370,6 +1370,17 @@ PAGE_SIZE multiple when read back.
> >  	The total amount of swap currently being used by the cgroup
> >  	and its descendants.
> >  
> > +  memory.swap.high
> > +	A read-write single value file which exists on non-root
> > +	cgroups.  The default is "max".
> > +
> > +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> > +	this limit, allocations inside the cgroup will be throttled.
> 
> Hm, so this doesn't talk about which allocatios are affected. This is
> good for potential future changes but I am not sure this is useful to
> make any educated guess about the actual effects. One could expect that
> only those allocations which could contribute to future memory.swap
> usage. I fully realize that we do not want to be very specific but we
> want to provide something useful I believe. I am sorry but I do not have
> a good suggestion on how to make this better. Mostly because I still
> struggle on how this should behave to be sane.

I honestly don't really follow you here. Why is it not helpful to say
all allocations will slow down when condition X is met? We do the same
for memory.high.

> I am also missing some information about what the user can actually do
> about this situation and call out explicitly that the throttling is
> not going away until the swap usage is shrunk and the kernel is not
> capable of doing that on its own without a help from the userspace. This
> is really different from memory.high which has means to deal with the
> excess and shrink it down in most cases. The following would clarify it

I think we may be talking past each other. The user can do the same
thing as in any OOM situation: wait for the kill.

Swap being full is an OOM situation.

Yes, that does not match the kernel's internal definition of an OOM
situation. But we've already established that kernel OOM killing has a
different objective (memory deadlock avoidance) than userspace OOM
killing (quality of life) [1].

[1] https://lkml.org/lkml/2019/8/4/15

As Tejun said, things like earlyoom and oomd already kill based on
swap exhaustion, no further questions asked. Reclaim has been running
for a while, it went after all the low-hanging fruit: it doesn't swap
as long as there is easy cache; it also didn't just swap a little, it
filled up all of swap; and the pages in swap are all cold too, because
refaults would free that space again.

The workingset is hugely oversized for the available capacity, and
nobody has any interest in sticking around to see what tricks reclaim
still has up its sleeves (hint: nothing good). From here on out, it's
all thrashing and pain. The kernel might not OOM kill yet, but the quality
of life expectancy for a workload with full swap is trending toward zero.

We've been killing based on swap exhaustion as a stand-alone trigger
for several years now and it's never been the wrong call.

All swap.high does is acknowledge that swap-full is a common OOM
situation from a userspace view, and helps it handle that situation.

Just like memory.high acknowledges that if reclaim fails per kernel
definition, it's an OOM situation from a kernel view, and it helps
userspace handle that.

> for me
> 	"Once the limit is exceeded it is expected that the userspace
> 	 is going to act and either free up the swapped out space
> 	 or tune the limit based on needs. The kernel itself is not
> 	 able to do that on its own.
> 	"

I mean, in rare cases, maybe userspace can do some loadshedding and be
smart about it. But we certainly don't expect it to. Just like we
don't expect it to when memory.high starts injecting sleeps. We expect
the workload to die, usually.
Michal Hocko April 22, 2020, 6:49 p.m. UTC | #25
On Wed 22-04-20 13:13:28, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> > On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
[...]
> > > +	Swap usage throttle limit.  If a cgroup's swap usage exceeds
> > > +	this limit, allocations inside the cgroup will be throttled.
> > 
> > Hm, so this doesn't talk about which allocatios are affected. This is
> > good for potential future changes but I am not sure this is useful to
> > make any educated guess about the actual effects. One could expect that
> > only those allocations which could contribute to future memory.swap
> > usage. I fully realize that we do not want to be very specific but we
> > want to provide something useful I believe. I am sorry but I do not have
> > a good suggestion on how to make this better. Mostly because I still
> > struggle on how this should behave to be sane.
> 
> I honestly don't really follow you here. Why is it not helpful to say
> all allocations will slow down when condition X is met?

This might be just me, and I definitely do not want to pick on words
here, but your wording was not specific about which allocations. You
can very well interpret it as really all allocations, but I wouldn't be
surprised if some interpreted it to mean that the kernel doesn't
throttle unnecessarily - and if allocations cannot really contribute to
more swap, why should they be throttled?

> We do the same for memory.high.
>
> > I am also missing some information about what the user can actually do
> > about this situation and call out explicitly that the throttling is
> > not going away until the swap usage is shrunk and the kernel is not
> > capable of doing that on its own without a help from the userspace. This
> > is really different from memory.high which has means to deal with the
> > excess and shrink it down in most cases. The following would clarify it
> 
> I think we may be talking past each other. The user can do the same
> thing as in any OOM situation: wait for the kill.

That assumes that reaching swap.high is going to converge to OOM
eventually, and that is far from the general case. There might be a lot
of other reclaimable memory to reclaim, letting the workload stay in
its current state.
 
[...]

> > for me
> > 	"Once the limit is exceeded it is expected that the userspace
> > 	 is going to act and either free up the swapped out space
> > 	 or tune the limit based on needs. The kernel itself is not
> > 	 able to do that on its own.
> > 	"
> 
> I mean, in rare cases, maybe userspace can do some loadshedding and be
> smart about it. But we certainly don't expect it to.

I really didn't mean to suggest any clever swap management. All I
wanted to say, and have documented, is that users of swap.high should
be aware of the fact that the kernel is not able to do much to reduce
the throttling. This is really different from memory.high, where the
kernel pro-actively tries to keep the memory usage below the watermark.
So a certain level of userspace cooperation is really needed, unless
you can tolerate a workload being throttled until the end of time.

So let me be clear here: this is a very tricky interface to use, and
the more verbose we can be the better.
Johannes Weiner April 23, 2020, 3 p.m. UTC | #26
On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote:
> On Wed 22-04-20 13:13:28, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > > I am also missing some information about what the user can actually do
> > > about this situation and call out explicitly that the throttling is
> > > not going away until the swap usage is shrunk and the kernel is not
> > > capable of doing that on its own without a help from the userspace. This
> > > is really different from memory.high which has means to deal with the
> > > excess and shrink it down in most cases. The following would clarify it
> > 
> > I think we may be talking past each other. The user can do the same
> > thing as in any OOM situation: wait for the kill.
> 
> That assumes that reaching swap.high is going to converge to the OOM
> eventually. And that is far from the general case. There might be a
> lot of other reclaimable memory to reclaim and stay in the current
> state.

No, that's really the general case. And that's based on what users
widely experience, including us at FB. When swap is full, it's over.
Multiple parties have independently reached this conclusion.

This will be the default assumption in major distributions soon:
https://fedoraproject.org/wiki/Changes/EnableEarlyoom

> > > for me
> > > 	"Once the limit is exceeded it is expected that the userspace
> > > 	 is going to act and either free up the swapped out space
> > > 	 or tune the limit based on needs. The kernel itself is not
> > > 	 able to do that on its own.
> > > 	"
> > 
> > I mean, in rare cases, maybe userspace can do some loadshedding and be
> > smart about it. But we certainly don't expect it to.
> 
> I really didn't mean to suggest any clever swap management.  All I
> wanted so say and have documented is that users of swap.high should
> be aware of the fact that kernel is not able to do much to reduce the
> throttling. This is really different from memory.high where the kernel
> pro-actively tries to keep the memory usage below the watermark.  So a
> certain level of userspace cooperation is really needed unless you can
> tolerate a workload to be throttled to the end of times.

That's exactly what happens with memory.high. We've seen this. The
workload can go into a crawl and just stay there.

It's not unlike disabling the oom killer in cgroup1 without anybody
handling it. With memory.high, workloads *might* recover, but you have
to handle the ones that don't. Again, we inject sleeps into
memory.high when reclaim *is not* pushing back the workload anymore,
when reclaim is *failing*. The state isn't as stable as with
oom_control=0, but these indefinite hangs really happen in practice.
Realistically, you cannot use memory.high without an OOM manager.
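
To make the "inject sleeps" part concrete, here is a toy user-space
illustration of how an overage-based sleep can ramp up - the constants
are invented and this is not the actual mm/memcontrol.c code:

/*
 * Invented numbers, not kernel code: map how far usage is over the
 * "high" threshold to a sleep penalty that grows quadratically, so small
 * excursions barely hurt while a runaway grinds to a halt instead of
 * hitting a cliff.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t penalty_msecs(uint64_t usage, uint64_t high)
{
	uint64_t overage, delay;

	if (high == 0 || usage <= high)
		return 0;

	/* overage as a fraction of the threshold, in 1/1024 units */
	overage = ((usage - high) << 10) / high;

	/* quadratic ramp with an arbitrary 2s ceiling per charge attempt */
	delay = (overage * overage) >> 8;
	return delay > 2000 ? 2000 : delay;
}

int main(void)
{
	uint64_t usage;

	for (usage = 100; usage <= 200; usage += 20)
		printf("usage %3llu%% of high -> sleep %4llu ms\n",
		       (unsigned long long)usage,
		       (unsigned long long)penalty_msecs(usage, 100));
	return 0;
}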

The asymmetry you see between memory.high and swap.high comes from the
page cache. memory.high can set a stop to the mindless expansion of
the file cache and remove *unused* cache pages from the application's
workingset. It cannot permanently remove used cache pages, they'll
just refault. So unused cache is where reclaim is useful.

Once the workload expands its set of *used* pages past memory.high, we
are talking about indefinite slowdowns / OOM situations. Because at
that point, reclaim cannot push the workload back and everything will
be okay: the pages it takes off mean refaults and continued reclaim,
i.e. throttling. You get slowed down either way, and whether you
reclaim or sleep() is - to the workload - an accounting difference.

Reclaim does NOT have the power to help the workload get better. It
can only do amputations to protect the rest of the system, but it
cannot reduce the number of pages the workload is trying to access.

The only sustainable way out of such a throttling situation is either
an OOM kill or the workload voluntarily shrinking and reducing the
total number of pages it uses. And doesn't that sound familiar? :-)

The actual, observable effects of memory.high and swap.high semantics
are much more similar than you think they are: When the workload's
true workingset (not throwaway cache) grows past capacity (memory or
swap), we slow down further expansion until it either changes its mind
and shrinks, or userspace OOM handling takes care of it.
Michal Hocko April 24, 2020, 3:05 p.m. UTC | #27
On Thu 23-04-20 11:00:15, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote:
> > On Wed 22-04-20 13:13:28, Johannes Weiner wrote:
> > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > > > I am also missing some information about what the user can actually do
> > > > about this situation and call out explicitly that the throttling is
> > > > not going away until the swap usage is shrunk and the kernel is not
> > > > capable of doing that on its own without a help from the userspace. This
> > > > is really different from memory.high which has means to deal with the
> > > > excess and shrink it down in most cases. The following would clarify it
> > > 
> > > I think we may be talking past each other. The user can do the same
> > > thing as in any OOM situation: wait for the kill.
> > 
> > That assumes that reaching swap.high is going to converge to the OOM
> > eventually. And that is far from the general case. There might be a
> > lot of other reclaimable memory to reclaim and stay in the current
> > state.
> 
> No, that's really the general case. And that's based on what users
> widely experience, including us at FB. When swap is full, it's over.
> Multiple parties have independently reached this conclusion.

But we are talking about two things. You seem to be focusing on full
swap (the quota), while I am talking about swap.high, which doesn't
imply that the quota/full swap is going to be reached soon.

[...]

> The assymetry you see between memory.high and swap.high comes from the
> page cache. memory.high can set a stop to the mindless expansion of
> the file cache and remove *unused* cache pages from the application's
> workingset. It cannot permanently remove used cache pages, they'll
> just refault. So unused cache is where reclaim is useful.

Exactly! And I have seen memory.high being used to throttle heavy
page cache producers so that they do not disrupt other workloads.
 
> Once the workload expands its set of *used* pages past memory.high, we
> are talking about indefinite slowdowns / OOM situations. Because at
> that point, reclaim cannot push the workload back and everything will
> be okay: the pages it takes off mean refaults and continued reclaim,
> i.e. throttling. You get slowed down either way, and whether you
> reclaim or sleep() is - to the workload - an accounting difference.
>
> Reclaim does NOT have the power to help the workload get better. It
> can only do amputations to protect the rest of the system, but it
> cannot reduce the number of pages the workload is trying to access.

Yes, I do agree with you here, and I believe this scenario wasn't really
what the dispute is about. As soon as the real working set doesn't fit
into the high limit and is still growing, you are effectively OOM, and
either you handle that from userspace or you have to waaaaaaaaait for
the kernel oom killer to trigger.

But I believe this scenario is much easier to understand, because the
memory consumption is growing. What I find largely unintuitive from the
user POV is that the throttling will remain in place without userspace
intervention even when there is no runaway.

Let me give you an example. Say you have a peak load which pushes a
large part of idle memory out to swap - so much that it fills up
swap.high. The peak eventually finishes and frees up its resources. The
swap situation remains the same, because that memory is not refaulted
and we do not pro-actively swap memory back in (aka reclaim the swap
space). You are left with throttling even though the overall memcg
consumption is really low. The kernel is currently not able to do
anything about that, and userspace would need to be aware of the
situation and fault the swapped-out memory back in to get normal
behavior again. Do you think this is something so obvious that people
will keep it in mind when using swap.high?

Anyway, it seems that we are not making progress here. As I've said, I
believe that swap.high might lead to surprising behavior, and therefore
I would appreciate more clarity in the documentation. If you see a
problem with that for some reason then I can live with it; this is not
a reason to nack.
Johannes Weiner April 28, 2020, 2:24 p.m. UTC | #28
On Fri, Apr 24, 2020 at 05:05:10PM +0200, Michal Hocko wrote:
> On Thu 23-04-20 11:00:15, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote:
> > > On Wed 22-04-20 13:13:28, Johannes Weiner wrote:
> > > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> > > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > > > > I am also missing some information about what the user can actually do
> > > > > about this situation and call out explicitly that the throttling is
> > > > > not going away until the swap usage is shrunk and the kernel is not
> > > > > capable of doing that on its own without a help from the userspace. This
> > > > > is really different from memory.high which has means to deal with the
> > > > > excess and shrink it down in most cases. The following would clarify it
> > > > 
> > > > I think we may be talking past each other. The user can do the same
> > > > thing as in any OOM situation: wait for the kill.
> > > 
> > > That assumes that reaching swap.high is going to converge to the OOM
> > > eventually. And that is far from the general case. There might be a
> > > lot of other reclaimable memory to reclaim and stay in the current
> > > state.
> > 
> > No, that's really the general case. And that's based on what users
> > widely experience, including us at FB. When swap is full, it's over.
> > Multiple parties have independently reached this conclusion.
> 
> But we are talking about two things. You seem to be focusing on the full
> swap (quota) while I am talking about swap.high which doesn't imply
> that the quota/full swap is going to be reached soon.

Hm, I'm not quite sure I understand. swap.high is supposed to set this
quota. It's supposed to say: the workload has now shown such an
appetite for swap that it's unlikely to survive for much longer - draw
out its death just long enough for userspace OOM handling.

Maybe this is our misunderstanding?

It certainly doesn't make much sense to set swap.high to 0 or
relatively low values. Should we add the above to the doc text?

> > Once the workload expands its set of *used* pages past memory.high, we
> > are talking about indefinite slowdowns / OOM situations. Because at
> > that point, reclaim cannot push the workload back and everything will
> > be okay: the pages it takes off mean refaults and continued reclaim,
> > i.e. throttling. You get slowed down either way, and whether you
> > reclaim or sleep() is - to the workload - an accounting difference.
> >
> > Reclaim does NOT have the power to help the workload get better. It
> > can only do amputations to protect the rest of the system, but it
> > cannot reduce the number of pages the workload is trying to access.
> 
> Yes I do agree with you here and I believe this scenario wasn't really
> what the dispute is about. As soon as the real working set doesn't
> fit into the high limit and still growing then you are effectively
> OOM and either you do handle that from the userspace or you have to
> waaaaaaaaait for the kernel oom killer to trigger.
> 
> But I believe this scenario is much easier to understand because the
> memory consumption is growing. What I find largely unintuitive from the
> user POV is that the throttling will remain in place without a userspace
> intervention even when there is no runaway.
> 
> Let me give you an example. Say you have a peak load which pushes
> out a large part of an idle memory to swap. So much it fills up the
> swap.high. The peak eventually finishes freeing up its resources.  The
> swap situation remains the same because that memory is not refaulted and
> we do not pro-actively swap in memory (aka reclaim the swap space). You
> are left with throttling even though the overall memcg consumption is
> really low. Kernel is currently not able to do anything about that
> and the userspace would need to be aware of the situation to fault in
> swapped out memory back to get a normal behavior. Do you think this
> is something so obvious that people would keep it in mind when using
> swap.high?

Okay, thanks for clarifying, I understand your concern now.

This is not a scenario that swap.high is supposed to handle. It should
*not* be set to an amount of memory that the workload can reasonably
have sitting around idle. For example, if your memory allowance is
10G, it doesn't make sense to have swap.high at 200M or something.

It should be set to "we don't expect healthy workloads to get here".

And now I also understand what you mean by this being different from
memory.high. memory.high is definitely *expected* to get hit because
of the cache trimming usecase; we just don't expect the *throttling*
part to come into play unless the workload is truly unhealthy. But I
can see how user expectations toward swap.high could be different.

> Anyway, it seems that we are not making progress here. As I've said I
> believe that swap.high might lead to a surprising behavior and therefore
> I would appreciate more clarity in the documentation. If you see a
> problem with that for some reason then I can live with that. This is not
> a reason to nack.

No, I agree we should document this. How about the following?

  memory.swap.high
       A read-write single value file which exists on non-root
       cgroups.  The default is "max".

       Swap usage throttle limit.  If a cgroup's swap usage exceeds
       this limit, all its further allocations will be throttled to
       allow userspace to implement custom out-of-memory procedures.

       This limit marks a point of no return for the cgroup. It is NOT
       designed to manage the amount of swapping a workload does
       during regular operation. Compare to memory.swap.max, which
       prohibits swapping past a set amount, but lets the cgroup
       continue unimpeded as long as other memory can be reclaimed.
Michal Hocko April 29, 2020, 9:55 a.m. UTC | #29
On Tue 28-04-20 10:24:32, Johannes Weiner wrote:
> On Fri, Apr 24, 2020 at 05:05:10PM +0200, Michal Hocko wrote:
> > On Thu 23-04-20 11:00:15, Johannes Weiner wrote:
> > > On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote:
> > > > On Wed 22-04-20 13:13:28, Johannes Weiner wrote:
> > > > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote:
> > > > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote:
> > > > > > I am also missing some information about what the user can actually do
> > > > > > about this situation and call out explicitly that the throttling is
> > > > > > not going away until the swap usage is shrunk and the kernel is not
> > > > > > capable of doing that on its own without a help from the userspace. This
> > > > > > is really different from memory.high which has means to deal with the
> > > > > > excess and shrink it down in most cases. The following would clarify it
> > > > > 
> > > > > I think we may be talking past each other. The user can do the same
> > > > > thing as in any OOM situation: wait for the kill.
> > > > 
> > > > That assumes that reaching swap.high is going to converge to the OOM
> > > > eventually. And that is far from the general case. There might be a
> > > > lot of other reclaimable memory to reclaim and stay in the current
> > > > state.
> > > 
> > > No, that's really the general case. And that's based on what users
> > > widely experience, including us at FB. When swap is full, it's over.
> > > Multiple parties have independently reached this conclusion.
> > 
> > But we are talking about two things. You seem to be focusing on the full
> > swap (quota) while I am talking about swap.high which doesn't imply
> > that the quota/full swap is going to be reached soon.
> 
> Hm, I'm not quite sure I understand. swap.high is supposed to set this
> quota. It's supposed to say: the workload has now shown such an
> appetite for swap that it's unlikely to survive for much longer - draw
> out its death just long enough for userspace OOM handling.
> 
> Maybe this is our misunderstanding?

Probably. We already have a quota for swap (swap.max); the workload is
not allowed to swap out once the quota is reached. swap.high is
supposed to act as a preliminary measure that slows down swap
consumption beyond its limit.

> It certainly doesn't make much sense to set swap.high to 0 or
> relatively low values. Should we add the above to the doc text?
> 
> > > Once the workload expands its set of *used* pages past memory.high, we
> > > are talking about indefinite slowdowns / OOM situations. Because at
> > > that point, reclaim cannot push the workload back and everything will
> > > be okay: the pages it takes off mean refaults and continued reclaim,
> > > i.e. throttling. You get slowed down either way, and whether you
> > > reclaim or sleep() is - to the workload - an accounting difference.
> > >
> > > Reclaim does NOT have the power to help the workload get better. It
> > > can only do amputations to protect the rest of the system, but it
> > > cannot reduce the number of pages the workload is trying to access.
> > 
> > Yes I do agree with you here and I believe this scenario wasn't really
> > what the dispute is about. As soon as the real working set doesn't
> > fit into the high limit and still growing then you are effectively
> > OOM and either you do handle that from the userspace or you have to
> > waaaaaaaaait for the kernel oom killer to trigger.
> > 
> > But I believe this scenario is much easier to understand because the
> > memory consumption is growing. What I find largely unintuitive from the
> > user POV is that the throttling will remain in place without a userspace
> > intervention even when there is no runaway.
> > 
> > Let me give you an example. Say you have a peak load which pushes
> > out a large part of an idle memory to swap. So much it fills up the
> > swap.high. The peak eventually finishes freeing up its resources.  The
> > swap situation remains the same because that memory is not refaulted and
> > we do not pro-actively swap in memory (aka reclaim the swap space). You
> > are left with throttling even though the overall memcg consumption is
> > really low. Kernel is currently not able to do anything about that
> > and the userspace would need to be aware of the situation to fault in
> > swapped out memory back to get a normal behavior. Do you think this
> > is something so obvious that people would keep it in mind when using
> > swap.high?
> 
> Okay, thanks for clarifying, I understand your concern now.

Great that we are on the same page!

[...]

> No, I agree we should document this. How about the following?
> 
>   memory.swap.high
>        A read-write single value file which exists on non-root
>        cgroups.  The default is "max".
> 
>        Swap usage throttle limit.  If a cgroup's swap usage exceeds
>        this limit, all its further allocations will be throttled to
>        allow userspace to implement custom out-of-memory procedures.
> 
>        This limit marks a point of no return for the cgroup. It is NOT
>        designed to manage the amount of swapping a workload does
>        during regular operation. Compare to memory.swap.max, which
>        prohibits swapping past a set amount, but lets the cgroup
>        continue unimpeded as long as other memory can be reclaimed.

OK, this makes the intended use much clearer. I believe that it would
be helpful to also add your note that the value should be set to "we
don't expect healthy workloads to get here".

The usecase is quite narrow, and I expect people will start asking for
something to help manage the swap space somehow, which this will not be
a good fit for. That would require much more work to achieve sane
semantics, though. I am not aware of such usecases at this moment, so
this is really hard to argue about. I hope this will not backfire when
we reach that point, though.

That being said, I am not a huge fan of the new interface, but I can
see how it can be useful. I will not ack the patchset, but I will not
block it either.

Thanks for refining the documentation, and please make sure that the
changelogs in the next version describe the intended usecase as
mentioned in this email thread.