Message ID | 20200417010617.927266-1-kuba@kernel.org (mailing list archive) |
---|---|
Series | memcg: Slow down swap allocation as the available space gets depleted |
On Thu, Apr 16, 2020 at 6:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > Tejun describes the problem as follows: > > When swap runs out, there's an abrupt change in system behavior - > the anonymous memory suddenly becomes unmanageable which readily > breaks any sort of memory isolation and can bring down the whole > system. Can you please add more info on this abrupt change in system behavior, and what do you mean by anon memory becoming unmanageable? Once the system is in global reclaim and doing swapping, the memory isolation is already broken. Here I am assuming you are talking about memcg limit reclaim and that memcg limits are overcommitted. Shouldn't running out of swap trigger the OOM earlier, which should be better than impacting the whole system? > To avoid that, oomd [1] monitors free swap space and triggers > kills when it drops below the specific threshold (e.g. 15%). > > While this works, it's far from ideal: > - Depending on IO performance and total swap size, a given > headroom might not be enough or too much. > - oomd has to monitor swap depletion in addition to the usual > pressure metrics and it currently doesn't consider memory.swap.max. > > Solve this by adapting the same approach that memory.high uses - > slow down allocation as the resource gets depleted turning the > depletion behavior from abrupt cliff one to gradual degradation > observable through memory pressure metric. > > [1] https://github.com/facebookincubator/oomd > > Jakub Kicinski (3): > mm: prepare for swap over-high accounting and penalty calculation > mm: move penalty delay clamping out of calculate_high_delay() > mm: automatically penalize tasks with high swap use > > include/linux/memcontrol.h | 4 + > mm/memcontrol.c | 166 ++++++++++++++++++++++++++++--------- > 2 files changed, 131 insertions(+), 39 deletions(-) > > -- > 2.25.2 >
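For readers who haven't seen oomd, the userspace approach described in the cover letter reduces to something like the sketch below: watch swap occupancy and act once free swap drops under a headroom threshold. It is only an illustration of the idea (reading /proc/meminfo, with the 15% figure taken from the cover letter), not oomd's actual implementation.

```c
/*
 * Minimal sketch of an oomd/earlyoom-style swap watchdog: poll
 * /proc/meminfo and react when free swap falls below a headroom
 * threshold.  Illustration only, not oomd's actual code.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		FILE *f = fopen("/proc/meminfo", "r");
		char line[256];
		unsigned long swap_total = 0, swap_free = 0;

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			sscanf(line, "SwapTotal: %lu kB", &swap_total);
			sscanf(line, "SwapFree: %lu kB", &swap_free);
		}
		fclose(f);

		/* e.g. 15% free-swap headroom as in the cover letter */
		if (swap_total && swap_free * 100 < swap_total * 15)
			fprintf(stderr, "low swap: pick and kill a victim here\n");

		sleep(1);
	}
}
```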
Hello, On Fri, Apr 17, 2020 at 09:11:33AM -0700, Shakeel Butt wrote: > On Thu, Apr 16, 2020 at 6:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > Tejun describes the problem as follows: > > > > When swap runs out, there's an abrupt change in system behavior - > > the anonymous memory suddenly becomes unmanageable which readily > > breaks any sort of memory isolation and can bring down the whole > > system. > > Can you please add more info on this abrupt change in system behavior > and what do you mean by anon memory becoming unmanageable? In the sense that anonymous memory becomes essentially memlocked. > Once the system is in global reclaim and doing swapping the memory > isolation is already broken. Here I am assuming you are talking about There currently are issues with anonymous memory management which makes them different / worse than page cache but I don't follow why swapping necessarily means that isolation is broken. Page refaults don't indicate that memory isolation is broken after all. > memcg limit reclaim and memcg limits are overcommitted. Shouldn't > running out of swap will trigger the OOM earlier which should be > better than impacting the whole system. The primary scenario which was being considered was undercommitted protections but I don't think that makes any relevant differences. This is exactly similar to delay injection for memory.high. What's desired is slowing down the workload as the available resource is depleted so that the resource shortage presents as gradual degradation of performance and matching increase in resource PSI. This allows the situation to be detected and handled from userland while avoiding sudden and unpredictable behavior changes. Thanks.
Hi Tejun, On Fri, Apr 17, 2020 at 9:23 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Apr 17, 2020 at 09:11:33AM -0700, Shakeel Butt wrote: > > On Thu, Apr 16, 2020 at 6:06 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > Tejun describes the problem as follows: > > > > > > When swap runs out, there's an abrupt change in system behavior - > > > the anonymous memory suddenly becomes unmanageable which readily > > > breaks any sort of memory isolation and can bring down the whole > > > system. > > > > Can you please add more info on this abrupt change in system behavior > > and what do you mean by anon memory becoming unmanageable? > > In the sense that anonymous memory becomes essentially memlocked. > > > Once the system is in global reclaim and doing swapping the memory > > isolation is already broken. Here I am assuming you are talking about > > There currently are issues with anonymous memory management which makes them > different / worse than page cache but I don't follow why swapping > necessarily means that isolation is broken. Page refaults don't indicate > that memory isolation is broken after all. > Sorry, I meant the performance isolation. Direct reclaim does not really differentiate who to stall and whose CPU to use. > > memcg limit reclaim and memcg limits are overcommitted. Shouldn't > > running out of swap will trigger the OOM earlier which should be > > better than impacting the whole system. > > The primary scenario which was being considered was undercommitted > protections but I don't think that makes any relevant differences. > What are undercommitted protections? Does it mean there is still swap available on the system but the memcg is hitting its swap limit? > This is exactly similar to delay injection for memory.high. What's desired > is slowing down the workload as the available resource is depleted so that > the resource shortage presents as gradual degradation of performance and > matching increase in resource PSI. This allows the situation to be detected > and handled from userland while avoiding sudden and unpredictable behavior > changes. > Let me try to understand this with an example. Memcg 'A' has memory.high = 100 MiB, memory.max = 150 MiB and memory.swap.max = 50 MiB. When A's usage goes over 100 MiB, it will reclaim the anon, file and kmem. The anon will go to swap and increase its swap usage until it hits the limit. Now the 'A' reclaim_high has fewer things (file & kmem) to reclaim but the mem_cgroup_handle_over_high() will keep A's increase in usage in check. So, my question is: should the slowdown by memory.high depend on the reclaimable memory? If there is no reclaimable memory and the job hits memory.high, should the kernel slow it down to a crawl until the PSI monitor comes and decides what to do? If I understand correctly, the problem is the kernel slowdown is not successful when reclaimable memory is very low. Please correct me if I am wrong. Shakeel
Hello, On Fri, Apr 17, 2020 at 10:18:15AM -0700, Shakeel Butt wrote: > > There currently are issues with anonymous memory management which makes them > > different / worse than page cache but I don't follow why swapping > > necessarily means that isolation is broken. Page refaults don't indicate > > that memory isolation is broken after all. > > Sorry, I meant the performance isolation. Direct reclaim does not > really differentiate who to stall and whose CPU to use. Can you please elaborate concrete scenarios? I'm having a hard time seeing differences from page cache. > > > memcg limit reclaim and memcg limits are overcommitted. Shouldn't > > > running out of swap will trigger the OOM earlier which should be > > > better than impacting the whole system. > > > > The primary scenario which was being considered was undercommitted > > protections but I don't think that makes any relevant differences. > > > > What is undercommitted protections? Does it mean there is still swap > available on the system but the memcg is hitting its swap limit? Hahaha, I assumed you were talking about memory.high/max and was saying that the primary scenarios that were being considered was usage of memory.low interacting with swap. Again, can you please give a concrete example so that we don't misunderstand each other? > > This is exactly similar to delay injection for memory.high. What's desired > > is slowing down the workload as the available resource is depleted so that > > the resource shortage presents as gradual degradation of performance and > > matching increase in resource PSI. This allows the situation to be detected > > and handled from userland while avoiding sudden and unpredictable behavior > > changes. > > > > Let me try to understand this with an example. Memcg 'A' has Ah, you already went there. Great. > memory.high = 100 MiB, memory.max = 150 MiB and memory.swap.max = 50 > MiB. When A's usage goes over 100 MiB, it will reclaim the anon, file > and kmem. The anon will go to swap and increase its swap usage until > it hits the limit. Now the 'A' reclaim_high has fewer things (file & > kmem) to reclaim but the mem_cgroup_handle_over_high() will keep A's > increase in usage in check. > > So, my question is: should the slowdown by memory.high depends on the > reclaimable memory? If there is no reclaimable memory and the job hits > memory.high, should the kernel slow it down to crawl until the PSI > monitor comes and decides what to do. If I understand correctly, the > problem is the kernel slow down is not successful when reclaimable > memory is very low. Please correct me if I am wrong. In combination with memory.high, swap slowdown may not be necessary because memory.high's slow down mechanism is already there to handle the "can't swap" scenario whether that's because swap is disabled wholesale, limited or depleted. However, please consider the following scenario. cgroup A has memory.low protection and no other restrictions. cgroup B has no protection and has access to swap. When B's memory starts bloating and gets the system under memory contention, it'll start consuming swap until it can't. When swap becomes depleted for B, there's nothing holding it back and B will start eating into A's protection. The proposed mechanism just plugs another vector for the same condition where anonymous memory management breaks down because anonymous pages can no longer be reclaimed due to swap unavailability. Thanks.
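To make the scenario above concrete, it corresponds roughly to the cgroup2 setup sketched below. The paths and byte values are illustrative, and the final write shows where the knob proposed by this series (memory.swap.high, if I read the patches right) would slot in for B.

```c
/*
 * Sketch of the cgroup2 configuration for the A/B example above.
 * Paths and values are illustrative; error handling is trimmed.
 */
#include <stdio.h>

static int cg_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* A: work-conserving protection only, no hard limits */
	cg_write("/sys/fs/cgroup/A/memory.low", "4G");

	/*
	 * B: nothing set in the example, which is exactly the problem.
	 * With this series, a swap ceiling with gradual slowdown could
	 * be added, e.g.:
	 */
	cg_write("/sys/fs/cgroup/B/memory.swap.high", "2G");
	return 0;
}
```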
On Fri, Apr 17, 2020 at 10:36 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Apr 17, 2020 at 10:18:15AM -0700, Shakeel Butt wrote: > > > There currently are issues with anonymous memory management which makes them > > > different / worse than page cache but I don't follow why swapping > > > necessarily means that isolation is broken. Page refaults don't indicate > > > that memory isolation is broken after all. > > > > Sorry, I meant the performance isolation. Direct reclaim does not > > really differentiate who to stall and whose CPU to use. > > Can you please elaborate concrete scenarios? I'm having a hard time seeing > differences from page cache. > Oh I was talking about the global reclaim here. In global reclaim, any task can be throttled (throttle_direct_reclaim()). Memory freed by using the CPU of high priority low latency jobs can be stolen by low priority batch jobs. > > > > memcg limit reclaim and memcg limits are overcommitted. Shouldn't > > > > running out of swap will trigger the OOM earlier which should be > > > > better than impacting the whole system. > > > > > > The primary scenario which was being considered was undercommitted > > > protections but I don't think that makes any relevant differences. > > > > > > > What is undercommitted protections? Does it mean there is still swap > > available on the system but the memcg is hitting its swap limit? > > Hahaha, I assumed you were talking about memory.high/max and was saying that > the primary scenarios that were being considered was usage of memory.low > interacting with swap. Again, can you please give an concrete example so > that we don't misunderstand each other? > > > > This is exactly similar to delay injection for memory.high. What's desired > > > is slowing down the workload as the available resource is depleted so that > > > the resource shortage presents as gradual degradation of performance and > > > matching increase in resource PSI. This allows the situation to be detected > > > and handled from userland while avoiding sudden and unpredictable behavior > > > changes. > > > > > > > Let me try to understand this with an example. Memcg 'A' has > > Ah, you already went there. Great. > > > memory.high = 100 MiB, memory.max = 150 MiB and memory.swap.max = 50 > > MiB. When A's usage goes over 100 MiB, it will reclaim the anon, file > > and kmem. The anon will go to swap and increase its swap usage until > > it hits the limit. Now the 'A' reclaim_high has fewer things (file & > > kmem) to reclaim but the mem_cgroup_handle_over_high() will keep A's > > increase in usage in check. > > > > So, my question is: should the slowdown by memory.high depends on the > > reclaimable memory? If there is no reclaimable memory and the job hits > > memory.high, should the kernel slow it down to crawl until the PSI > > monitor comes and decides what to do. If I understand correctly, the > > problem is the kernel slow down is not successful when reclaimable > > memory is very low. Please correct me if I am wrong. > > In combination with memory.high, swap slowdown may not be necessary because > memory.high's slow down mechanism is already there to handle "can't swap" > scenario whether that's because swap is disabled wholesale, limited or > depleted. However, please consider the following scenario. > > cgroup A has memory.low protection and no other restrictions. cgroup B has > no protection and has access to swap. 
When B's memory starts bloating and > gets the system under memory contention, it'll start consuming swap until it > can't. When swap becomes depleted for B, there's nothing holding it back and > B will start eating into A's protection. > In this example does 'B' have memory.high and memory.max set and by A having no other restrictions, I am assuming you meant unlimited high and max for A? Can 'A' use memory.min? > The proposed mechanism just plugs another vector for the same condition > where anonymous memory management breaks down because they can no longer be > reclaimed due to swap unavailability. > > Thanks. > > -- > tejun
Hello, On Fri, Apr 17, 2020 at 10:51:10AM -0700, Shakeel Butt wrote: > > Can you please elaborate concrete scenarios? I'm having a hard time seeing > > differences from page cache. > > Oh I was talking about the global reclaim here. In global reclaim, any > task can be throttled (throttle_direct_reclaim()). Memory freed by > using the CPU of high priority low latency jobs can be stolen by low > priority batch jobs. I'm still having a hard time following this thread of discussion, most likely because my knowledge of mm is fleeting at best. Can you please ELI5 why the above is specifically relevant to this discussion? I'm gonna list two things that come to my mind just in case that'd help reducing the back and forth. * With protection based configurations, protected cgroups wouldn't usually go into direct reclaim themselves all that much. * We do have holes in accounting CPU cycles used by reclaim to the origins, which, for example, prevents making memory.high reclaim async and lets memory pressure contaminate cpu isolation possibly to a significant degree on lower core count machines in some scenarios, but that's a separate issue we need to address in the future. > > cgroup A has memory.low protection and no other restrictions. cgroup B has > > no protection and has access to swap. When B's memory starts bloating and > > gets the system under memory contention, it'll start consuming swap until it > > can't. When swap becomes depleted for B, there's nothing holding it back and > > B will start eating into A's protection. > > > > In this example does 'B' have memory.high and memory.max set and by A B doesn't have anything set. > having no other restrictions, I am assuming you meant unlimited high > and max for A? Can 'A' use memory.min? Sure, it can but 1. the purpose of the example is illustrating the incompleteness of the existing mechanism 2. there's a big difference between letting the machine hit the wall and waiting for the kernel OOM to trigger and being able to monitor the situation as it gradually develops and respond to it, which is the whole point of the low/high mechanisms. Thanks.
On Fri, Apr 17, 2020 at 12:35 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Apr 17, 2020 at 10:51:10AM -0700, Shakeel Butt wrote: > > > Can you please elaborate concrete scenarios? I'm having a hard time seeing > > > differences from page cache. > > > > Oh I was talking about the global reclaim here. In global reclaim, any > > task can be throttled (throttle_direct_reclaim()). Memory freed by > > using the CPU of high priority low latency jobs can be stolen by low > > priority batch jobs. > > I'm still having a hard time following this thread of discussion, most > likely because my knoweldge of mm is fleeting at best. Can you please ELI5 > why the above is specifically relevant to this discussion? > No, it is not relevant to this discussion "now". The mention of performance isolation in my first email was mostly due to my lack of understanding about what problem this patch series is trying to solve. So, let's skip this topic. > I'm gonna list two things that come to my mind just in case that'd help > reducing the back and forth. > > * With protection based configurations, protected cgroups wouldn't usually > go into direct reclaim themselves all that much. > > * We do have holes in accounting CPU cycles used by reclaim to the orgins, > which, for example, prevents making memory.high reclaim async and lets > memory pressure contaminate cpu isolation possibly to a significant degree > on lower core count machines in some scenarios, but that's a separate > issue we need to address in the future. > I have an opinion on the above but I will restrain as those are not relevant to the patch series. > > > cgroup A has memory.low protection and no other restrictions. cgroup B has > > > no protection and has access to swap. When B's memory starts bloating and > > > gets the system under memory contention, it'll start consuming swap until it > > > can't. When swap becomes depleted for B, there's nothing holding it back and > > > B will start eating into A's protection. > > > > > > > In this example does 'B' have memory.high and memory.max set and by A > > B doesn't have anything set. > > > having no other restrictions, I am assuming you meant unlimited high > > and max for A? Can 'A' use memory.min? > > Sure, it can but 1. the purpose of the example is illustrating the > imcompleteness of the existing mechanism I understand but is this a real world configuration people use and do we want to support the scenario where without setting high/max, the kernel still guarantees the isolation. > 2. there's a big difference between > letting the machine hit the wall and waiting for the kernel OOM to trigger > and being able to monitor the situation as it gradually develops and respond > to it, which is the whole point of the low/high mechanisms. > I am not really against the proposed solution. What I am trying to see is if this problem is more general than an anon/swap-full problem and if a more general solution is possible. To me it seems like, whenever a large portion of reclaimable memory (anon, file or kmem) becomes non-reclaimable abruptly, the memory isolation can be broken. You gave the anon/swap-full example, let me see if I can come up with file and kmem examples (with similar A & B). 1) B has a lot of page cache but temporarily gets pinned for rdma or something and the system gets low on memory. B can attack A's low protected memory as B's page cache is not reclaimable temporarily. 
2) B has a lot of dentries/inodes but someone has taken a write lock on shrinker_rwsem and got stuck in allocation/reclaim or CPU preempted. B can attack A's low protected memory as B's slabs are not reclaimable temporarily. I think the aim is to slow down B enough to give the PSI monitor a chance to act before either B targets A's protected memory or the kernel triggers oom-kill. My question is do we really want to solve the issue without limiting B through high/max? Also isn't fine grained PSI monitoring along with limiting B through memory.[high|max] general enough to solve all three example scenarios? thanks, Shakeel
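For reference, the fine-grained PSI monitoring mentioned above can be done with the kernel's documented pressure-trigger interface; a minimal sketch follows. The cgroup path and the 150ms-per-1s-window threshold are illustrative assumptions, not something prescribed by the series.

```c
/*
 * Sketch of a PSI trigger on a cgroup's memory.pressure file: wake up
 * when memory stalls exceed 150ms within any 1s window, then let the
 * monitor decide whether to kill or limit the offender.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char trig[] = "some 150000 1000000"; /* 150ms stall / 1s window */
	struct pollfd fds;

	fds.fd = open("/sys/fs/cgroup/B/memory.pressure", O_RDWR | O_NONBLOCK);
	if (fds.fd < 0)
		return 1;
	if (write(fds.fd, trig, strlen(trig) + 1) < 0)
		return 1;
	fds.events = POLLPRI;

	while (poll(&fds, 1, -1) >= 0) {
		if (fds.revents & POLLERR)
			break;	/* monitored cgroup went away */
		if (fds.revents & POLLPRI)
			printf("memory pressure event: act on cgroup B here\n");
	}
	return 0;
}
```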
Hello, Shakeel. On Fri, Apr 17, 2020 at 02:51:09PM -0700, Shakeel Butt wrote: > > > In this example does 'B' have memory.high and memory.max set and by A > > > > B doesn't have anything set. > > > > > having no other restrictions, I am assuming you meant unlimited high > > > and max for A? Can 'A' use memory.min? > > > > Sure, it can but 1. the purpose of the example is illustrating the > > imcompleteness of the existing mechanism > > I understand but is this a real world configuration people use and do > we want to support the scenario where without setting high/max, the > kernel still guarantees the isolation. Yes, that's the configuration we're deploying fleet-wide and at least the direction I'm gonna be pushing towards for reasons of generality and ease of use. Here's an example to illustrate the point - consider distros or upstream desktop environments wanting to provide basic resource configuration to protect user sessions and critical system services needed for user interaction by default. That is something which is clearly and immediately useful but also is extremely challenging to achieve with limits. There are no universally good enough upper limits. Any one number is gonna be both too high to guarantee protection and too low for use cases which legitimately need that much memory. That's because the upper limits aren't work-conserving and have a high chance of doing harm when misconfigured making figuring out the correct configuration almost impossible with per-use-case manual tuning. The whole idea behind memory.low and related efforts is resolving that problem by making memory control more work-conserving and forgiving, so that users can say something like "I want the user session to have at least 25% memory protected if needed and possible" and get most of the benefits of carefully crafted configuration. We're already deploying such configuration and it works well enough for a wide variety of workloads. > > 2. there's a big difference between > > letting the machine hit the wall and waiting for the kernel OOM to trigger > > and being able to monitor the situation as it gradually develops and respond > > to it, which is the whole point of the low/high mechanisms. > > I am not really against the proposed solution. What I am trying to see > is if this problem is more general than an anon/swap-full problem and > if a more general solution is possible. To me it seems like, whenever > a large portion of reclaimable memory (anon, file or kmem) becomes > non-reclaimable abruptly, the memory isolation can be broken. You gave > the anon/swap-full example, let me see if I can come up with file and > kmem examples (with similar A & B). > > 1) B has a lot of page cache but temporarily gets pinned for rdma or > something and the system gets low on memory. B can attack A's low > protected memory as B's page cache is not reclaimable temporarily. > > 2) B has a lot of dentries/inodes but someone has taken a write lock > on shrinker_rwsem and got stuck in allocation/reclaim or CPU > preempted. B can attack A's low protected memory as B's slabs are not > reclaimable temporarily. > > I think the aim is to slow down B enough to give the PSI monitor a > chance to act before either B targets A's protected memory or the > kernel triggers oom-kill. > > My question is do we really want to solve the issue without limiting B > through high/max? Also isn't fine grained PSI monitoring along with > limiting B through memory.[high|max] general enough to solve all three > example scenarios? 
Yes, we definitely want to solve the issue without involving high and max. I hope that part is clear now. As for whether we want to cover niche cases such as RDMA pinning a large swath of page cache, I don't know, maybe? But I don't think that's a problem with a comparable importance especially given that in both cases you listed the problem is temporary and the workload wouldn't have the ability to keep expanding undeterred. Thanks.
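As an aside, the "protect at least 25% of memory for the user session" configuration described above maps to something as simple as the sketch below; the cgroup path and the 25% figure are illustrative assumptions.

```c
/*
 * Sketch: compute 25% of total RAM and write it to the user session
 * cgroup's memory.low.  Path and percentage are illustrative.
 */
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;
	unsigned long long protect;
	FILE *f;

	if (sysinfo(&si))
		return 1;
	protect = (unsigned long long)si.totalram * si.mem_unit / 4;

	f = fopen("/sys/fs/cgroup/user.slice/memory.low", "w");
	if (!f)
		return 1;
	fprintf(f, "%llu\n", protect);
	return fclose(f);
}
```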
Hi Tejun, On Fri, Apr 17, 2020 at 3:59 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, Shakeel. > > On Fri, Apr 17, 2020 at 02:51:09PM -0700, Shakeel Butt wrote: > > > > In this example does 'B' have memory.high and memory.max set and by A > > > > > > B doesn't have anything set. > > > > > > > having no other restrictions, I am assuming you meant unlimited high > > > > and max for A? Can 'A' use memory.min? > > > > > > Sure, it can but 1. the purpose of the example is illustrating the > > > imcompleteness of the existing mechanism > > > > I understand but is this a real world configuration people use and do > > we want to support the scenario where without setting high/max, the > > kernel still guarantees the isolation. > > Yes, that's the configuration we're deploying fleet-wide and at least the > direction I'm gonna be pushing towards for reasons of generality and ease of > use. > > Here's an example to illustrate the point - consider distros or upstream > desktop environments wanting to provide basic resource configuration to > protect user sessions and critical system services needed for user > interaction by default. That is something which is clearly and immediately > useful but also is extremely challenging to achieve with limits. > > There are no universally good enough upper limits. Any one number is gonna > be both too high to guarantee protection and too low for use cases which > legitimately need that much memory. That's because the upper limits aren't > work-conserving and have a high chance of doing harm when misconfigured > making figuring out the correct configuration almost impossible with > per-use-case manual tuning. > > The whole idea behind memory.low and related efforts is resolving that > problem by making memory control more work-conserving and forgiving, so that > users can say something like "I want the user session to have at least 25% > memory protected if needed and possible" and get most of the benefits of > carefully crafted configuration. We're already deploying such configuration > and it works well enough for a wide variety of workloads. > I got the high level vision but I am very skeptical that in terms of memory and performance isolation this can provide anything better than best-effort QoS, which might be good enough for desktop users. However, for a server environment where multiple latency sensitive interactive jobs are co-hosted with multiple batch jobs and the machine's memory may be over-committed, this is a recipe for disaster. The only scenario where I think it might work is if there is only one job running on the machine. I do agree that finding the right upper limit is a challenge. For us, we have two types of users: the first knows exactly how many resources they want, and the second asks us to set the limits appropriately. We have an ML/history-based central system to dynamically set and adjust limits for jobs of such users. Coming back to this patch series: to me, it seems contrary to the vision you are presenting. Though the users are not setting memory.[high|max], they are setting swap.max, and this series is asking them to set one more tunable, i.e. swap.high. The approach more consistent with the presented vision is to throttle or slow down the allocators when the system swap is nearly full, and there is no need to set swap.max or swap.high. thanks, Shakeel
Hello, On Mon, Apr 20, 2020 at 09:12:54AM -0700, Shakeel Butt wrote: > I got the high level vision but I am very skeptical that in terms of > memory and performance isolation this can provide anything better than > best effort QoS which might be good enough for desktop users. However, I don't see that big a gap between desktop and server use cases. There sure are some tolerance differences but for majority of use cases that is a permeable boundary. I believe I can see where you're coming from and think that it'd be difficult to convince you out of the skepticism without concretely demonstrating the contrary, which we're actively working on. A directional point I wanna emphasize tho is that siloing these solutions into special "professional" only use is an easy pitfall which often obscures bigger possibilities and leads to developmental dead-ends and obsolescence. I believe it's a tendency which should be actively resisted and fought against. Servers really aren't all that special. > for a server environment where multiple latency sensitive interactive > jobs are co-hosted with multiple batch jobs and the machine's memory > may be over-committed, this is a recipe for disaster. The only > scenario where I think it might work is if there is only one job > running on the machine. Obviously, you can't overcommit on any resources for critical latency sensitive workloads whether one or multiple, but there also are other types of workloads which can be flexible with resource availability. > I do agree that finding the right upper limit is a challenge. For us, > we have two types of users, first, who knows exactly how much > resources they want and second ask us to set the limits appropriately. > We have a ML/history based central system to dynamically set and > adjust limits for jobs of such users. > > Coming back to this path series, to me, it seems like the patch series > is contrary to the vision you are presenting. Though the users are not > setting memory.[high|max] but they are setting swap.max and this > series is asking to set one more tunable i.e. swap.high. The approach > more consistent with the presented vision is to throttle or slow down > the allocators when the system swap is near full and there is no need > to set swap.max or swap.high. It's a piece of the puzzle to make memory protection work comprehensively. You can argue that the fact swap isn't protection based is against the direction but I find that argument rather facetious as swap is quite different resource from memory and it's not like I'm saying limits shouldn't be used at all. There sure still are missing pieces - ie. slowing down on global depletion, but that doesn't mean swap.high isn't useful. Thanks.
On Mon 20-04-20 12:47:40, Tejun Heo wrote: [...] > > Coming back to this path series, to me, it seems like the patch series > > is contrary to the vision you are presenting. Though the users are not > > setting memory.[high|max] but they are setting swap.max and this > > series is asking to set one more tunable i.e. swap.high. The approach > > more consistent with the presented vision is to throttle or slow down > > the allocators when the system swap is near full and there is no need > > to set swap.max or swap.high. I have the same impression as Shakeel here. The overall information we have here is really scarce. > It's a piece of the puzzle to make memory protection work comprehensively. > You can argue that the fact swap isn't protection based is against the > direction but I find that argument rather facetious as swap is quite > different resource from memory and it's not like I'm saying limits shouldn't > be used at all. There sure still are missing pieces - ie. slowing down on > global depletion, but that doesn't mean swap.high isn't useful. I have asked about the semantics of this already and didn't really get any real answer. So how does swap.high fit into the high limit semantics when it doesn't act as a limit? Considering that we cannot reclaim swap space, I find this really hard to grasp. We definitely need more information here!
Hello, On Mon, Apr 20, 2020 at 07:03:18PM +0200, Michal Hocko wrote: > I have asked about the semantic of this know already and didn't really > get any real answer. So how does swap.high fit into high limit semantic > when it doesn't act as a limit. Considering that we cannot reclaim swap > space I find this really hard to grasp. memory.high slow down is for the case when memory reclaim can't be depended upon for throttling, right? This is the same. Swap can't be reclaimed so the backpressure is applied by slowing down the source, the same way memory.high does. It fits together with memory.low in that it prevents runaway anon allocation when swap can't be allocated anymore. It's addressing the same problem that memory.high slowdown does. It's just a different vector. Thanks.
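To illustrate the "slow down the source" mechanism being discussed, here is a toy model of the penalty shape: no delay below the high threshold, then a clamped delay that grows quadratically with the overage. It mirrors the general shape of the memory.high slowdown that this series extends to swap; it is not the kernel's actual calculation.

```c
/*
 * Toy model of an overage-based allocation penalty: zero below the
 * high threshold, then a quadratic ramp clamped to a per-event max.
 * Illustrates the principle only, not the kernel implementation.
 */
static unsigned long long penalty_msecs(unsigned long long usage,
					unsigned long long high,
					unsigned long long max_penalty)
{
	unsigned long long overage, penalty;

	if (!high || usage <= high)
		return 0;

	/* overage as a fraction of the threshold, in 1/1024 units */
	overage = (usage - high) * 1024 / high;
	if (overage > 32 * 1024)	/* cap the ramp to keep the math sane */
		overage = 32 * 1024;

	/* quadratic ramp: small overages barely hurt, runaways stall hard */
	penalty = overage * overage / 1024;

	return penalty < max_penalty ? penalty : max_penalty;
}
```

With this shape a 10% overage costs roughly 10ms per event while a 2x overrun already costs about a second, independent of the absolute threshold, which is the gradual-degradation behavior the thread is talking about.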
On Mon 20-04-20 13:06:50, Tejun Heo wrote: > Hello, > > On Mon, Apr 20, 2020 at 07:03:18PM +0200, Michal Hocko wrote: > > I have asked about the semantic of this know already and didn't really > > get any real answer. So how does swap.high fit into high limit semantic > > when it doesn't act as a limit. Considering that we cannot reclaim swap > > space I find this really hard to grasp. > > memory.high slow down is for the case when memory reclaim can't be depended > upon for throttling, right? This is the same. Swap can't be reclaimed so the > backpressure is applied by slowing down the source, the same way memory.high > does. Hmm, but the two differ quite considerably in that we do not reclaim any swap, which means that while no reclaimable memory at all is pretty much the corner case (essentially OOM), no reclaimable swap is always in that state. So whenever you hit the high limit, there is no other way than to rely on userspace to unmap swap backed memory or increase the limit. Without that there is always throttling. The question also is what do you want to throttle in that case? Any swap backed allocation or swap based reclaim? The patch throttles any allocations unless I am misreading. This means that any other !swap backed allocations also get throttled as soon as the swap quota is reached. Is this really desirable behavior? I would find it quite surprising to say the least. I am also not sure about the isolation aspect. Because an external memory pressure might have pushed out memory to the swap and then the workload is throttled based on an external event. Compare that to the memory.high throttling which is not directly affected by the external pressure. There is also an aspect of non-determinism. There is no control over the file vs. swap backed reclaim decision for memcgs. That means that behavior is going to be very dependent on the internal implementation of the reclaim. More swapping is going to fill up swap quota quicker. > It fits together with memory.low in that it prevents runaway anon allocation > when swap can't be allocated anymore. It's addressing the same problem that > memory.high slowdown does. It's just a different vector. I suspect that the problem is more related to the swap being handled as a separate resource. And it is still not clear to me why it is easier for you to tune swap.high than memory.high. You have said that you do not want to set up memory.high because it is harder to tune but I do not see why swap is easier in this regard. Maybe it is just that the swap is almost never used so a bad estimate is much easier to tolerate and you really do care about runaways?
On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote: > On Mon 20-04-20 13:06:50, Tejun Heo wrote: > > Hello, > > > > On Mon, Apr 20, 2020 at 07:03:18PM +0200, Michal Hocko wrote: > > > I have asked about the semantic of this know already and didn't really > > > get any real answer. So how does swap.high fit into high limit semantic > > > when it doesn't act as a limit. Considering that we cannot reclaim swap > > > space I find this really hard to grasp. > > > > memory.high slow down is for the case when memory reclaim can't be depended > > upon for throttling, right? This is the same. Swap can't be reclaimed so the > > backpressure is applied by slowing down the source, the same way memory.high > > does. > > Hmm, but the two differ quite considerably that we do not reclaim any > swap which means that while no reclaimable memory at all is pretty much > the corner case (essentially OOM) the no reclaimable swap is always in > that state. So whenever you hit the high limit there is no other way > then rely on userspace to unmap swap backed memory or increase the limit. This is similar to memory.high. The memory.high throttling kicks in when reclaim is NOT keeping up with allocation rate. There may be some form of reclaim going on, but it's not bucking the trend, so you also rely on userspace to free memory voluntarily or increase the limit - or, of course, the throttling sleeps to grow until oomd kicks in. > Without that there is always throttling. The question also is what do > you want to throttle in that case? Any swap backed allocation or swap > based reclaim? The patch throttles any allocations unless I am > misreading. This means that also any other !swap backed allocations get > throttled as soon as the swap quota is reached. Is this really desirable > behavior? I would find it quite surprising to say the least. When cache or slab allocations enter reclaim, they also swap. We *could* be looking whether there are actual anon pages on the LRU lists at this point. But I don't think it matters in practice, please read on below. > I am also not sure about the isolation aspect. Because an external > memory pressure might have pushed out memory to the swap and then the > workload is throttled based on an external event. Compare that to the > memory.high throttling which is not directly affected by the external > pressure. Neither memory.high nor swap.high isolate from external pressure. They are put on cgroups so they don't cause pressure on other cgroups. Swap is required when either your footprint grows or your available space shrinks. That's why it behaves like that. That being said, I think we're getting lost in the implementation details before we have established what the purpose of this all is. Let's talk about this first. Just imagine we had a really slow swap device. Some spinning disk that is terrible at random IO. From a performance point of view, this would obviously suck. But from a resource management point of view, this is actually pretty useful in slowing down a workload that is growing unsustainably. This is so useful, in fact, that Virtuozzo implemented virtual swap devices that are artificially slow to emulate this type of "punishment". A while ago, we didn't have any swap configured. We set memory.high and things were good: when things would go wrong and the workload expanded beyond reclaim capabilities, memory.high would inject sleeps until oomd would take care of the workload. 
Remember that the point is to avoid the kernel OOM killer and do OOM handling in userspace. That's the difference between memory.high and memory.max as well. However, in many cases we now want to overcommit more aggressively than memory.high would allow us. For this purpose, we're switching to memory.low, to only enforce limits when *physical* memory is short. And we've added swap to have some buffer zone at the edge of this aggressive overcommit. But swap has been a good news, bad news situation. The good news is that we have really fast swap, so if the workload is only temporarily a bit over RAM capacity, we can swap a few colder anon pages to tide the workload over, without the workload even noticing. This is fantastic from a performance point of view. It effectively increases our amount of available memory or the workingset sizes we can support. But the bad news is also that we have really fast swap. If we have a misbehaving workload that has a malloc() problem, we can *exhaust* swap space very, very quickly. Where we previously had those nice gradual slowdowns from memory.high when reclaim was failing, we now have very powerful reclaim that can swap at hundreds of megabytes per second - until swap is suddenly full and reclaim abruptly falls apart. So while fast swap is an enhancement to our memory capacity, it doesn't reliably act as that overcommit crumble zone that memory.high or slower swap devices used to give us. Should we replace those fast SSDs with crappy disks instead to achieve this effect? Or add a slow disk as a secondary swap device once the fast one is full? That would give us the desired effect, but obviously it would be kind of silly. That's where swap.high comes in. It gives us the performance of a fast drive during temporary dips into the overcommit buffer, while also providing that large rubber band kind of slowdown of a slow drive when the workload is expanding at an unsustainable trend. > There is also an aspect of non-determinism. There is no control over > the file vs. swap backed reclaim decision for memcgs. That means that > behavior is going to be very dependent on the internal implementation of > the reclaim. More swapping is going to fill up swap quota quicker. Haha, I mean that implies that reclaim is arbitrary. While it's certainly not perfect, we're trying to reclaim the pages that are least likely to be used again in the future. There is noise in this heuristic, obviously, but it's still going to correlate with reality and provide some level of determinism. The same is true for memory.high, btw. Depending on how effective reclaim is, we're going to throttle more or less. That's also going to fluctuate somewhat around implementation changes. > > It fits together with memory.low in that it prevents runaway anon allocation > > when swap can't be allocated anymore. It's addressing the same problem that > > memory.high slowdown does. It's just a different vector. > > I suspect that the problem is more related to the swap being handled as > a separate resource. And it is still not clear to me why it is easier > for you to tune swap.high than memory.high. You have said that you do > not want to set up memory.high because it is harder to tune but I do > not see why swap is easier in this regards. Maybe it is just that the > swap is almost never used so a bad estimate is much easier to tolerate > and you really do care about runaways? You hit the nail on the head. 
We don't want memory.high (in most cases) because we want to utilize memory to the absolute maximum. Obviously, the same isn't true for swap because there is no DaX and most workloads can't run when 80% of their workingset are on swap. They're not interchangeable resources. So yes, swap only needs to be roughly sized, and we want to catch runaways. Sometimes they are caught by the IO slowdown, but for some access patterns the IO is too efficient and we need a bit of help when we're coming up against that wall. And we don't really care about not utilizing swap capacity to the absolute max. [ Hopefully that also answers your implementation questions above a bit better. We could be more specific about which allocations to slow down, only slow down if there are actual anon pages etc. But our goal isn't to emulate a realistic version of a slow swap device, we just want the overcommit crumble zone they provide. ]
Hello, On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote: > I suspect that the problem is more related to the swap being handled as > a separate resource. And it is still not clear to me why it is easier > for you to tune swap.high than memory.high. You have said that you do > not want to set up memory.high because it is harder to tune but I do > not see why swap is easier in this regards. Maybe it is just that the > swap is almost never used so a bad estimate is much easier to tolerate > and you really do care about runaways? Johannes responded a lot better. I'm just gonna add a bit here. Swap is intertwined with memory but is a very different resource from memory. You can't seriously equate primary and secondary storage. We never want to underutilize memory but we never want to completely fill up secondary storage. They're exactly the opposite in that sense. It's not that protection schemes can't apply to swap but that this level of dynamic control isn't required because a simple upper limit is useful and easy enough. Another backing point I want to raise is that the abrupt transition which happens on swap depletion is a real problem that userspace has been trying to work around. memory.low based protection and oomd is an obvious example but not the only one. earlyoom[1] is an independent project which predates all these things and kills when swap runs low to protect the system from going down the gutter. In this respect, both oomd and earlyoom basically do the same thing but they're racing against the kernel filling up the space. Once the swap space is gone, the programs themselves might not be able to make reasonable forward progress. The only measure they can currently employ is polling more frequently and killing earlier so that swap space never actually runs out, but it's a silly and losing game as the underlying device gets faster and faster. Note that at least Fedora is considering including either earlyoom or oomd by default. The problem addressed by swap.high is real and immediate. Thanks.
On Tue 21-04-20 10:27:46, Johannes Weiner wrote: > On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote: [...] > > I am also not sure about the isolation aspect. Because an external > > memory pressure might have pushed out memory to the swap and then the > > workload is throttled based on an external event. Compare that to the > > memory.high throttling which is not directly affected by the external > > pressure. > > Neither memory.high nor swap.high isolate from external pressure. I didn't say they do. What I am saying is that an external pressure might punish swap.high memcg because the external memory pressure would eat up the quota and trigger the throttling. It is fair to say that this externally triggered interference is already possible with swap.max as well though. It would likely be just more verbose because of the oom killer intervention rather than a slowdown. > They > are put on cgroups so they don't cause pressure on other cgroups. Swap > is required when either your footprint grows or your available space > shrinks. That's why it behaves like that. > > That being said, I think we're getting lost in the implementation > details before we have established what the purpose of this all > is. Let's talk about this first. Thanks for describing it in the length. I have a better picture of the intention (this should have been in the changelog ideally). I can see how the swap consumption throttling might be useful but I still dislike the proposed implementation. Mostly because of throttling of all allocations regardless whether they can contribute to the swap consumption or not.
On Tue, Apr 21, 2020 at 06:11:38PM +0200, Michal Hocko wrote: > On Tue 21-04-20 10:27:46, Johannes Weiner wrote: > > On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote: > [...] > > > I am also not sure about the isolation aspect. Because an external > > > memory pressure might have pushed out memory to the swap and then the > > > workload is throttled based on an external event. Compare that to the > > > memory.high throttling which is not directly affected by the external > > > pressure. > > > > Neither memory.high nor swap.high isolate from external pressure. > > I didn't say they do. What I am saying is that an external pressure > might punish swap.high memcg because the external memory pressure would > eat up the quota and trigger the throttling. External pressure could also push a cgroup into a swap device that happens to be very slow and cause the cgroup to be throttled that way. But that effect is actually not undesirable. External pressure means that something more important runs and needs the memory of something less important (otherwise, memory.low would deflect this intrusion). So we're punishing/deprioritizing the right cgroup here. The one that isn't protected from memory pressure. > It is fair to say that this externally triggered interference is already > possible with swap.max as well though. It would likely be just more > verbose because of the oom killer intervention rather than a slowdown. Right. > > They > > are put on cgroups so they don't cause pressure on other cgroups. Swap > > is required when either your footprint grows or your available space > > shrinks. That's why it behaves like that. > > > > That being said, I think we're getting lost in the implementation > > details before we have established what the purpose of this all > > is. Let's talk about this first. > > Thanks for describing it in the length. I have a better picture of the > intention (this should have been in the changelog ideally). I can see > how the swap consumption throttling might be useful but I still dislike the > proposed implementation. Mostly because of throttling of all allocations > regardless whether they can contribute to the swap consumption or not. I mean, even if they're not swappable, they can still contribute to swap consumption that wouldn't otherwise have been there. Each new page that comes in displaces another page at the end of the big LRU pipeline and pushes it into the mouth of reclaim - which may swap. So *every* allocation has a certain probability of increasing swap usage. The fact that we have reached swap.high is a good hint that reclaim has indeed been swapping quite aggressively to accommodate incoming allocations, and probably will continue to do so. We could check whether there are NO anon pages left in a workload, but that's such an extreme and short-lived case that it probably wouldn't make a difference in practice. We could try to come up with a model that calculates a probability of each new allocation causing swap. Whether that new allocation itself is swapbacked would of course be a factor, but there are other factors as well: the millions of existing LRU pages, the reclaim decisions we will make, swappiness and so forth. Of course, I agree with you: if all you have coming in is cache allocations, you'd *eventually* run out of pages to swap. However, 10G of new active cache allocations can still cause 10G of already allocated anon pages to get swapped out. For example, if a malloc() leak happened *before* the regular cache workingset was established.
We cannot retroactively throttle those anon pages; we can only keep new allocations from pushing old ones into swap.
Hi Johannes, On Tue, Apr 21, 2020 at 7:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > [snip] > The following is a very good description and it gave me an idea of how you (FB) are approaching the memory overcommit problem. The approach you are taking is very different from ours and I would like to pick your brain on the why (sorry this might be a bit tangent to the series). Please correct me if I am wrong, your memory overcommit strategy is to let the jobs use memory as much as they want but when the system is low on memory, slow down everyone (to not let the kernel oom-killer trigger) and let the userspace oomd take care of releasing the pressure. We run multiple latency sensitive jobs along with multiple batch jobs on the machine. Overcommitting the memory on such machines, we learn that the battle is already lost when the system starts doing direct reclaim. Direct reclaim does not differentiate between the reclaimers. We could have tried the "slow down" approach but our latency sensitive jobs prefer to die and let the load-balancer handover the request to some other instance of the job than to stall the request for non-deterministic time. We could have tried the PSI-like monitor to trigger oom-kills when latency sensitive jobs start seeing the stalls but that would be less work-conserving and non-deterministic behavior (i.e. sometimes more oom-kills and sometimes more memory overcommitted). The approach we took was to do proactive reclaim along with a very low latency refault medium (in-memory compression). Now as you mentioned, you are trying to be a bit more aggressive in the memory overcommit and I can see the writing on the wall that you will be stuffing more jobs of different types on a machine, why do you think the "slow down" approach will be able to provide the performance isolation guarantees? Couple of questions inlined. > Just imagine we had a really slow swap device. Some spinning disk that > is terrible at random IO. From a performance point of view, this would > obviously suck. But from a resource management point of view, this is > actually pretty useful in slowing down a workload that is growing > unsustainably. This is so useful, in fact, that Virtuozzo implemented > virtual swap devices that are artificially slow to emulate this type > of "punishment". > > A while ago, we didn't have any swap configured. We set memory.high > and things were good: when things would go wrong and the workload > expanded beyond reclaim capabilities, memory.high would inject sleeps > until oomd would take care of the workload. > > Remember that the point is to avoid the kernel OOM killer and do OOM > handling in userspace. That's the difference between memory.high and > memory.max as well. > > However, in many cases we now want to overcommit more aggressively > than memory.high would allow us. For this purpose, we're switching to > memory.low, to only enforce limits when *physical* memory is > short. And we've added swap to have some buffer zone at the edge of > this aggressive overcommit. > > But swap has been a good news, bad news situation. The good news is > that we have really fast swap, so if the workload is only temporarily > a bit over RAM capacity, we can swap a few colder anon pages to tide > the workload over, without the workload even noticing. This is > fantastic from a performance point of view. It effectively increases > our amount of available memory or the workingset sizes we can support. > > But the bad news is also that we have really fast swap. 
If we have a > misbehaving workload that has a malloc() problem, we can *exhaust* > swap space very, very quickly. Where we previously had those nice > gradual slowdowns from memory.high when reclaim was failing, we now > have very powerful reclaim that can swap at hundreds of megabytes per > second - until swap is suddenly full and reclaim abruptly falls apart. I think the concern is kernel oom-killer will be invoked too early and not giving the chance to oomd. I am wondering if the PSI polling interface is usable here as it can give events in milliseconds. Will that be too noisy? > > So while fast swap is an enhancement to our memory capacity, it > doesn't reliably act as that overcommit crumble zone that memory.high > or slower swap devices used to give us. > > Should we replace those fast SSDs with crappy disks instead to achieve > this effect? Or add a slow disk as a secondary swap device once the > fast one is full? That would give us the desired effect, but obviously > it would be kind of silly. > > That's where swap.high comes in. It gives us the performance of a fast > drive during temporary dips into the overcommit buffer, while also > providing that large rubber band kind of slowdown of a slow drive when > the workload is expanding at an unsustainable trend. > BTW can you explain why is the system level low swap slowdown not sufficient and a per-cgroup swap.high is needed? Or maybe you want to slow down only specific cgroups. > > There is also an aspect of non-determinism. There is no control over > > the file vs. swap backed reclaim decision for memcgs. That means that > > behavior is going to be very dependent on the internal implementation of > > the reclaim. More swapping is going to fill up swap quota quicker. > > Haha, I mean that implies that reclaim is arbitrary. While it's > certainly not perfect, we're trying to reclaim the pages that are > least likely to be used again in the future. There is noise in this > heuristic, obviously, but it's still going to correlate with reality > and provide some level of determinism. > > The same is true for memory.high, btw. Depending on how effective > reclaim is, we're going to throttle more or less. That's also going to > fluctuate somewhat around implementation changes. > > > > It fits together with memory.low in that it prevents runaway anon allocation > > > when swap can't be allocated anymore. It's addressing the same problem that > > > memory.high slowdown does. It's just a different vector. > > > > I suspect that the problem is more related to the swap being handled as > > a separate resource. And it is still not clear to me why it is easier > > for you to tune swap.high than memory.high. You have said that you do > > not want to set up memory.high because it is harder to tune but I do > > not see why swap is easier in this regards. Maybe it is just that the > > swap is almost never used so a bad estimate is much easier to tolerate > > and you really do care about runaways? > > You hit the nail on the head. > > We don't want memory.high (in most cases) because we want to utilize > memory to the absolute maximum. > > Obviously, the same isn't true for swap because there is no DaX and > most workloads can't run when 80% of their workingset are on swap. > > They're not interchangeable resources. > What do you mean by not interchangeable? 
If I keep the hot memory (or workingset) of a job in DRAM and cold memory in swap, and control the rate of refaults by controlling the definition of cold memory, then I am using DRAM and swap interchangeably and transparently to the job (that is what we actually do). I am also wondering if you guys explored the in-memory compression based swap medium and if there are any reasons to not follow that route. Oh, you mentioned DAX, and that brings to mind a very interesting topic. Are you guys exploring the idea of using PMEM as a cheap slow memory? It is byte-addressable, so, regarding memcg accounting, will you treat it as memory or as a separate resource like swap in v2? How does your memory overcommit model work with that type of memory? thanks, Shakeel
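For readers unfamiliar with the PSI trigger interface brought up above: a userspace monitor registers a stall threshold and time window by writing to a pressure file and then poll()s it for events. A minimal sketch, closely following the kernel's PSI documentation; the 150ms-per-1s trigger and the system-wide /proc/pressure/memory path are arbitrary choices here, and a cgroup's memory.pressure file is used the same way:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* "some" = at least one task stalled; 150ms of stall per 1s window */
        const char trig[] = "some 150000 1000000";
        struct pollfd fds;
        int fd;

        /* for a single cgroup, open <cgroup>/memory.pressure instead */
        fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        if (fd < 0)
                return 1;
        if (write(fd, trig, strlen(trig) + 1) < 0)
                return 1;

        fds.fd = fd;
        fds.events = POLLPRI;

        while (poll(&fds, 1, -1) >= 0) {
                if (fds.revents & POLLERR)
                        break;          /* monitor is no longer valid */
                if (fds.revents & POLLPRI)
                        printf("memory pressure threshold crossed\n");
        }
        close(fd);
        return 0;
}

Whether such triggers can fire early enough, and with a good enough signal-to-noise ratio, to beat the kernel OOM killer during a fast swap exhaustion is exactly the question taken up in the reply below.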
On Tue, Apr 21, 2020 at 12:09:27PM -0700, Shakeel Butt wrote: > Hi Johannes, > > On Tue, Apr 21, 2020 at 7:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > [snip] > > > > The following is a very good description and it gave me an idea of how > you (FB) are approaching the memory overcommit problem. The approach > you are taking is very different from ours and I would like to pick > your brain on the why (sorry this might be a bit tangent to the > series). > > Please correct me if I am wrong, your memory overcommit strategy is to > let the jobs use memory as much as they want but when the system is > low on memory, slow down everyone (to not let the kernel oom-killer > trigger) and let the userspace oomd take care of releasing the > pressure. > > We run multiple latency sensitive jobs along with multiple batch jobs > on the machine. Overcommitting the memory on such machines, we learn > that the battle is already lost when the system starts doing direct > reclaim. Direct reclaim does not differentiate between the reclaimers. > We could have tried the "slow down" approach but our latency sensitive > jobs prefer to die and let the load-balancer handover the request to > some other instance of the job than to stall the request for > non-deterministic time. We could have tried the PSI-like monitor to > trigger oom-kills when latency sensitive jobs start seeing the stalls > but that would be less work-conserving and non-deterministic behavior > (i.e. sometimes more oom-kills and sometimes more memory > overcommitted). The approach we took was to do proactive reclaim along > with a very low latency refault medium (in-memory compression). > > Now as you mentioned, you are trying to be a bit more aggressive in > the memory overcommit and I can see the writing on the wall that you > will be stuffing more jobs of different types on a machine, why do you > think the "slow down" approach will be able to provide the performance > isolation guarantees? We do control very aggressive batch jobs to the extent where they have negligible latency impact on interactive services running on the same hosts. All the tools to do that are upstream and/or public, but it's still pretty new stuff (memory.low, io.cost, cpu headroom control, freezer) and they need to be put together just right. We're working on a demo application that showcases how it all fits together and hope to be ready to publish it soon. > > Just imagine we had a really slow swap device. Some spinning disk that > > is terrible at random IO. From a performance point of view, this would > > obviously suck. But from a resource management point of view, this is > > actually pretty useful in slowing down a workload that is growing > > unsustainably. This is so useful, in fact, that Virtuozzo implemented > > virtual swap devices that are artificially slow to emulate this type > > of "punishment". > > > > A while ago, we didn't have any swap configured. We set memory.high > > and things were good: when things would go wrong and the workload > > expanded beyond reclaim capabilities, memory.high would inject sleeps > > until oomd would take care of the workload. > > > > Remember that the point is to avoid the kernel OOM killer and do OOM > > handling in userspace. That's the difference between memory.high and > > memory.max as well. > > > > However, in many cases we now want to overcommit more aggressively > > than memory.high would allow us. 
For this purpose, we're switching to > > memory.low, to only enforce limits when *physical* memory is > > short. And we've added swap to have some buffer zone at the edge of > > this aggressive overcommit. > > > > But swap has been a good news, bad news situation. The good news is > > that we have really fast swap, so if the workload is only temporarily > > a bit over RAM capacity, we can swap a few colder anon pages to tide > > the workload over, without the workload even noticing. This is > > fantastic from a performance point of view. It effectively increases > > our amount of available memory or the workingset sizes we can support. > > > > But the bad news is also that we have really fast swap. If we have a > > misbehaving workload that has a malloc() problem, we can *exhaust* > > swap space very, very quickly. Where we previously had those nice > > gradual slowdowns from memory.high when reclaim was failing, we now > > have very powerful reclaim that can swap at hundreds of megabytes per > > second - until swap is suddenly full and reclaim abruptly falls apart. > > I think the concern is kernel oom-killer will be invoked too early and > not giving the chance to oomd. Yes. > I am wondering if the PSI polling interface is usable here as it can > give events in milliseconds. Will that be too noisy? Yes, it would be hard to sample OOM pressure reliably from CPU bound reclaim alone. The difference between successful and failing reclaim isn't all that big in terms of CPU cycles. > > So while fast swap is an enhancement to our memory capacity, it > > doesn't reliably act as that overcommit crumble zone that memory.high > > or slower swap devices used to give us. > > > > Should we replace those fast SSDs with crappy disks instead to achieve > > this effect? Or add a slow disk as a secondary swap device once the > > fast one is full? That would give us the desired effect, but obviously > > it would be kind of silly. > > > > That's where swap.high comes in. It gives us the performance of a fast > > drive during temporary dips into the overcommit buffer, while also > > providing that large rubber band kind of slowdown of a slow drive when > > the workload is expanding at an unsustainable trend. > > BTW can you explain why is the system level low swap slowdown not > sufficient and a per-cgroup swap.high is needed? Or maybe you want to > slow down only specific cgroups. Yes, that's exactly it. We have a hostcritical.slice cgroup that hosts oomd and sshd etc., and we want this cgroup to be able to allocate as quickly as possible. It's more important than the nominal workload. It needs a headroom of fast swap space after everybody else is already getting throttled on their swap consumption. > > > I suspect that the problem is more related to the swap being handled as > > > a separate resource. And it is still not clear to me why it is easier > > > for you to tune swap.high than memory.high. You have said that you do > > > not want to set up memory.high because it is harder to tune but I do > > > not see why swap is easier in this regards. Maybe it is just that the > > > swap is almost never used so a bad estimate is much easier to tolerate > > > and you really do care about runaways? > > > > You hit the nail on the head. > > > > We don't want memory.high (in most cases) because we want to utilize > > memory to the absolute maximum. > > > > Obviously, the same isn't true for swap because there is no DaX and > > most workloads can't run when 80% of their workingset are on swap. 
> > > > They're not interchangeable resources. > > > > What do you mean by not interchangeable? If I keep the hot memory (or > workingset) of a job in DRAM and cold memory in swap and control the > rate of refaults by controlling the definition of cold memory then I > am using the DRAM and swap interchangeably and transparently to the > job (that is what we actually do). Right, that's a more precise definition than my randomly chosen "80%" number above. There are parts of a workload's memory access curve (where x is distinct data accessed and y is the access frequency) that don't need to stay in RAM permanently and can be fetched on demand from secondary storage without violating the workload's throughput/latency requirements. For that part, RAM, swap and disk can be interchangeable. I was specifically talking about the other half of that curve, and meant to imply that that's usually bigger than 20%. Usually ;-) I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't matter whether it gets it in ram or in swap. There is a line somewhere in between, and it'll vary with workload requirements, access patterns and IO speed. But no workload can actually run with 10G of swap and 0 bytes worth of direct access memory, right? Since you said before that you're using combined memory+swap limits, I'm assuming that you configure the resource as interchangeable, but still have some form of determining where that cutoff line is between them - either by tuning proactive reclaim toward that line or having OOM kill policies when the line is crossed and latencies are violated? > I am also wondering if you guys explored the in-memory compression > based swap medium and if there are any reasons to not follow that > route. We played around with it, but I'm ambivalent about it. You need to identify that perfect "warm" middle section of the workingset curve that is 1) cold enough to not need permanent direct access memory, yet 2) warm enough to justify allocating RAM to it. A lot of our workloads have a distinguishable hot set and various amounts of fairly cold data during stable states, with not too much middle ground in between where compressed swap would really shine. Do you use compressed swap fairly universally, or more specifically for certain workloads? > Oh you mentioned DAX, that brings to mind a very interesting topic. > Are you guys exploring the idea of using PMEM as a cheap slow memory? > It is byte-addressable, so, regarding memcg accounting, will you treat > it as a memory or a separate resource like swap in v2? How does your > memory overcommit model work with such a type of memory? I think we (the kernel MM community, not we as in FB) are still some ways away from having dynamic/transparent data placement for pmem the same way we have for RAM. But I expect the kernel's high-level default strategy to be similar: order virtual memory (the data) by access frequency and distribute across physical memory/storage accordingly. (With pmem being divided into volatile space and filesystem space, where volatile space holds colder anon pages (and, if there is still a disk, disk cache), and the sizing decisions between them being similar as the ones we use for swap and filesystem today). I expect cgroup policy to be separate, because to users the performance difference matters. We won't want greedy batch applications displacing latency sensitive ones from RAM into pmem, just like we don't want this displacement into secondary storage today.
Other than that, there isn't too much difference to users, because paging is already transparent - an mmapped() file looks the same whether it's backed by RAM, by disk or by pmem. The difference is access latencies and the aggregate throughput loss they add up to. So I could see pmem cgroup limits and protections (for the volatile space portion) the same way we have RAM limits and protections. But yeah, I think this is going a bit off topic ;-)
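To make the "put together just right" part above a bit more concrete, the kind of setup being described can be pictured as a cgroup2 hierarchy roughly along these lines. This is purely illustrative: apart from hostcritical.slice, the slice names, every value and the exact selection of knobs are invented for the example, and the cpu headroom piece is omitted because it is not a single upstream interface file:

    /sys/fs/cgroup
    |- hostcritical.slice          # oomd, sshd, ...; must stay responsive
    |      memory.low       = 1G       # protected from everyone else's reclaim
    |      memory.swap.high = max      # keeps its fast-swap headroom
    |- workload.slice              # latency sensitive main job
    |      memory.low       = 60G      # sized to the measured workingset
    |      io.weight        = 500
    |- batch.slice                 # best-effort batch jobs
           memory.low       = 0
           memory.swap.high = 2G       # rubber band for runaway anon growth
           io.weight        = 50
           cpu.weight       = 10

The general idea: memory.low and the weights express who matters when there is contention, cgroup.freeze is available on top to stop the batch jobs entirely, and memory.swap.high only marks the point at which a group's expansion has gone off the rails and oomd should get involved.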
On Tue, Apr 21, 2020 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > [snip] > > We do control very aggressive batch jobs to the extent where they have > negligible latency impact on interactive services running on the same > hosts. All the tools to do that are upstream and/or public, but it's > still pretty new stuff (memory.low, io.cost, cpu headroom control, > freezer) and they need to be put together just right. > > We're working on a demo application that showcases how it all fits > together and hope to be ready to publish it soon. > That would be awesome. > [snip] > > > > What do you mean by not interchangeable? If I keep the hot memory (or > > workingset) of a job in DRAM and cold memory in swap and control the > > rate of refaults by controlling the definition of cold memory then I > > am using the DRAM and swap interchangeably and transparently to the > > job (that is what we actually do). > > Right, that's a more precise definition than my randomly chosen "80%" > number above. There are parts of a workload's memory access curve > (where x is distinct data accessed and y is the access frequency) that > don't need to stay in RAM permanently and can be got on-demand from > secondary storage without violating the workload's throughput/latency > requirements. For that part, RAM, swap, disk can be interchangeable. > > I'm was specifically talking about the other half of that curve, and > meant to imply that that's usually bigger than 20%. Usually ;-) > > I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't > matter whether it gets it in ram or in swap. There is a line somewhere > in between, and it'll vary with workload requirements, access patterns > and IO speed. But no workload can actually run with 10G of swap and 0 > bytes worth of direct access memory, right? Yes. > > Since you said before you're using combined memory+swap limits, I'm > assuming that you configure the resource as interchangeable, but still > have some form of determining where that cutoff line is between them - > either by tuning proactive reclaim toward that line or having OOM kill > policies when the line is crossed and latencies are violated? > Yes, more specifically tuning proactive reclaim towards that line. We define that line in terms of acceptable refault rate for the job. The acceptable refault rate is measured through re-use and idle page histograms (these histograms are collected through our internal implementation of Page Idle Tracking). I am planning to upstream and open-source these. > > I am also wondering if you guys explored the in-memory compression > > based swap medium and if there are any reasons to not follow that > > route. > > We played around with it, but I'm ambivalent about it. > > You need to identify that perfect "warm" middle section of the > workingset curve that is 1) cold enough to not need permanent direct > access memory, yet 2) warm enough to justify allocating RAM to it. > > A lot of our workloads have a distinguishable hot set and various > amounts of fairly cold data during stable states, with not too much > middle ground in between where compressed swap would really shine. > > Do you use compressed swap fairly universally, or more specifically > for certain workloads? > Yes, we are using it fairly universally. There are few exceptions like user space net and storage drivers. > > Oh you mentioned DAX, that brings to mind a very interesting topic. > > Are you guys exploring the idea of using PMEM as a cheap slow memory? 
> > It is byte-addressable, so, regarding memcg accounting, will you treat > > it as a memory or a separate resource like swap in v2? How does your > > memory overcommit model work with such a type of memory? > > I think we (the kernel MM community, not we as in FB) are still some > ways away from having dynamic/transparent data placement for pmem the > same way we have for RAM. But I expect the kernel's high-level default > strategy to be similar: order virtual memory (the data) by access > frequency and distribute across physical memory/storage accordingly. > > (With pmem being divided into volatile space and filesystem space, > where volatile space holds colder anon pages (and, if there is still a > disk, disk cache), and the sizing decisions between them being similar > as the ones we use for swap and filesystem today). > > I expect cgroup policy to be separate, because to users the > performance difference matters. We won't want greedy batch > applications displacing latency sensitive ones from RAM into pmem, > just like we don't want this displacement into secondary storage > today. Other than that, there isn't too much difference to users, > because paging is already transparent - an mmapped() file looks the > same whether it's backed by RAM, by disk or by pmem. The difference is > access latencies and the aggregate throughput loss they add up to. So > I could see pmem cgroup limits and protections (for the volatile space > portion) the same way we have RAM limits and protections. > > But yeah, I think this is going a bit off topic ;-) That's really interesting. Thanks for appeasing my curiosity. thanks, Shakeel
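For reference, the upstream counterpart of the page idle tracking mentioned above is the sysfs bitmap interface; whether the internal implementation referred to works the same way is not stated, so treat this only as a sketch of the mainline facility (requires CONFIG_IDLE_PAGE_TRACKING and root, only tracks user/LRU pages; the pfn is an arbitrary example):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* each 64-bit word in the bitmap covers 64 consecutive page frames */
static int set_idle(int fd, uint64_t pfn)
{
        uint64_t bits = 1ULL << (pfn % 64);

        /* a set bit marks the frame idle; any access makes the kernel clear it */
        return pwrite(fd, &bits, sizeof(bits), (pfn / 64) * 8) == sizeof(bits) ? 0 : -1;
}

static int still_idle(int fd, uint64_t pfn)
{
        uint64_t bits;

        if (pread(fd, &bits, sizeof(bits), (pfn / 64) * 8) != sizeof(bits))
                return -1;
        return !!(bits & (1ULL << (pfn % 64)));
}

int main(void)
{
        uint64_t pfn = 0x10000;     /* arbitrary example page frame */
        int fd = open("/sys/kernel/mm/page_idle/bitmap", O_RDWR);

        if (fd < 0)
                return 1;
        if (set_idle(fd, pfn))
                return 1;
        sleep(60);                  /* sampling interval */
        printf("pfn %llx %s\n", (unsigned long long)pfn,
               still_idle(fd, pfn) == 1 ? "not accessed" : "accessed (or not tracked)");
        close(fd);
        return 0;
}

Aggregating such samples per cgroup over time is what turns this into the idle/re-use histograms described above.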
On Tue 21-04-20 12:56:01, Johannes Weiner wrote: > On Tue, Apr 21, 2020 at 06:11:38PM +0200, Michal Hocko wrote: > > On Tue 21-04-20 10:27:46, Johannes Weiner wrote: > > > On Tue, Apr 21, 2020 at 01:06:12PM +0200, Michal Hocko wrote: > > [...] > > > > I am also not sure about the isolation aspect. Because an external > > > > memory pressure might have pushed out memory to the swap and then the > > > > workload is throttled based on an external event. Compare that to the > > > > memory.high throttling which is not directly affected by the external > > > > pressure. > > > > > > Neither memory.high nor swap.high isolate from external pressure. > > > > I didn't say they do. What I am saying is that an external pressure > > might punish swap.high memcg because the external memory pressure would > > eat up the quota and trigger the throttling. > > External pressure could also push a cgroup into a swap device that > happens to be very slow and cause the cgroup to be throttled that way. Yes but it would get throttled at the fault time when the swapped out memory is needed. Unless the anon workload actively doesn't fit into memory then refaults are not that common. Compare that to a continuous throttling because your memory has been pushed out to swap and you cannot do much about that without being slowed down to crawling. > But that effect is actually not undesirable. External pressure means > that something more important runs and needs the memory of something > less important (otherwise, memory.low would deflect this intrusion). > > So we're punishing/deprioritizing the right cgroup here. The one that > isn't protected from memory pressure. > > > It is fair to say that this externally triggered interference is already > > possible with swap.max as well though. It would likely be just more > > verbose because of the oom killer intervention rather than a slowdown. > > Right. > > > > They > > > are put on cgroups so they don't cause pressure on other cgroups. Swap > > > is required when either your footprint grows or your available space > > > shrinks. That's why it behaves like that. > > > > > > That being said, I think we're getting lost in the implementation > > > details before we have established what the purpose of this all > > > is. Let's talk about this first. > > > > Thanks for describing it in the length. I have a better picture of the > > intention (this should have been in the changelog ideally). I can see > > how the swap consumption throttling might be useful but I still dislike the > > proposed implementation. Mostly because of throttling of all allocations > > regardless whether they can contribute to the swap consumption or not. > > I mean, even if they're not swappable, they can still contribute to > swap consumption that wouldn't otherwise have been there. Each new > page that comes in displaces another page at the end of the big LRU > pipeline and pushes it into the mouth of reclaim - which may swap. So > *every* allocation has a certain probability of increasing swap usage. You are right of course and this makes an reasonable implementation of swap.high far from trivial. I would even dare to say that an optimal implementation is impossible because the throttling cannot be done in the reclaim context (at least not in your case where you rely on the global reclaim). > The fact that we have reached swap.high is a good hint that reclaim > has indeed been swapping quite aggressively to accomodate incoming > allocations, and probably will continue to do so. 
You can fill up swap space even without aggressive reclaim, so I wouldn't make any assumptions based purely on the amount of swapped-out memory. > We could check whether there are NO anon pages left in a workload, but > that's such an extreme and short-lived case that it probably wouldn't > make a difference in practice. > > We could try to come up with a model that calculates a probability of > each new allocation to cause swap. Whether that new allocation itself > is swapbacked would of course be a factor, but there are other factors > as well: the millions of existing LRU pages, the reclaim decisions we > will make, swappiness and so forth. Yeah, an optimal solution likely doesn't exist. Some portion of get_scan_count could be used to get at least some clue on whether swap-out is likely. > Of course, I agree with you, if all you have coming in is cache > allocations, you'd *eventually* run out of pages to swap. > > However, 10G of new active cache allocations can still cause 10G of > already allocated anon pages to get swapped out. For example if a > malloc() leak happened *before* the regular cache workingset is > established. We cannot retro-actively throttle those anon pages, we > can only keep new allocations from pushing old ones into swap. Yes, and this is the fundamental problem we have here, as I mentioned above. Throttling and swapout are simply not bound together, so we can only guess. And that guessing is a concern because opinions on it might differ. For example, I really dislike the huge hammer of throttling all charges, but I do see how reasonable people might disagree on this matter. That being said, I believe our discussion is missing an important part. There is no description of the swap.high semantics. What can the user expect when using it?
On Wed, Apr 22, 2020 at 03:26:32PM +0200, Michal Hocko wrote: > That being said I believe our discussion is missing an important part. > There is no description of the swap.high semantic. What can user expect > when using it? Good point, we should include that in cgroup-v2.rst. How about this? diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index bcc80269bb6a..49e8733a9d8a 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1370,6 +1370,17 @@ PAGE_SIZE multiple when read back. The total amount of swap currently being used by the cgroup and its descendants. + memory.swap.high + A read-write single value file which exists on non-root + cgroups. The default is "max". + + Swap usage throttle limit. If a cgroup's swap usage exceeds + this limit, allocations inside the cgroup will be throttled. + + This slows down expansion of the group's memory footprint as + it runs out of assigned swap space. Compare to memory.swap.max, + which stops swapping abruptly and can provoke kernel OOM kills. + memory.swap.max A read-write single value file which exists on non-root cgroups. The default is "max".
On Wed 22-04-20 10:15:14, Johannes Weiner wrote: > On Wed, Apr 22, 2020 at 03:26:32PM +0200, Michal Hocko wrote: > > That being said I believe our discussion is missing an important part. > > There is no description of the swap.high semantic. What can user expect > > when using it? > > Good point, we should include that in cgroup-v2.rst. How about this? > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index bcc80269bb6a..49e8733a9d8a 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1370,6 +1370,17 @@ PAGE_SIZE multiple when read back. > The total amount of swap currently being used by the cgroup > and its descendants. > > + memory.swap.high > + A read-write single value file which exists on non-root > + cgroups. The default is "max". > + > + Swap usage throttle limit. If a cgroup's swap usage exceeds > + this limit, allocations inside the cgroup will be throttled. Hm, so this doesn't talk about which allocations are affected. This is good for potential future changes, but I am not sure it is useful for making any educated guess about the actual effects. One could expect that only those allocations which could contribute to future memory.swap usage would be affected. I fully realize that we do not want to be very specific, but I believe we want to provide something useful. I am sorry, but I do not have a good suggestion on how to make this better, mostly because I still struggle with how this should behave to be sane. I am also missing some information about what the user can actually do about this situation, and an explicit call-out that the throttling is not going away until the swap usage is shrunk and that the kernel is not capable of doing that on its own without help from userspace. This is really different from memory.high, which has means to deal with the excess and shrink it down in most cases. The following would clarify it for me: "Once the limit is exceeded it is expected that the userspace is going to act and either free up the swapped out space or tune the limit based on needs. The kernel itself is not able to do that on its own. " > + > + This slows down expansion of the group's memory footprint as > + it runs out of assigned swap space. Compare to memory.swap.max, > + which stops swapping abruptly and can provoke kernel OOM kills. > + > memory.swap.max > A read-write single value file which exists on non-root > cgroups. The default is "max".
On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote: > On Wed 22-04-20 10:15:14, Johannes Weiner wrote: > > On Wed, Apr 22, 2020 at 03:26:32PM +0200, Michal Hocko wrote: > > > That being said I believe our discussion is missing an important part. > > > There is no description of the swap.high semantic. What can user expect > > > when using it? > > > > Good point, we should include that in cgroup-v2.rst. How about this? > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index bcc80269bb6a..49e8733a9d8a 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1370,6 +1370,17 @@ PAGE_SIZE multiple when read back. > > The total amount of swap currently being used by the cgroup > > and its descendants. > > > > + memory.swap.high > > + A read-write single value file which exists on non-root > > + cgroups. The default is "max". > > + > > + Swap usage throttle limit. If a cgroup's swap usage exceeds > > + this limit, allocations inside the cgroup will be throttled. > > Hm, so this doesn't talk about which allocatios are affected. This is > good for potential future changes but I am not sure this is useful to > make any educated guess about the actual effects. One could expect that > only those allocations which could contribute to future memory.swap > usage. I fully realize that we do not want to be very specific but we > want to provide something useful I believe. I am sorry but I do not have > a good suggestion on how to make this better. Mostly because I still > struggle on how this should behave to be sane. I honestly don't really follow you here. Why is it not helpful to say all allocations will slow down when condition X is met? We do the same for memory.high. > I am also missing some information about what the user can actually do > about this situation and call out explicitly that the throttling is > not going away until the swap usage is shrunk and the kernel is not > capable of doing that on its own without a help from the userspace. This > is really different from memory.high which has means to deal with the > excess and shrink it down in most cases. The following would clarify it I think we may be talking past each other. The user can do the same thing as in any OOM situation: wait for the kill. Swap being full is an OOM situation. Yes, that does not match the kernel's internal definition of an OOM situation. But we've already established that kernel OOM killing has a different objective (memory deadlock avoidance) than userspace OOM killing (quality of life)[1] [1] https://lkml.org/lkml/2019/8/4/15 As Tejun said, things like earlyoom and oomd already kill based on swap exhaustion, no further questions asked. Reclaim has been running for a while, it went after all the low-hanging fruit: it doesn't swap as long as there is easy cache; it also didn't just swap a little, it filled up all of swap; and the pages in swap are all cold too, because refaults would free that space again. The workingset is hugely oversized for the available capacity, and nobody has any interest in sticking around to see what tricks reclaim still has up its sleeves (hint: nothing good). From here on out, it's all thrashing and pain. The kernel might not OOM kill yet, but the quality of life expectancy for a workload with full swap is trending toward zero. We've been killing based on swap exhaustion as a stand-alone trigger for several years now and it's never been the wrong call. 
All swap.high does is acknowledge that swap-full is a common OOM situation from a userspace view, and helps it handle that situation. Just like memory.high acknowledges that if reclaim fails per kernel definition, it's an OOM situation from a kernel view, and it helps userspace handle that. > for me > "Once the limit is exceeded it is expected that the userspace > is going to act and either free up the swapped out space > or tune the limit based on needs. The kernel itself is not > able to do that on its own. > " I mean, in rare cases, maybe userspace can do some loadshedding and be smart about it. But we certainly don't expect it to. Just like we don't expect it to when memory.high starts injecting sleeps. We expect the workload to die, usually.
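For concreteness, the swap-exhaustion trigger described here boils down to a loop like the following. This is a simplified sketch, not code from earlyoom or oomd; the 10% threshold and 1-second poll interval are arbitrary, the kill action is stubbed out, and a cgroup-aware daemon would also compare memory.swap.current against the limits of the groups it manages:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* parse a "Key:       12345 kB" line from /proc/meminfo */
static long meminfo_kb(const char *key)
{
        char line[128];
        long val = -1;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, key, strlen(key)) && line[strlen(key)] == ':') {
                        sscanf(line + strlen(key) + 1, "%ld", &val);
                        break;
                }
        }
        fclose(f);
        return val;
}

int main(void)
{
        const long min_free_pct = 10;   /* arbitrary "swap is nearly gone" threshold */

        for (;;) {
                long total_kb = meminfo_kb("SwapTotal");
                long free_kb = meminfo_kb("SwapFree");

                if (total_kb > 0 && free_kb >= 0 &&
                    free_kb * 100 / total_kb < min_free_pct) {
                        /* a real daemon would pick a victim and SIGKILL it here */
                        fprintf(stderr, "swap nearly exhausted: %ld of %ld kB free\n",
                                free_kb, total_kb);
                }
                sleep(1);
        }
        return 0;
}

The argument above is that this kind of stand-alone trigger has worked well in practice; swap.high only stretches the window in which it gets a chance to fire.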
On Wed 22-04-20 13:13:28, Johannes Weiner wrote: > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote: > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote: [...] > > > + Swap usage throttle limit. If a cgroup's swap usage exceeds > > > + this limit, allocations inside the cgroup will be throttled. > > > > Hm, so this doesn't talk about which allocatios are affected. This is > > good for potential future changes but I am not sure this is useful to > > make any educated guess about the actual effects. One could expect that > > only those allocations which could contribute to future memory.swap > > usage. I fully realize that we do not want to be very specific but we > > want to provide something useful I believe. I am sorry but I do not have > > a good suggestion on how to make this better. Mostly because I still > > struggle on how this should behave to be sane. > > I honestly don't really follow you here. Why is it not helpful to say > all allocations will slow down when condition X is met? This might be just me and I definitely do not want to pick on words here but your wording was not specific on which allocations. You can very well interpret that as really all allocations but I wouldn't be surprised if some would interpret it in a way that the kernel doesn't throttle unnecessarily and if allocations cannot really contribute to more swap then why should they be throttled. > We do the same for memory.high. > > > I am also missing some information about what the user can actually do > > about this situation and call out explicitly that the throttling is > > not going away until the swap usage is shrunk and the kernel is not > > capable of doing that on its own without a help from the userspace. This > > is really different from memory.high which has means to deal with the > > excess and shrink it down in most cases. The following would clarify it > > I think we may be talking past each other. The user can do the same > thing as in any OOM situation: wait for the kill. That assumes that reaching swap.high is going to converge to the OOM eventually. And that is far from the general case. There might be a lot of other reclaimable memory to reclaim and stay in the current state. [...] > > for me > > "Once the limit is exceeded it is expected that the userspace > > is going to act and either free up the swapped out space > > or tune the limit based on needs. The kernel itself is not > > able to do that on its own. > > " > > I mean, in rare cases, maybe userspace can do some loadshedding and be > smart about it. But we certainly don't expect it to. I really didn't mean to suggest any clever swap management. All I wanted so say and have documented is that users of swap.high should be aware of the fact that kernel is not able to do much to reduce the throttling. This is really different from memory.high where the kernel pro-actively tries to keep the memory usage below the watermark. So a certain level of userspace cooperation is really needed unless you can tolerate a workload to be throttled to the end of times. So let me be clear here. This is a very tricky interface to use and the more verbose we can be the better.
On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote: > On Wed 22-04-20 13:13:28, Johannes Weiner wrote: > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote: > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote: > > > I am also missing some information about what the user can actually do > > > about this situation and call out explicitly that the throttling is > > > not going away until the swap usage is shrunk and the kernel is not > > > capable of doing that on its own without a help from the userspace. This > > > is really different from memory.high which has means to deal with the > > > excess and shrink it down in most cases. The following would clarify it > > > > I think we may be talking past each other. The user can do the same > > thing as in any OOM situation: wait for the kill. > > That assumes that reaching swap.high is going to converge to the OOM > eventually. And that is far from the general case. There might be a > lot of other reclaimable memory to reclaim and stay in the current > state. No, that's really the general case. And that's based on what users widely experience, including us at FB. When swap is full, it's over. Multiple parties have independently reached this conclusion. This will be the default assumption in major distributions soon: https://fedoraproject.org/wiki/Changes/EnableEarlyoom > > > for me > > > "Once the limit is exceeded it is expected that the userspace > > > is going to act and either free up the swapped out space > > > or tune the limit based on needs. The kernel itself is not > > > able to do that on its own. > > > " > > > > I mean, in rare cases, maybe userspace can do some loadshedding and be > > smart about it. But we certainly don't expect it to. > > I really didn't mean to suggest any clever swap management. All I > wanted so say and have documented is that users of swap.high should > be aware of the fact that kernel is not able to do much to reduce the > throttling. This is really different from memory.high where the kernel > pro-actively tries to keep the memory usage below the watermark. So a > certain level of userspace cooperation is really needed unless you can > tolerate a workload to be throttled to the end of times. That's exactly what happens with memory.high. We've seen this. The workload can go into a crawl and just stay there. It's not unlike disabling the oom killer in cgroup1 without anybody handling it. With memory.high, workloads *might* recover, but you have to handle the ones that don't. Again, we inject sleeps into memory.high when reclaim *is not* pushing back the workload anymore, when reclaim is *failing*. The state isn't as stable as with oom_control=0, but these indefinite hangs really happen in practice. Realistically, you cannot use memory.high without an OOM manager. The asymmetry you see between memory.high and swap.high comes from the page cache. memory.high can set a stop to the mindless expansion of the file cache and remove *unused* cache pages from the application's workingset. It cannot permanently remove used cache pages, they'll just refault. So unused cache is where reclaim is useful. Once the workload expands its set of *used* pages past memory.high, we are talking about indefinite slowdowns / OOM situations. Because at that point, reclaim cannot push the workload back to a point where everything will be okay: the pages it takes off mean refaults and continued reclaim, i.e. throttling.
You get slowed down either way, and whether you reclaim or sleep() is - to the workload - an accounting difference. Reclaim does NOT have the power to help the workload get better. It can only do amputations to protect the rest of the system, but it cannot reduce the number of pages the workload is trying to access. The only sustainable way out of such a throttling situation is either an OOM kill or the workload voluntarily shrinking and reducing the total number of pages it uses. And doesn't that sound familiar? :-) The actual, observable effects of memory.high and swap.high semantics are much more similar than you think they are: When the workload's true workingset (not throwaway cache) grows past capacity (memory or swap), we slow down further expansion until it either changes its mind and shrinks, or userspace OOM handling takes care of it.
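To make the "accounting difference" concrete: the memory.high slowdown translates the amount by which a cgroup exceeds its threshold into forced sleeps that grow steeply with the overage, and the proposed swap.high applies the same kind of slowdown keyed on swap usage. The following is only an illustrative sketch of that shape; the fixed-point scaling, clamping and per-batch accounting in mm/memcontrol.c differ in detail:

/*
 * Illustrative overage-to-delay curve in the spirit of the memory.high
 * throttling discussed above.  Constants are invented for the example.
 */
#include <stdint.h>

#define MAX_PENALTY_MS  2000    /* cap the sleep injected per charge batch */

static unsigned int penalty_ms(uint64_t usage, uint64_t high)
{
        uint64_t overage, penalty;

        if (!high || usage <= high)
                return 0;

        /* overage as a fraction of the limit, in 1/1024 units */
        overage = ((usage - high) << 10) / high;

        /* grow quadratically so small excursions barely register */
        penalty = (overage * overage) >> 10;

        return penalty < MAX_PENALTY_MS ? (unsigned int)penalty : MAX_PENALTY_MS;
}

With a curve like this, being a few percent over the line costs a couple of milliseconds per charge batch, while a runaway that has doubled its allowance sleeps on the order of a second - gradual degradation, visible as pressure, rather than a cliff.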
On Thu 23-04-20 11:00:15, Johannes Weiner wrote: > On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote: > > On Wed 22-04-20 13:13:28, Johannes Weiner wrote: > > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote: > > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote: > > > > I am also missing some information about what the user can actually do > > > > about this situation and call out explicitly that the throttling is > > > > not going away until the swap usage is shrunk and the kernel is not > > > > capable of doing that on its own without a help from the userspace. This > > > > is really different from memory.high which has means to deal with the > > > > excess and shrink it down in most cases. The following would clarify it > > > > > > I think we may be talking past each other. The user can do the same > > > thing as in any OOM situation: wait for the kill. > > > > That assumes that reaching swap.high is going to converge to the OOM > > eventually. And that is far from the general case. There might be a > > lot of other reclaimable memory to reclaim and stay in the current > > state. > > No, that's really the general case. And that's based on what users > widely experience, including us at FB. When swap is full, it's over. > Multiple parties have independently reached this conclusion. But we are talking about two things. You seem to be focusing on the full swap (quota) while I am talking about swap.high which doesn't imply that the quota/full swap is going to be reached soon. [...] > The assymetry you see between memory.high and swap.high comes from the > page cache. memory.high can set a stop to the mindless expansion of > the file cache and remove *unused* cache pages from the application's > workingset. It cannot permanently remove used cache pages, they'll > just refault. So unused cache is where reclaim is useful. Exactly! And I have seen memory.high being used to throttle huge page cache producers to not disrupt other workloads. > Once the workload expands its set of *used* pages past memory.high, we > are talking about indefinite slowdowns / OOM situations. Because at > that point, reclaim cannot push the workload back and everything will > be okay: the pages it takes off mean refaults and continued reclaim, > i.e. throttling. You get slowed down either way, and whether you > reclaim or sleep() is - to the workload - an accounting difference. > > Reclaim does NOT have the power to help the workload get better. It > can only do amputations to protect the rest of the system, but it > cannot reduce the number of pages the workload is trying to access. Yes I do agree with you here and I believe this scenario wasn't really what the dispute is about. As soon as the real working set doesn't fit into the high limit and still growing then you are effectively OOM and either you do handle that from the userspace or you have to waaaaaaaaait for the kernel oom killer to trigger. But I believe this scenario is much easier to understand because the memory consumption is growing. What I find largely unintuitive from the user POV is that the throttling will remain in place without a userspace intervention even when there is no runaway. Let me give you an example. Say you have a peak load which pushes out a large part of an idle memory to swap. So much it fills up the swap.high. The peak eventually finishes freeing up its resources. 
The swap situation remains the same, because that memory is not refaulted and we do not proactively swap memory back in (i.e. reclaim the swap space). You are left with throttling even though the overall memcg consumption is really low. The kernel is currently not able to do anything about that, and userspace would need to be aware of the situation and fault the swapped-out memory back in to get normal behavior again. Do you think this is something so obvious that people would keep it in mind when using swap.high? Anyway, it seems that we are not making progress here. As I've said, I believe that swap.high might lead to surprising behavior, and therefore I would appreciate more clarity in the documentation. If you see a problem with that for some reason then I can live with it. This is not a reason to nack.
On Fri, Apr 24, 2020 at 05:05:10PM +0200, Michal Hocko wrote: > On Thu 23-04-20 11:00:15, Johannes Weiner wrote: > > On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote: > > > On Wed 22-04-20 13:13:28, Johannes Weiner wrote: > > > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote: > > > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote: > > > > > I am also missing some information about what the user can actually do > > > > > about this situation and call out explicitly that the throttling is > > > > > not going away until the swap usage is shrunk and the kernel is not > > > > > capable of doing that on its own without a help from the userspace. This > > > > > is really different from memory.high which has means to deal with the > > > > > excess and shrink it down in most cases. The following would clarify it > > > > > > > > I think we may be talking past each other. The user can do the same > > > > thing as in any OOM situation: wait for the kill. > > > > > > That assumes that reaching swap.high is going to converge to the OOM > > > eventually. And that is far from the general case. There might be a > > > lot of other reclaimable memory to reclaim and stay in the current > > > state. > > > > No, that's really the general case. And that's based on what users > > widely experience, including us at FB. When swap is full, it's over. > > Multiple parties have independently reached this conclusion. > > But we are talking about two things. You seem to be focusing on the full > swap (quota) while I am talking about swap.high which doesn't imply > that the quota/full swap is going to be reached soon. Hm, I'm not quite sure I understand. swap.high is supposed to set this quota. It's supposed to say: the workload has now shown such an appetite for swap that it's unlikely to survive for much longer - draw out its death just long enough for userspace OOM handling. Maybe this is our misunderstanding? It certainly doesn't make much sense to set swap.high to 0 or relatively low values. Should we add the above to the doc text? > > Once the workload expands its set of *used* pages past memory.high, we > > are talking about indefinite slowdowns / OOM situations. Because at > > that point, reclaim cannot push the workload back and everything will > > be okay: the pages it takes off mean refaults and continued reclaim, > > i.e. throttling. You get slowed down either way, and whether you > > reclaim or sleep() is - to the workload - an accounting difference. > > > > Reclaim does NOT have the power to help the workload get better. It > > can only do amputations to protect the rest of the system, but it > > cannot reduce the number of pages the workload is trying to access. > > Yes I do agree with you here and I believe this scenario wasn't really > what the dispute is about. As soon as the real working set doesn't > fit into the high limit and still growing then you are effectively > OOM and either you do handle that from the userspace or you have to > waaaaaaaaait for the kernel oom killer to trigger. > > But I believe this scenario is much easier to understand because the > memory consumption is growing. What I find largely unintuitive from the > user POV is that the throttling will remain in place without a userspace > intervention even when there is no runaway. > > Let me give you an example. Say you have a peak load which pushes > out a large part of an idle memory to swap. So much it fills up the > swap.high. The peak eventually finishes freeing up its resources. 
The > swap situation remains the same because that memory is not refaulted and > we do not pro-actively swap in memory (aka reclaim the swap space). You > are left with throttling even though the overall memcg consumption is > really low. Kernel is currently not able to do anything about that > and the userspace would need to be aware of the situation to fault in > swapped out memory back to get a normal behavior. Do you think this > is something so obvious that people would keep it in mind when using > swap.high? Okay, thanks for clarifying, I understand your concern now. This is not a scenario that swap.high is supposed to handle. It should *not* be set to an amount of memory that the workload can reasonably have sitting around idle. For example, if your memory allowance is 10G, it doesn't make sense to have swap.high at 200M or something. It should be set to "we don't expect healthy workloads to get here". And now I also understand what you mean by being different to memory.high. memory.high is definitely *expected* to get hit because of the cache trimming usecase. We just don't expect the *throttling* part to get into play unless the workload is truly unhealthy. But I can see how user expectations toward swap.high could be different. > Anyway, it seems that we are not making progress here. As I've said I > believe that swap.high might lead to a surprising behavior and therefore > I would appreciate more clarity in the documentation. If you see a > problem with that for some reason then I can live with that. This is not > a reason to nack. No, I agree we should document this. How about the following? memory.swap.high A read-write single value file which exists on non-root cgroups. The default is "max". Swap usage throttle limit. If a cgroup's swap usage exceeds this limit, all its further allocations will be throttled to allow userspace to implement custom out-of-memory procedures. This limit marks a point of no return for the cgroup. It is NOT designed to manage the amount of swapping a workload does during regular operation. Compare to memory.swap.max, which prohibits swapping past a set amount, but lets the cgroup continue unimpeded as long as other memory can be reclaimed.
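A small usage note on the text above: like the existing memory.swap.max, the new file would be configured by writing "max" or a byte value (memparse suffixes such as G presumably accepted, matching the other memory interface files), e.g. writing "10G" into /sys/fs/cgroup/workload.slice/memory.swap.high - path and value purely illustrative.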
On Tue 28-04-20 10:24:32, Johannes Weiner wrote: > On Fri, Apr 24, 2020 at 05:05:10PM +0200, Michal Hocko wrote: > > On Thu 23-04-20 11:00:15, Johannes Weiner wrote: > > > On Wed, Apr 22, 2020 at 08:49:21PM +0200, Michal Hocko wrote: > > > > On Wed 22-04-20 13:13:28, Johannes Weiner wrote: > > > > > On Wed, Apr 22, 2020 at 05:43:18PM +0200, Michal Hocko wrote: > > > > > > On Wed 22-04-20 10:15:14, Johannes Weiner wrote: > > > > > > I am also missing some information about what the user can actually do > > > > > > about this situation and call out explicitly that the throttling is > > > > > > not going away until the swap usage is shrunk and the kernel is not > > > > > > capable of doing that on its own without a help from the userspace. This > > > > > > is really different from memory.high which has means to deal with the > > > > > > excess and shrink it down in most cases. The following would clarify it > > > > > > > > > > I think we may be talking past each other. The user can do the same > > > > > thing as in any OOM situation: wait for the kill. > > > > > > > > That assumes that reaching swap.high is going to converge to the OOM > > > > eventually. And that is far from the general case. There might be a > > > > lot of other reclaimable memory to reclaim and stay in the current > > > > state. > > > > > > No, that's really the general case. And that's based on what users > > > widely experience, including us at FB. When swap is full, it's over. > > > Multiple parties have independently reached this conclusion. > > > > But we are talking about two things. You seem to be focusing on the full > > swap (quota) while I am talking about swap.high which doesn't imply > > that the quota/full swap is going to be reached soon. > > Hm, I'm not quite sure I understand. swap.high is supposed to set this > quota. It's supposed to say: the workload has now shown such an > appetite for swap that it's unlikely to survive for much longer - draw > out its death just long enough for userspace OOM handling. > > Maybe this is our misunderstanding? Probably. We already have a quota for swap (swap.max). Workload is not allowed to swap out when the quota is reached. swap.high is supposed to act as a preliminary action towards slowing down swap consumption beyond its limit. > It certainly doesn't make much sense to set swap.high to 0 or > relatively low values. Should we add the above to the doc text? > > > > Once the workload expands its set of *used* pages past memory.high, we > > > are talking about indefinite slowdowns / OOM situations. Because at > > > that point, reclaim cannot push the workload back and everything will > > > be okay: the pages it takes off mean refaults and continued reclaim, > > > i.e. throttling. You get slowed down either way, and whether you > > > reclaim or sleep() is - to the workload - an accounting difference. > > > > > > Reclaim does NOT have the power to help the workload get better. It > > > can only do amputations to protect the rest of the system, but it > > > cannot reduce the number of pages the workload is trying to access. > > > > Yes I do agree with you here and I believe this scenario wasn't really > > what the dispute is about. As soon as the real working set doesn't > > fit into the high limit and still growing then you are effectively > > OOM and either you do handle that from the userspace or you have to > > waaaaaaaaait for the kernel oom killer to trigger. 
> > > > But I believe this scenario is much easier to understand because the > > memory consumption is growing. What I find largely unintuitive from the > > user POV is that the throttling will remain in place without a userspace > > intervention even when there is no runaway. > > > > Let me give you an example. Say you have a peak load which pushes > > out a large part of an idle memory to swap. So much it fills up the > > swap.high. The peak eventually finishes freeing up its resources. The > > swap situation remains the same because that memory is not refaulted and > > we do not pro-actively swap in memory (aka reclaim the swap space). You > > are left with throttling even though the overall memcg consumption is > > really low. Kernel is currently not able to do anything about that > > and the userspace would need to be aware of the situation to fault in > > swapped out memory back to get a normal behavior. Do you think this > > is something so obvious that people would keep it in mind when using > > swap.high? > > Okay, thanks for clarifying, I understand your concern now. Great that we are on the same page! [...] > No, I agree we should document this. How about the following? > > memory.swap.high > A read-write single value file which exists on non-root > cgroups. The default is "max". > > Swap usage throttle limit. If a cgroup's swap usage exceeds > this limit, all its further allocations will be throttled to > allow userspace to implement custom out-of-memory procedures. > > This limit marks a point of no return for the cgroup. It is NOT > designed to manage the amount of swapping a workload does > during regular operation. Compare to memory.swap.max, which > prohibits swapping past a set amount, but lets the cgroup > continue unimpeded as long as other memory can be reclaimed. OK, this makes the intended use much clearer. I believe it would also be helpful to add your note that the value should be set so that "we don't expect healthy workloads to get here". The use case is quite narrow, and I expect people will start asking for something to help manage the swap space somehow; this will not be a good fit for that. It would require much more work to achieve sane semantics, though. I am not aware of such use cases at this moment, so this is really hard to argue about. I hope this will not backfire when we reach that point, though. That being said, I am not a huge fan of the new interface, but I can see how it can be useful. I will not ack the patchset, but I will not block it either. Thanks for refining the documentation, and please make sure that the changelogs in the next version describe the intended use case as mentioned in this email thread.