Message ID: 20230720070825.992023-1-yosryahmed@google.com (mailing list archive)
Series: memory recharging for offline memcgs
On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote: > This patch series implements the proposal in LSF/MM/BPF 2023 conference > for reducing offline/zombie memcgs by memory recharging [1]. The main > difference is that this series focuses on recharging and does not > include eviction of any memory charged to offline memcgs. > > Two methods of recharging are proposed: > > (a) Recharging of mapped folios. > > When a memcg is offlined, queue an asynchronous worker that will walk > the lruvec of the offline memcg and try to recharge any mapped folios to > the memcg of one of the processes mapping the folio. The main assumption > is that a process mapping the folio is the "rightful" owner of the > memory. > > Currently, this is only supported for evictable folios, as the > unevictable lru is imaginary and we cannot iterate the folios on it. A > separate proposal [2] was made to revive the unevictable lru, which > would allow recharging of unevictable folios. > > (b) Deferred recharging of folios. > > For folios that are unmapped, or mapped but we fail to recharge them > with (a), we rely on deferred recharging. Simply put, any time a folio > is accessed or dirtied by a userspace process, and that folio is charged > to an offline memcg, we will try to recharge it to the memcg of the > process accessing the folio. Again, we assume this process should be the > "rightful" owner of the memory. This is also done asynchronously to avoid > slowing down the data access path. I'm super skeptical of this proposal. Recharging *might* be the most desirable semantics from a user pov, but only if it applies consistently to the whole memory footprint. There is no mention of slab allocations such as inodes, dentries, network buffers etc. which can be a significant part of a cgroup's footprint. These are currently reparented. I don't think doing one thing with half of the memory, and a totally different thing with the other half upon cgroup deletion is going to be acceptable semantics. It appears this also brings back the reliability issue that caused us to deprecate charge moving. The recharge path has trylocks, LRU isolation attempts, GFP_ATOMIC allocations. These introduce a variable error rate into the relocation process, which causes pages that should belong to the same domain to be scattered around all over the place. It also means that zombie pinning still exists, but it's now even more influenced by timing and race conditions, and so less predictable. There are two issues being conflated here: a) the problem of zombie cgroups, and b) who controls resources that outlive the control domain. For a), reparenting is still the most reasonable proposal. It's reliable for one, but it also fixes the problem fully within the established, user-facing semantics: resources that belong to a cgroup also hierarchically belong to all ancestral groups; if those resources outlive the last-level control domain, they continue to belong to the parents. This is how it works today, and this is how it continues to work with reparenting. The only difference is that those resources no longer pin a dead cgroup anymore, but instead are physically linked to the next online ancestor. Since dead cgroups have no effective control parameters anymore, this is semantically equivalent - it's just a more memory efficient implementation of the same exact thing. b) is a discussion totally separate from this. 
We can argue what we want this behavior to be, but I'd argue strongly that whatever we do here should apply to all resources managed by the controller equally. It could also be argued that if you don't want to lose control over a set of resources, then maybe don't delete their control domain while they are still alive and in use. For example, when restarting a workload, and the new instance is expected to have largely the same workingset, consider reusing the cgroup instead of making a new one. For the zombie problem, I think we should merge Muchun's patches ASAP. They've been proposed several times, they have Roman's reviews and acks, and they do not change user-facing semantics. There is no good reason not to merge them.
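As context for the zombie problem discussed in this message: the number of offlined-but-still-pinned cgroups is visible from userspace, since cgroup v2's cgroup.stat file reports nr_dying_descendants. A minimal sketch that prints it, assuming cgroup v2 is mounted at /sys/fs/cgroup:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/cgroup.stat", "r");
	char key[64];
	unsigned long long val;

	if (!f) {
		perror("cgroup.stat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		/* nr_dying_descendants counts offlined but still pinned cgroups */
		if (!strcmp(key, "nr_dying_descendants"))
			printf("dying (zombie) cgroups: %llu\n", val);
	}
	fclose(f);
	return 0;
}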
On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote: > It could also be argued that if you don't want to lose control over a > set of resources, then maybe don't delete their control domain while > they are still alive and in use. For example, when restarting a > workload, and the new instance is expected to have largely the same > workingset, consider reusing the cgroup instead of making a new one. Or just create a nesting layer so that there's a cgroup which represents the persistent resources and a nested cgroup instance inside representing the current instance. Thanks.
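To make the suggested layout concrete, here is a minimal sketch of such a nesting, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and sufficient privileges; the "workload.persistent" and "instance-N" names are purely illustrative:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(int argc, char **argv)
{
	const char *base = "/sys/fs/cgroup/workload.persistent";
	char inst[256], procs[320], pid[32];

	/* Parent cgroup that outlives individual instances; charges for
	 * shared/persistent memory (tmpfs, caches, ...) stay here. */
	if (mkdir(base, 0755) && errno != EEXIST) {
		perror(base);
		return 1;
	}

	/* Nested cgroup for the current instance; on restart, remove and
	 * recreate only this one and keep the parent. */
	snprintf(inst, sizeof(inst), "%s/instance-%s", base,
		 argc > 1 ? argv[1] : "0");
	if (mkdir(inst, 0755) && errno != EEXIST) {
		perror(inst);
		return 1;
	}

	/* Move ourselves (and future children) into the instance cgroup. */
	snprintf(procs, sizeof(procs), "%s/cgroup.procs", inst);
	snprintf(pid, sizeof(pid), "%d\n", getpid());
	write_str(procs, pid);
	return 0;
}

On a restart, only the instance cgroup is removed and recreated; charges for persistent resources remain attributed to the surviving parent instead of pinning a dead cgroup.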
On Thu, Jul 20, 2023 at 8:35 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote: > > This patch series implements the proposal in LSF/MM/BPF 2023 conference > > for reducing offline/zombie memcgs by memory recharging [1]. The main > > difference is that this series focuses on recharging and does not > > include eviction of any memory charged to offline memcgs. > > > > Two methods of recharging are proposed: > > > > (a) Recharging of mapped folios. > > > > When a memcg is offlined, queue an asynchronous worker that will walk > > the lruvec of the offline memcg and try to recharge any mapped folios to > > the memcg of one of the processes mapping the folio. The main assumption > > is that a process mapping the folio is the "rightful" owner of the > > memory. > > > > Currently, this is only supported for evictable folios, as the > > unevictable lru is imaginary and we cannot iterate the folios on it. A > > separate proposal [2] was made to revive the unevictable lru, which > > would allow recharging of unevictable folios. > > > > (b) Deferred recharging of folios. > > > > For folios that are unmapped, or mapped but we fail to recharge them > > with (a), we rely on deferred recharging. Simply put, any time a folio > > is accessed or dirtied by a userspace process, and that folio is charged > > to an offline memcg, we will try to recharge it to the memcg of the > > process accessing the folio. Again, we assume this process should be the > > "rightful" owner of the memory. This is also done asynchronously to avoid > > slowing down the data access path. > > I'm super skeptical of this proposal. I expected this :) > > Recharging *might* be the most desirable semantics from a user pov, > but only if it applies consistently to the whole memory footprint. > There is no mention of slab allocations such as inodes, dentries, > network buffers etc. which can be a significant part of a cgroup's > footprint. These are currently reparented. I don't think doing one > thing with half of the memory, and a totally different thing with the > other half upon cgroup deletion is going to be acceptable semantics. I think, as you say, recharging has the most desirable semantics because the charge is maintained where it *should* be (with who is actually using it). We simply cannot do that for kernel memory, because we have no way of attributing it to a user. On the other hand, we *can* attribute user memory to a user. Consistency is great, but our inability to do (arguably) the right thing for one type of memory, doesn't mean we shouldn't do it when we can. I would also argue that user memory (anon/file pages) would commonly be the larger portion of memory on a machine compared to kernel memory (e.g. slab). > > It appears this also brings back the reliability issue that caused us > to deprecate charge moving. The recharge path has trylocks, LRU > isolation attempts, GFP_ATOMIC allocations. These introduce a variable > error rate into the relocation process, Recharging is naturally best effort, because it's non-disruptive. After a memcg dies, the kernel continuously tries to move the charges away from it on every chance it gets. If it fails one time that's fine, there will be other chances. Compared to the status quo, it is definitely better than just leaving all the memory behind with the zombie memcg. I would argue that over time (and accesses), most/all memory should eventually get recharged. 
If not, something is not working correctly, or a wrong assumption is being made. > which causes pages that should > belong to the same domain to be scattered around all over the place. I strongly disagree with this point. Ideally, yes, memory charged to a memcg would belong to the same domain. In practice, due to the first-touch charging semantics, this is far from the truth. For anonymous memory, sure, it (mostly) all belongs to the same domain, the process it belongs to. But most anonymous memory will go away when the process dies anyway; the problem is mostly with shared resources (e.g. file, tmpfs, ..). With file/tmpfs memory, the charging behavior is random. The first memcg that touches a page gets charged for it. Consequently, the file/tmpfs memory charged to a memcg would be a mixture of pages from different files in different mounts, definitely not a single domain. Perhaps with some workloads, where each memcg is accessing different files, most memory charged to a memcg will belong to the same domain, but in this case, recharging wouldn't move it away anyway. > It also means that zombie pinning still exists, but it's now even more > influenced by timing and race conditions, and so less predictable. It still exists, but it is improved. The kernel tries to move charges away from zombies on every chance it gets instead of doing nothing about it. It is less predictable, can't argue about this, but it can't get worse, only better. > > There are two issues being conflated here: > > a) the problem of zombie cgroups, and > > b) who controls resources that outlive the control domain. > > For a), reparenting is still the most reasonable proposal. It's > reliable for one, but it also fixes the problem fully within the > established, user-facing semantics: resources that belong to a cgroup > also hierarchically belong to all ancestral groups; if those resources > outlive the last-level control domain, they continue to belong to the > parents. This is how it works today, and this is how it continues to > work with reparenting. The only difference is that those resources no > longer pin a dead cgroup anymore, but instead are physically linked to > the next online ancestor. Since dead cgroups have no effective control > parameters anymore, this is semantically equivalent - it's just a more > memory efficient implementation of the same exact thing. I agree that reparenting is more deterministic and reliable, but there are two major flaws off the top of my head: (1) If a memcg touches a page one time and gets charged for it, the charge is stuck in its hierarchy forever. It can get reparented, but it will never be charged to whoever is actually using it again, unless it is reclaimed and refaulted (in some cases). Consider this hierarchy: root has two children, A and B, and C is a child of B. Consider a case where memcg C touches a library file once, and gets charged for some memory, and then dies. The memory gets reparented to memcg B. Meanwhile, memcg A is continuously using the memory that memcg B is charged for; memcg B would be indefinitely taxed for memcg A's usage. The only way out is if memcg B hits its limit, and the pages get reclaimed, and then refaulted and recharged to memcg A. In some cases (e.g. tmpfs), even then the memory would still get charged to memcg B. There is no way to get rid of the charge until the resource itself is freed. This problem exists today, even without reparenting, with the difference being that the charge will remain with C instead of B.
Recharging offers a better alternative where the charge will be correctly moved to A, the "rightful" owner. (2) In the above scenario, when memcg B dies, the memory will be reparented to the root. That's even worse. Now memcg A is using memory that is not accounted for anywhere, essentially an accounting leak. From an admin perspective, the memory charged to root is system overhead, it is lost capacity. For long-living systems, as memcgs are created and destroyed for different workloads, memory will keep accumulating at the root. The machine will keep leaking capacity over time, and accounting becomes less and less accurate as more memory becomes charged to the root. > > b) is a discussion totally separate from this. I would argue that the zombie problem is (at least partially) an artifact of the shared/sticky resources problem. If all resources are used by one memcg and do not outlive it, we wouldn't have zombies. > We can argue what we > want this behavior to be, but I'd argue strongly that whatever we do > here should apply to all resources managed by the controller equally. User memory and kernel memory are very different in nature. Ideally yeah, we want to treat all resources equally. But user memory is naturally more attributable to users and easier to account correctly than kernel memory. > > It could also be argued that if you don't want to lose control over a > set of resources, then maybe don't delete their control domain while > they are still alive and in use. This is easier said than done :) As I mentioned earlier, the charging semantics are inherently indeterministic for shared resources (e.g. file/tmpfs). The user cannot control or monitor which resources belong to which control domain. Each memcg in the system could be charged for one page from each file in a shared library for all that matters :) > For example, when restarting a > workload, and the new instance is expected to have largely the same > workingset, consider reusing the cgroup instead of making a new one. In a large fleet with many different jobs getting rescheduled and restarted on different machines, it's really hard in practice to do so. We can keep the same cgroup if the same workload is being restarted on the same machine, sure, but most of the time there's a new workload arriving or so. We can't reuse containers in this case. > > For the zombie problem, I think we should merge Muchun's patches > ASAP. They've been proposed several times, they have Roman's reviews > and acks, and they do not change user-facing semantics. There is no > good reason not to merge them. There are some, which I pointed out above. All in all, I understand where you are coming from. Your concerns are valid. Recharging is not a perfect approach, but it is arguably the best we can do at this point. Being indeterministic sucks, but our charging semantics are inherently indeterministic anyway.
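To illustrate the first-touch charging behavior described in this message, the sketch below reads a file and reports how much its own cgroup's memory.current grew. Run from one cgroup against a cold file it shows a large increase; run afterwards from another cgroup against the now-cached file it shows almost none, i.e. the second cgroup uses memory whose charge stays with the first. This assumes a pure cgroup v2 setup mounted at /sys/fs/cgroup and a non-root cgroup; the file name is an arbitrary example:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long long memory_current(void)
{
	char line[512], path[600];
	const char *cg;
	long long val = -1;
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	line[strcspn(line, "\n")] = '\0';
	/* on cgroup v2 the line looks like "0::/path/to/cgroup" */
	cg = strchr(line, '/');
	snprintf(path, sizeof(path), "/sys/fs/cgroup%s/memory.current",
		 cg ? cg : "");
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%lld", &val) != 1)
			val = -1;
		fclose(f);
	}
	return val;
}

int main(int argc, char **argv)
{
	char buf[1 << 16];
	long long before, after;
	int fd = open(argc > 1 ? argv[1] : "datafile", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	before = memory_current();
	while (read(fd, buf, sizeof(buf)) > 0)
		;	/* fault the whole file into the page cache */
	after = memory_current();
	printf("memory.current grew by %lld bytes\n", after - before);
	close(fd);
	return 0;
}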
On Thu, Jul 20, 2023 at 12:57 PM Tejun Heo <tj@kernel.org> wrote: > > On Thu, Jul 20, 2023 at 11:35:15AM -0400, Johannes Weiner wrote: > > It could also be argued that if you don't want to lose control over a > > set of resources, then maybe don't delete their control domain while > > they are still alive and in use. For example, when restarting a > > workload, and the new instance is expected to have largely the same > > workingset, consider reusing the cgroup instead of making a new one. > > Or just create a nesting layer so that there's a cgroup which represents the > persistent resources and a nested cgroup instance inside representing the > current instance. In practice it is not easy to know exactly which resources are shared and used by which cgroups, especially in a large dynamic environment. > > Thanks. > > -- > tejun
Hello, On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote: > > Or just create a nesting layer so that there's a cgroup which represents the > > persistent resources and a nested cgroup instance inside representing the > > current instance. > > In practice it is not easy to know exactly which resources are shared > and used by which cgroups, especially in a large dynamic environment. Yeah, that only covers cases where resource persistence is confined to a known scope. That said, I have a hard time seeing how recharging once after cgroup destruction can be a solution for the situations you describe. What if A touches it once first, B constantly uses it, but C only very occasionally, and after A dies C ends up owning it due to timing? This is very much possible in a large dynamic environment, but neither the initial nor the final situation is satisfactory. To solve the problems you're describing, you actually would have to guarantee that memory pages are charged to the current majority user (or maybe even spread across current active users). Maybe it can be argued that this is a step towards that, but it's a very partial step and at least would need a technically viable direction that this development can follow. On its own, AFAICS, I'm not sure the scope of problems it can actually solve is justifiably greater than what can be achieved with simple nesting. Thanks.
On Thu, Jul 20, 2023 at 3:12 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote: > > > Or just create a nesting layer so that there's a cgroup which represents the > > > persistent resources and a nested cgroup instance inside representing the > > > current instance. > > > > In practice it is not easy to know exactly which resources are shared > > and used by which cgroups, especially in a large dynamic environment. > > Yeah, that only covers when resource persistence is confined in a known > scope. That said, I have a hard time seeing how recharging once after cgroup > destruction can be a solution for the situations you describe. What if A > touches it once first, B constantly uses it but C only very occasionally and > after A dies C ends up owning it due to timing. This is very much possible > in a large dynamic environment but neither the initial nor final situation is > satisfactory. That is indeed possible, but it would be more likely that the charge is moved to B. As I said, it's not perfect, but it is an improvement over what we have today. Even if C ends up owning it, it's better than staying with the dead A. > > To solve the problems you're describing, you actually would have to > guarantee that memory pages are charged to the current majority user (or > maybe even spread across current active users). Maybe it can be argued that > this is a step towards that but it's a very partial step and at least would > need a technically viable direction that this development can follow. Right, that would be a much larger effort (arguably memcg v3 ;) ). This proposal is focused on the painful artifact of the sharing/sticky resources problem: zombie memcgs. We can extend the automatic charge movement semantics later to cover more cases or be smarter, or ditch the existing charging semantics completely and start over with sharing/stickiness in mind. Either way, that would be a long-term effort. There is a problem that exists today, though, that ideally can be fixed/improved by this proposal. > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve > is justifiably greater than what can be achieved with simple nesting. In our use case nesting is not a viable option. As I said, in a large fleet where a lot of different workloads are dynamically being scheduled on different machines, where there is no way of knowing which resources are being shared among which workloads, and where even if we did know, it wouldn't stay constant, it's very difficult to construct a nested hierarchy that keeps the resources confined. Keep in mind that the environment is dynamic, workloads are constantly coming and going. Even if we find the perfect nesting to appropriately scope resources, some rescheduling may render the hierarchy obsolete and require us to start over. > > Thanks. > > -- > tejun
Hello, On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote: > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve > > is justifiably greater than what can be achieved with simple nesting. > > In our use case nesting is not a viable option. As I said, in a large > fleet where a lot of different workloads are dynamically being > scheduled on different machines, and where there is no way of knowing > what resources are being shared among what workloads, and even if we > do, it wouldn't be constant, it's very difficult to construct the > hierarchy with nesting to keep the resources confined. Hmm... so, usually, the problems we see are resources that are persistent across different instances of the same application as they may want to share large chunks of memory like on-memory cache. I get that machines get different dynamic jobs but unrelated jobs usually don't share huge amount of memory at least in our case. The sharing across them comes down to things like some common library pages which don't really account for much these days. > Keep in mind that the environment is dynamic, workloads are constantly > coming and going. Even if find the perfect nesting to appropriately > scope resources, some rescheduling may render the hierarchy obsolete > and require us to start over. Can you please go into more details on how much memory is shared for what across unrelated dynamic workloads? That sounds different from other use cases. Thanks.
On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote: > > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve > > > is justifiably greater than what can be achieved with simple nesting. > > > > In our use case nesting is not a viable option. As I said, in a large > > fleet where a lot of different workloads are dynamically being > > scheduled on different machines, and where there is no way of knowing > > what resources are being shared among what workloads, and even if we > > do, it wouldn't be constant, it's very difficult to construct the > > hierarchy with nesting to keep the resources confined. > > Hmm... so, usually, the problems we see are resources that are persistent > across different instances of the same application as they may want to share > large chunks of memory like on-memory cache. I get that machines get > different dynamic jobs but unrelated jobs usually don't share huge amount of > memory at least in our case. The sharing across them comes down to things > like some common library pages which don't really account for much these > days. > This has also been my experience in terms of bytes of memory that are incorrectly charged (because they're charged to a zombie), but that is because memcg doesn't currently track the large shared allocations in my case (primarily dma-buf). The greater issue I've seen so far is the number of zombie cgroups that can accumulate over time. But my understanding is that both of these two problems are currently significant for Yosry's case.
Hello, On Thu, Jul 20, 2023 at 04:24:02PM -0700, T.J. Mercier wrote: > > Hmm... so, usually, the problems we see are resources that are persistent > > across different instances of the same application as they may want to share > > large chunks of memory like on-memory cache. I get that machines get > > different dynamic jobs but unrelated jobs usually don't share huge amount of > > memory at least in our case. The sharing across them comes down to things > > like some common library pages which don't really account for much these > > days. > > > This has also been my experience in terms of bytes of memory that are > incorrectly charged (because they're charged to a zombie), but that is > because memcg doesn't currently track the large shared allocations in > my case (primarily dma-buf). The greater issue I've seen so far is the > number of zombie cgroups that can accumulate over time. But my > understanding is that both of these two problems are currently > significant for Yosry's case. memcg already does reparenting of slab pages to lower the number of dying cgroups, and maybe it makes sense to expand that to user memory too. One related thing is that if those reparented pages are written to, that's gonna break IO isolation w/ blk-iocost, because iocost currently bypasses IOs from intermediate cgroups to root, but we can fix that. Anyways, that's something pretty different from what's proposed here. Reparenting, I think, is a lot less controversial. Thanks.
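For anyone trying to gauge how a given cgroup's footprint splits between the slab memory that is already reparented today and the anon/file memory under discussion, cgroup v2's memory.stat provides the breakdown. A minimal sketch, with the cgroup path passed on the command line:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[512], key[64];
	unsigned long long val, anon = 0, file = 0, slab = 0;
	FILE *f;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /sys/fs/cgroup/<group>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "%s/memory.stat", argv[1]);
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	/* memory.stat is a flat "key value" list in bytes */
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "anon"))
			anon = val;
		else if (!strcmp(key, "file"))
			file = val;
		else if (!strcmp(key, "slab"))
			slab = val;
	}
	fclose(f);

	printf("user (anon+file): %llu bytes, slab: %llu bytes\n",
	       anon + file, slab);
	return 0;
}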
On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote: > This patch series implements the proposal in LSF/MM/BPF 2023 conference > for reducing offline/zombie memcgs by memory recharging [1]. The main > difference is that this series focuses on recharging and does not > include eviction of any memory charged to offline memcgs. > > Two methods of recharging are proposed: > > (a) Recharging of mapped folios. > > When a memcg is offlined, queue an asynchronous worker that will walk > the lruvec of the offline memcg and try to recharge any mapped folios to > the memcg of one of the processes mapping the folio. The main assumption > is that a process mapping the folio is the "rightful" owner of the > memory. > > Currently, this is only supported for evictable folios, as the > unevictable lru is imaginary and we cannot iterate the folios on it. A > separate proposal [2] was made to revive the unevictable lru, which > would allow recharging of unevictable folios. > > (b) Deferred recharging of folios. > > For folios that are unmapped, or mapped but we fail to recharge them > with (a), we rely on deferred recharging. Simply put, any time a folio > is accessed or dirtied by a userspace process, and that folio is charged > to an offline memcg, we will try to recharge it to the memcg of the > process accessing the folio. Again, we assume this process should be the > "rightful" owner of the memory. This is also done asynchronously to avoid > slowing down the data access path. Unfortunately I have to agree with Johannes, Tejun and others who are not big fans of this approach. Lazy recharging leads to an interesting phenomenon: the memory usage of a running workload may suddenly go up only because some other workload is terminated and now its memory is being recharged. I find it confusing. It also makes it hard to set up limits and/or guarantees. In general, I don't think we can handle shared memory well without getting rid of the "whoever allocates a page, pays the full price" policy and making shared ownership a fully supported concept. Of course, that is a huge amount of work, and I believe the only way we can achieve it is to compromise on the granularity of the accounting. Whether the resulting system will be better in real life is hard to say in advance. Thanks!
On Thu, Jul 20, 2023 at 5:02 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote: > > This patch series implements the proposal in LSF/MM/BPF 2023 conference > > for reducing offline/zombie memcgs by memory recharging [1]. The main > > difference is that this series focuses on recharging and does not > > include eviction of any memory charged to offline memcgs. > > > > Two methods of recharging are proposed: > > > > (a) Recharging of mapped folios. > > > > When a memcg is offlined, queue an asynchronous worker that will walk > > the lruvec of the offline memcg and try to recharge any mapped folios to > > the memcg of one of the processes mapping the folio. The main assumption > > is that a process mapping the folio is the "rightful" owner of the > > memory. > > > > Currently, this is only supported for evictable folios, as the > > unevictable lru is imaginary and we cannot iterate the folios on it. A > > separate proposal [2] was made to revive the unevictable lru, which > > would allow recharging of unevictable folios. > > > > (b) Deferred recharging of folios. > > > > For folios that are unmapped, or mapped but we fail to recharge them > > with (a), we rely on deferred recharging. Simply put, any time a folio > > is accessed or dirtied by a userspace process, and that folio is charged > > to an offline memcg, we will try to recharge it to the memcg of the > > process accessing the folio. Again, we assume this process should be the > > "rightful" owner of the memory. This is also done asynchronously to avoid > > slowing down the data access path. > > Unfortunately I have to agree with Johannes, Tejun and others who are not big > fans of this approach. > > Lazy recharging leads to an interesting phenomena: a memory usage of a running > workload may suddenly go up only because some other workload is terminated and > now it's memory is being recharged. I find it confusing. It also makes hard > to set up limits and/or guarantees. This can happen today. If memcg A starts accessing some memory and gets charged for it, and then memcg B also accesses it, it will not be charged for it. If at a later point memcg A runs into reclaim, and the memory is freed, then memcg B tries to access it, its usage will suddenly go up as well, because some other workload experienced reclaim. This is a very similar scenario, only instead of reclaim, the memcg was offlined. As a matter of fact, it's common to try to free up a memcg before removing it (by lowering the limit or using memory.reclaim). In that case, the net result would be exactly the same -- with the difference being that recharging will avoid freeing the memory and faulting it back in. > > In general, I don't think we can handle shared memory well without getting rid > of "whoever allocates a page, pays the full price" policy and making a shared > ownership a fully supported concept. Of course, it's a huge work and I believe > the only way we can achieve it is to compromise on the granularity of the > accounting. Will the resulting system be better in the real life, it's hard to > say in advance. > > Thanks!
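As a concrete version of the "free up a memcg before removing it" practice mentioned above, here is a minimal sketch assuming cgroup v2 with the memory.reclaim interface (v5.19 or later); it deliberately asks for a huge amount so the kernel reclaims whatever it can, and the write is expected to fail with EAGAIN once nothing more is reclaimable:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[512];
	const char *amount = "9223372036854771712\n";	/* "everything" */
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s /sys/fs/cgroup/<group>\n", argv[0]);
		return 1;
	}

	/* Ask the kernel to reclaim as much of the cgroup's memory as it
	 * can; the write fails with EAGAIN when the target can't be met,
	 * which is expected when asking for more than is reclaimable. */
	snprintf(path, sizeof(path), "%s/memory.reclaim", argv[1]);
	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror(path);
		return 1;
	}
	if (write(fd, amount, strlen(amount)) < 0)
		perror("memory.reclaim");
	close(fd);

	/* Only then remove the (hopefully mostly empty) control domain. */
	if (rmdir(argv[1]))
		perror("rmdir");
	return 0;
}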
On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Thu, Jul 20, 2023 at 03:23:59PM -0700, Yosry Ahmed wrote: > > > On its own, AFAICS, I'm not sure the scope of problems it can actually solve > > > is justifiably greater than what can be achieved with simple nesting. > > > > In our use case nesting is not a viable option. As I said, in a large > > fleet where a lot of different workloads are dynamically being > > scheduled on different machines, and where there is no way of knowing > > what resources are being shared among what workloads, and even if we > > do, it wouldn't be constant, it's very difficult to construct the > > hierarchy with nesting to keep the resources confined. > > Hmm... so, usually, the problems we see are resources that are persistent > across different instances of the same application as they may want to share > large chunks of memory like on-memory cache. I get that machines get > different dynamic jobs but unrelated jobs usually don't share huge amount of I am digging deeper to get more information for you. One thing I know now is that different instances of the same job are contained within a common parent, and we even use our previously proposed memcg= mount option for tmpfs to charge their shared resources to a common parent. So restarting tasks is not the problem we are seeing. > memory at least in our case. The sharing across them comes down to things > like some common library pages which don't really account for much these > days. Keep in mind that even a single page charged to a memcg and used by another memcg is sufficient to result in a zombie memcg. > > > Keep in mind that the environment is dynamic, workloads are constantly > > coming and going. Even if find the perfect nesting to appropriately > > scope resources, some rescheduling may render the hierarchy obsolete > > and require us to start over. > > Can you please go into more details on how much memory is shared for what > across unrelated dynamic workloads? That sounds different from other use > cases. I am trying to collect more information from our fleet, but the application restarting in a different cgroup is not what is happening in our case. It is not easy to find out exactly what is going on on machines and where the memory is coming from due to the indeterministic nature of charging. The goal of this proposal is to let the kernel handle leftover memory in zombie memcgs because it is not always obvious to userspace what's going on (like it's not obvious to me now where exactly is the sharing happening :) ). One thing to note is that in some cases, maybe a userspace bug or failed cleanup is a reason for the zombie memcgs. Ideally, this wouldn't happen, but it would be nice to have a fallback mechanism in the kernel if it does. > > Thanks. > > -- > tejun
Hello, On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote: > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > memory at least in our case. The sharing across them comes down to things > > like some common library pages which don't really account for much these > > days. > > Keep in mind that even a single page charged to a memcg and used by > another memcg is sufficient to result in a zombie memcg. I mean, yeah, that's a separate issue or rather a subset which isn't all that controversial. That can be deterministically solved by reparenting to the parent like how slab is handled. I think the "deterministic" part is important here. As you said, even a single page can pin a dying cgroup. > > > Keep in mind that the environment is dynamic, workloads are constantly > > > coming and going. Even if find the perfect nesting to appropriately > > > scope resources, some rescheduling may render the hierarchy obsolete > > > and require us to start over. > > > > Can you please go into more details on how much memory is shared for what > > across unrelated dynamic workloads? That sounds different from other use > > cases. > > I am trying to collect more information from our fleet, but the > application restarting in a different cgroup is not what is happening > in our case. It is not easy to find out exactly what is going on on This is the point that Johannes raised but I don't think the current proposal would make things more deterministic. From what I can see, it actually pushes it towards even less predictability. Currently, yeah, some pages may end up in cgroups which aren't the majority user but it at least is clear how that would happen. The proposed change adds layers of indeterministic behaviors on top. I don't think that's the direction we want to go. > machines and where the memory is coming from due to the > indeterministic nature of charging. The goal of this proposal is to > let the kernel handle leftover memory in zombie memcgs because it is > not always obvious to userspace what's going on (like it's not obvious > to me now where exactly is the sharing happening :) ). > > One thing to note is that in some cases, maybe a userspace bug or > failed cleanup is a reason for the zombie memcgs. Ideally, this > wouldn't happen, but it would be nice to have a fallback mechanism in > the kernel if it does. I'm not disagreeing on that. Our handling of pages owned by dying cgroups isn't great but I don't think the proposed change is an acceptable solution. Thanks.
On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote: > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > > memory at least in our case. The sharing across them comes down to things > > > like some common library pages which don't really account for much these > > > days. > > > > Keep in mind that even a single page charged to a memcg and used by > > another memcg is sufficient to result in a zombie memcg. > > I mean, yeah, that's a separate issue or rather a subset which isn't all > that controversial. That can be deterministically solved by reparenting to > the parent like how slab is handled. I think the "deterministic" part is > important here. As you said, even a single page can pin a dying cgroup. There are serious flaws with reparenting that I mentioned above. We do it for kernel memory, but that's because we really have no other choice. Oftentimes the memory is not reclaimable and we cannot find an owner for it. This doesn't mean it's the right answer for user memory. The semantics are new compared to normal charging (as opposed to recharging, as I explain below). There is an extra layer of indirection that we did not (as far as I know) measure the impact of. Parents end up with pages that they never used, and we have no observability into where they came from. Most importantly, over time user memory will keep accumulating at the root, reducing the accuracy and usefulness of accounting, effectively an accounting leak and reduction of capacity. Memory that is not attributed to any user, aka system overhead. > > > > > Keep in mind that the environment is dynamic, workloads are constantly > > > > coming and going. Even if we find the perfect nesting to appropriately > > > > scope resources, some rescheduling may render the hierarchy obsolete > > > > and require us to start over. > > > > > > Can you please go into more details on how much memory is shared for what > > > across unrelated dynamic workloads? That sounds different from other use > > > cases. > > > > I am trying to collect more information from our fleet, but the > > application restarting in a different cgroup is not what is happening > > in our case. It is not easy to find out exactly what is going on on > > This is the point that Johannes raised but I don't think the current > proposal would make things more deterministic. From what I can see, it > actually pushes it towards even less predictability. Currently, yeah, some > pages may end up in cgroups which aren't the majority user but it at least > is clear how that would happen. The proposed change adds layers of > indeterministic behaviors on top. I don't think that's the direction we want > to go. I believe recharging is being mis-framed here :) Recharging semantics are not new; recharging is a shortcut for a process that is already happening, focused on offline memcgs. Let's take a step back. It is common practice (at least to my knowledge) to try to reclaim memory from a cgroup before deleting it (by lowering the limit or using memory.reclaim). Reclaim heuristics are biased towards reclaiming memory from offline cgroups. After the memory is reclaimed, if it is used again by a different process, it will be refaulted and charged again (aka recharged) to the new memcg. What recharging is doing is *not* anything new. It is effectively doing what reclaim + refault would do above, with an efficient shortcut.
It avoids the unnecessary fault, avoids disrupting the workload that will access the memory after it is reclaimed, and cleans up zombie memcgs memory faster than reclaim would. Moreover, it works for memory that may not be reclaimable (e.g. because of lack of swap). All the indeterministic behaviors in recharging are exactly the indeterministic behaviors in reclaim. It is very similar. We iterate the lrus, try to isolate and lock folios, etc. This is what reclaim does. Recharging is basically lightweight reclaim + charging again (as opposed to fully reclaiming the memory then refaulting it). We are not introducing new indeterminism or charging semantics. Recharging does exactly what would happen when we reclaim zombie memory. It is just more efficient and accelerated. > > > machines and where the memory is coming from due to the > > indeterministic nature of charging. The goal of this proposal is to > > let the kernel handle leftover memory in zombie memcgs because it is > > not always obvious to userspace what's going on (like it's not obvious > > to me now where exactly is the sharing happening :) ). > > > > One thing to note is that in some cases, maybe a userspace bug or > > failed cleanup is a reason for the zombie memcgs. Ideally, this > > wouldn't happen, but it would be nice to have a fallback mechanism in > > the kernel if it does. > > I'm not disagreeing on that. Our handling of pages owned by dying cgroups > isn't great but I don't think the proposed change is an acceptable solution. I hope the above arguments change your mind :) > > Thanks. > > -- > tejun
Hello, On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote: > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote: > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote: > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > > > memory at least in our case. The sharing across them comes down to things > > > > like some common library pages which don't really account for much these > > > > days. > > > > > > Keep in mind that even a single page charged to a memcg and used by > > > another memcg is sufficient to result in a zombie memcg. > > > > I mean, yeah, that's a separate issue or rather a subset which isn't all > > that controversial. That can be deterministically solved by reparenting to > > the parent like how slab is handled. I think the "deterministic" part is > > important here. As you said, even a single page can pin a dying cgroup. > > There are serious flaws with reparenting that I mentioned above. We do > it for kernel memory, but that's because we really have no other > choice. Oftentimes the memory is not reclaimable and we cannot find an > owner for it. This doesn't mean it's the right answer for user memory. > > The semantics are new compared to normal charging (as opposed to > recharging, as I explain below). There is an extra layer of > indirection that we did not (as far as I know) measure the impact of. > Parents end up with pages that they never used and we have no > observability into where it came from. Most importantly, over time > user memory will keep accumulating at the root, reducing the accuracy > and usefulness of accounting, effectively an accounting leak and > reduction of capacity. Memory that is not attributed to any user, aka > system overhead. That really sounds like the setup is missing cgroup layers tracking persistent resources. Most of the problems you describe can be solved by adding cgroup layers at the right spots which would usually align with the logical structure of the system, right? ... > I believe recharging is being mis-framed here :) > > Recharging semantics are not new, it is a shortcut to a process that > is already happening that is focused on offline memcgs. Let's take a > step back. Yeah, it does sound better when viewed that way. I'm still not sure what extra problems it solves tho. We experienced similar problems but AFAIK all of them came down to needing the appropriate hierarchical structure to capture how resources are being used on systems. Thanks.
On Fri, Jul 21, 2023 at 12:18 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote: > > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote: > > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote: > > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > > > > memory at least in our case. The sharing across them comes down to things > > > > > like some common library pages which don't really account for much these > > > > > days. > > > > > > > > Keep in mind that even a single page charged to a memcg and used by > > > > another memcg is sufficient to result in a zombie memcg. > > > > > > I mean, yeah, that's a separate issue or rather a subset which isn't all > > > that controversial. That can be deterministically solved by reparenting to > > > the parent like how slab is handled. I think the "deterministic" part is > > > important here. As you said, even a single page can pin a dying cgroup. > > > > There are serious flaws with reparenting that I mentioned above. We do > > it for kernel memory, but that's because we really have no other > > choice. Oftentimes the memory is not reclaimable and we cannot find an > > owner for it. This doesn't mean it's the right answer for user memory. > > > > The semantics are new compared to normal charging (as opposed to > > recharging, as I explain below). There is an extra layer of > > indirection that we did not (as far as I know) measure the impact of. > > Parents end up with pages that they never used and we have no > > observability into where it came from. Most importantly, over time > > user memory will keep accumulating at the root, reducing the accuracy > > and usefulness of accounting, effectively an accounting leak and > > reduction of capacity. Memory that is not attributed to any user, aka > > system overhead. > > That really sounds like the setup is missing cgroup layers tracking > persistent resources. Most of the problems you describe can be solved by > adding cgroup layers at the right spots which would usually align with the > logical structure of the system, right? It is difficult to track down all persistent/shareable resources and find the users, especially when both the resources and the users are dynamically changed. A simple example is text files for a shared library or sidecar processes that run with different workloads and need to have their usage charged to the workload, but they may have memory. For those cases there is no layering that would work. More practically, sometimes userspace just doesn't even know what exactly is being shared by whom. > > ... > > I believe recharging is being mis-framed here :) > > > > Recharging semantics are not new, it is a shortcut to a process that > > is already happening that is focused on offline memcgs. Let's take a > > step back. > > Yeah, it does sound better when viewed that way. I'm still not sure what > extra problems it solves tho. We experienced similar problems but AFAIK all > of them came down to needing the appropriate hierarchical structure to > capture how resources are being used on systems. It solves the problem of zombie memcgs and unaccounted memory. It is great that in some cases an appropriate hierarchy structure fixes the problem by accurately capturing how resources are being shared, but in some cases it's not as straightforward. 
Recharging attempts to fix the problem in a way that is more consistent with current semantics and more appealing than reparenting in terms of rightful ownership. Some systems are not rebooted for months. Can you imagine how much memory can accumulate at the root (escaping all accounting) over months of reparenting? > > Thanks. > > -- > tejun
On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote: > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote: > > > > Hello, > > > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote: > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > > > memory at least in our case. The sharing across them comes down to things > > > > like some common library pages which don't really account for much these > > > > days. > > > > > > Keep in mind that even a single page charged to a memcg and used by > > > another memcg is sufficient to result in a zombie memcg. > > > > I mean, yeah, that's a separate issue or rather a subset which isn't all > > that controversial. That can be deterministically solved by reparenting to > > the parent like how slab is handled. I think the "deterministic" part is > > important here. As you said, even a single page can pin a dying cgroup. > > There are serious flaws with reparenting that I mentioned above. We do > it for kernel memory, but that's because we really have no other > choice. Oftentimes the memory is not reclaimable and we cannot find an > owner for it. This doesn't mean it's the right answer for user memory. > > The semantics are new compared to normal charging (as opposed to > recharging, as I explain below). There is an extra layer of > indirection that we did not (as far as I know) measure the impact of. > Parents end up with pages that they never used and we have no > observability into where it came from. Most importantly, over time > user memory will keep accumulating at the root, reducing the accuracy > and usefulness of accounting, effectively an accounting leak and > reduction of capacity. Memory that is not attributed to any user, aka > system overhead. Reparenting has been the behavior since the first iteration of cgroups in the kernel. The initial implementation would loop over the LRUs and reparent pages synchronously during rmdir. This had some locking issues, so we switched to the current implementation of just leaving the zombie memcg behind but neutralizing its controls. Thanks to Roman's objcg abstraction, we can now go back to the old implementation of directly moving pages up to avoid the zombies. However, these were pure implementation changes. The user-visible semantics never varied: when you delete a cgroup, any leftover resources are subject to control by the remaining parent cgroups. Don't remove control domains if you still need to control resources. But none of this is new or would change in any way! Neutralizing controls of a zombie cgroup results in the same behavior and accounting as linking the pages to the parent cgroup's LRU! The only thing that's new is the zombie cgroups. We can fix that by effectively going back to the earlier implementation, but thanks to objcg without the locking problems. I just wanted to address this, because your description/framing of reparenting strikes me as quite wrong.
On Fri, Jul 21, 2023 at 1:44 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote: > > On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > Hello, > > > > > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote: > > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo <tj@kernel.org> wrote: > > > > > memory at least in our case. The sharing across them comes down to things > > > > > like some common library pages which don't really account for much these > > > > > days. > > > > > > > > Keep in mind that even a single page charged to a memcg and used by > > > > another memcg is sufficient to result in a zombie memcg. > > > > > > I mean, yeah, that's a separate issue or rather a subset which isn't all > > > that controversial. That can be deterministically solved by reparenting to > > > the parent like how slab is handled. I think the "deterministic" part is > > > important here. As you said, even a single page can pin a dying cgroup. > > > > There are serious flaws with reparenting that I mentioned above. We do > > it for kernel memory, but that's because we really have no other > > choice. Oftentimes the memory is not reclaimable and we cannot find an > > owner for it. This doesn't mean it's the right answer for user memory. > > > > The semantics are new compared to normal charging (as opposed to > > recharging, as I explain below). There is an extra layer of > > indirection that we did not (as far as I know) measure the impact of. > > Parents end up with pages that they never used and we have no > > observability into where it came from. Most importantly, over time > > user memory will keep accumulating at the root, reducing the accuracy > > and usefulness of accounting, effectively an accounting leak and > > reduction of capacity. Memory that is not attributed to any user, aka > > system overhead. > > Reparenting has been the behavior since the first iteration of cgroups > in the kernel. The initial implementation would loop over the LRUs and > reparent pages synchronously during rmdir. This had some locking > issues, so we switched to the current implementation of just leaving > the zombie memcg behind but neutralizing its controls. Thanks for the context. > > Thanks to Roman's objcg abstraction, we can now go back to the old > implementation of directly moving pages up to avoid the zombies. > > However, these were pure implementation changes. The user-visible > semantics never varied: when you delete a cgroup, any leftover > resources are subject to control by the remaining parent cgroups. > Don't remove control domains if you still need to control resources. > But none of this is new or would change in any way! The problem is that you cannot fully monitor or control all the resources charged to a control domain. The example of common shared libraries stands, the pages are charged on first touch basis. You can't easily control it or monitor who is charged for what exactly. Even if you can find out, is the answer to leave the cgroup alive forever because it is charged for a shared resource? > Neutralizing > controls of a zombie cgroup results in the same behavior and > accounting as linking the pages to the parent cgroup's LRU! > > The only thing that's new is the zombie cgroups. We can fix that by > effectively going back to the earlier implementation, but thanks to > objcg without the locking problems. 
> > I just wanted to address this, because your description/framing of > reparenting strikes me as quite wrong. Thanks for the context, and sorry if my framing was inaccurate. I was more focused on the in-kernel semantics rather than user-visible semantics. Nonetheless, with today's status or with reparenting, once the memory is at the root level (whether reparented to the root level, or in a zombie memcg whose parent is root), the memory has effectively escaped accounting. This is not a new problem that reparenting would introduce, but it's a problem that recharging is trying to fix that reparenting won't. As I outlined above, the semantics of recharging are not new, they are equivalent to reclaiming and refaulting the memory in a more accelerated/efficient manner. The indeterminism in recharging is very similar to reclaiming and refaulting. What do you think?
[Sorry for being late to this discussion] On Thu 20-07-23 11:35:15, Johannes Weiner wrote: [...] > I'm super skeptical of this proposal. Agreed. > Recharging *might* be the most desirable semantics from a user pov, > but only if it applies consistently to the whole memory footprint. > There is no mention of slab allocations such as inodes, dentries, > network buffers etc. which can be a significant part of a cgroup's > footprint. These are currently reparented. I don't think doing one > thing with half of the memory, and a totally different thing with the > other half upon cgroup deletion is going to be acceptable semantics. > > It appears this also brings back the reliability issue that caused us > to deprecate charge moving. The recharge path has trylocks, LRU > isolation attempts, GFP_ATOMIC allocations. These introduce a variable > error rate into the relocation process, which causes pages that should > belong to the same domain to be scattered around all over the place. > It also means that zombie pinning still exists, but it's now even more > influenced by timing and race conditions, and so less predictable. > > There are two issues being conflated here: > > a) the problem of zombie cgroups, and > > b) who controls resources that outlive the control domain. > > For a), reparenting is still the most reasonable proposal. It's > reliable for one, but it also fixes the problem fully within the > established, user-facing semantics: resources that belong to a cgroup > also hierarchically belong to all ancestral groups; if those resources > outlive the last-level control domain, they continue to belong to the > parents. This is how it works today, and this is how it continues to > work with reparenting. The only difference is that those resources no > longer pin a dead cgroup anymore, but instead are physically linked to > the next online ancestor. Since dead cgroups have no effective control > parameters anymore, this is semantically equivalent - it's just a more > memory efficient implementation of the same exact thing. > > b) is a discussion totally separate from this. We can argue what we > want this behavior to be, but I'd argue strongly that whatever we do > here should apply to all resources managed by the controller equally. > > It could also be argued that if you don't want to lose control over a > set of resources, then maybe don't delete their control domain while > they are still alive and in use. For example, when restarting a > workload, and the new instance is expected to have largely the same > workingset, consider reusing the cgroup instead of making a new one. > > For the zombie problem, I think we should merge Muchun's patches > ASAP. They've been proposed several times, they have Roman's reviews > and acks, and they do not change user-facing semantics. There is no > good reason not to merge them. Yes, fully agreed on both points. The problem with zombies is real but reparenting should address it for a large part. Ownership is a different problem. We have discussed that at LSFMM this year and in the past as well I believe. What we probably need is a concept of taking an ownership of the memory (something like madvise(MADV_OWN, range) or fadvise for fd based resources). This would allow the caller to take ownership of the said resource (like memcg charge of it). I understand that would require some changes to existing workloads. 
Whatever the interface will be, it has to be explicit, otherwise we run into problems with unaccounted resources sitting without any actual ownership, and with nondeterministic, time-dependent hopping between owners. In other words, nobody should be able to drop responsibility for any object while it is still consuming resources.
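To make the explicit-ownership idea concrete, here is a purely hypothetical sketch: MADV_OWN does not exist in any kernel, and the constant, its value, and its semantics below are invented solely to illustrate what the proposed interface might look like from a workload that considers itself the rightful owner of a shared mapping; the file path is an arbitrary example.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_OWN
#define MADV_OWN 100	/* hypothetical advice value, not a real ABI */
#endif

int main(void)
{
	int fd = open("/shared/cache.db", O_RDONLY);	/* example path */
	size_t len = 1 << 20;
	void *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* The workload that considers itself the rightful owner would
	 * explicitly ask for the charge to be moved to its own memcg,
	 * instead of relying on first-touch or automatic recharging.
	 * With today's kernels this call simply fails with EINVAL. */
	if (madvise(p, len, MADV_OWN))
		perror("madvise(MADV_OWN)");
	return 0;
}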