Message ID: 20220818143118.17733-1-laoar.shao@gmail.com (mailing list archive)
Series: bpf: Introduce selectable memcg for bpf map
On Thu, Aug 18, 2022 at 7:31 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > After commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1") > we can open a cgroup1 dir as well. So let's update the comment. I missed updating the comment in my change. Thanks for catching this. > > Cc: Yosry Ahmed <yosryahmed@google.com> > Cc: Hao Luo <haoluo@google.com> > Cc: Tejun Heo <tj@kernel.org> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com> > --- > kernel/cgroup/cgroup.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 5f4502a..b7d2e55 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -6625,7 +6625,7 @@ struct cgroup *cgroup_get_from_path(const char *path) > > /** > * cgroup_get_from_fd - get a cgroup pointer from a fd > - * @fd: fd obtained by open(cgroup2_dir) > + * @fd: fd obtained by open(cgroup_dir) > * > * Find the cgroup from a fd which should be obtained > * by opening a cgroup directory. Returns a pointer to the > -- > 1.8.3.1 >
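For illustration, a minimal userspace sketch of how such an fd is obtained as the updated comment describes (the paths are examples only; after commit f3a2aebdd6fb both cgroup1 and cgroup2 directories work):

#include <fcntl.h>

/* open a cgroup directory and return the fd expected by interfaces
 * built on cgroup_get_from_fd() */
int get_cgroup_fd(const char *cgroup_dir)
{
    /* e.g. "/sys/fs/cgroup/foo" (cgroup2) or
     * "/sys/fs/cgroup/memory/foo" (cgroup1) */
    return open(cgroup_dir, O_RDONLY | O_DIRECTORY);
}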
Hello, On Thu, Aug 18, 2022 at 02:31:06PM +0000, Yafang Shao wrote: > After switching to memcg-based bpf memory accounting to limit the bpf > memory, some unexpected issues jumped out at us. > 1. The memory usage is not consistent between the first generation and > new generations. > 2. After the first generation is destroyed, the bpf memory can't be > limited if the bpf maps are not preallocated, because they will be > reparented. > > This patchset tries to resolve these issues by introducing an > independent memcg to limit the bpf memory. memcg folks would have better informed opinions but from the generic cgroup pov I don't think this is a good direction to take. This isn't a problem limited to bpf progs and it doesn't make a whole lot of sense to solve this just for bpf. We have the exact same problem for any resources which span multiple instances of a service including page cache, tmpfs instances and any other thing which can persist longer than process lifetime. My current opinion is that this is best solved by introducing an extra cgroup layer to represent the persistent entity and putting the per-instance cgroup under it. It does require reorganizing how things are organized from the userspace POV but the end result is really desirable. We get entities accurately representing what needs to be tracked and control over the granularity of accounting and control (e.g. folks who don't care about telling apart the current instance's usage can simply not enable controllers at the persistent entity level). We surely can discuss other approaches but my current intuition is that it'd be really difficult to come up with a better solution than layering to introduce persistent service entities. So, please consider the approach nacked for the time being. Thanks.
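As a concrete illustration of the layering being proposed, a minimal sketch against cgroup2 (all paths and names are examples only; the point is that controllers are enabled for the persistent service-level cgroup while the instance level can be left with no controllers of its own):

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* persistent entity, kept across instance restarts */
    mkdir("/sys/fs/cgroup/system.slice/hello.service", 0755);
    /* per-instance cgroup, destroyed when the instance exits */
    mkdir("/sys/fs/cgroup/system.slice/hello.service/instance-1", 0755);

    /* enable memory accounting/control for hello.service; we do not
     * enable anything in hello.service/cgroup.subtree_control, so the
     * instance's usage is simply accounted at the service level */
    FILE *f = fopen("/sys/fs/cgroup/system.slice/cgroup.subtree_control", "w");
    if (f) {
        fputs("+memory", f);
        fclose(f);
    }
    return 0;
}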
On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote: > We have the exact same problem for any resources which span multiple > instances of a service including page cache, tmpfs instances and any other > thing which can persist longer than procss life time. My current opinion is To expand a bit more on this point, once we start including page cache and tmpfs, we now get entangled with memory reclaim which then brings in IO and not-yet-but-eventually CPU usage. Once you start splitting the tree like you're suggesting here, all those will break down and now we have to worry about how to split resource accounting and control for the same entities across two split branches of the tree, which doesn't really make any sense. So, we *really* don't wanna paint ourselves into that kind of a corner. This is a dead-end. Please ditch it. Thanks.
On Fri, Aug 19, 2022 at 6:20 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Thu, Aug 18, 2022 at 02:31:06PM +0000, Yafang Shao wrote: > > After switching to memcg-based bpf memory accounting to limit the bpf > > memory, some unexpected issues jumped out at us. > > 1. The memory usage is not consistent between the first generation and > > new generations. > > 2. After the first generation is destroyed, the bpf memory can't be > > limited if the bpf maps are not preallocated, because they will be > > reparented. > > > > This patchset tries to resolve these issues by introducing an > > independent memcg to limit the bpf memory. > > memcg folks would have better informed opinions but from generic cgroup pov > I don't think this is a good direction to take. This isn't a problem limited > to bpf progs and it doesn't make whole lot of sense to solve this for bpf. > This change is bpf-specific. It doesn't refactor a whole lot of things. > We have the exact same problem for any resources which span multiple > instances of a service including page cache, tmpfs instances and any other > thing which can persist longer than procss life time. My current opinion is > that this is best solved by introducing an extra cgroup layer to represent > the persistent entity and put the per-instance cgroup under it. > It is not practical for k8s, because before the persistent entity is introduced the cgroup dir is stateless, whereas afterwards it becomes stateful. Please don't keep turning a blind eye to k8s. > It does require reorganizing how things are organized from userspace POV but > the end result is really desirable. We get entities accurately representing > what needs to be tracked and control over the granularity of accounting and > control (e.g. folks who don't care about telling apart the current > instance's usage can simply not enable controllers at the persistent entity > level). > Please also think about why k8s refuses to use cgroup2. > We surely can discuss other approaches but my current intuition is that it'd > be really difficult to come up with a better solution than layering to > introduce persistent service entities. > > So, please consider the approach nacked for the time being. > It doesn't make sense to nack it. I will explain to you by replying to your other email.
On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote: > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote: > > We have the exact same problem for any resources which span multiple > > instances of a service including page cache, tmpfs instances and any other > > thing which can persist longer than procss life time. My current opinion is > > To expand a bit more on this point, once we start including page cache and > tmpfs, we now get entangled with memory reclaim which then brings in IO and > not-yet-but-eventually CPU usage. Introduce-a-new-layer vs. introduce-a-new-cgroup: which one has more overhead? > Once you start splitting the tree like > you're suggesting here, all those will break down and now we have to worry > about how to split resource accounting and control for the same entities > across two split branches of the tree, which doesn't really make any sense. > k8s is already broken by the memcg accounting of bpf memory. In case you missed it, I'll paste it below. [0]"1. The memory usage is not consistent between the first generation and new generations." This issue will persist even if you introduce a new layer. > So, we *really* don't wanna paint ourselves into that kind of a corner. This > is a dead-end. Please ditch it. > It makes no sense to ditch it, because the hierarchy I described in the commit log is *one* use case of the selectable memcg, but not the *only* use case of it. If you dislike that hierarchy, I will remove it to avoid misleading you. Even if you introduce a new layer, you still need the selectable memcg. For example, to avoid the issue I described in [0], you still need to charge to the parent cgroup instead of the current cgroup. That's why I described in the commit log that the selectable memcg is flexible.
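To make the "selectable memcg" idea concrete, a purely hypothetical userspace sketch; the excerpt does not show the patchset's actual uAPI, so the memcg_fd field below is invented for illustration of "charge the map to a chosen ancestor's memcg instead of the creator's":

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int create_map_charged_to(int parent_cgroup_fd)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = 4;
    attr.value_size  = 8;
    attr.max_entries = 10240;
    /* hypothetical: attr.memcg_fd = parent_cgroup_fd;
     * i.e. "charge this map's memory to the memcg behind this cgroup fd"
     * rather than to the creating task's memcg */
    (void)parent_cgroup_fd;

    return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}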
Hello, On Fri, Aug 19, 2022 at 08:59:20AM +0800, Yafang Shao wrote: > On Fri, Aug 19, 2022 at 6:20 AM Tejun Heo <tj@kernel.org> wrote: > > memcg folks would have better informed opinions but from generic cgroup pov > > I don't think this is a good direction to take. This isn't a problem limited > > to bpf progs and it doesn't make whole lot of sense to solve this for bpf. > > This change is bpf specific. It doesn't refactor a whole lot of things. I'm not sure what point the above sentence is making. It may not change a lot of code but it does introduce a significantly new mode of operation which affects memcg and cgroup in general. > > We have the exact same problem for any resources which span multiple > > instances of a service including page cache, tmpfs instances and any other > > thing which can persist longer than procss life time. My current opinion is > > that this is best solved by introducing an extra cgroup layer to represent > > the persistent entity and put the per-instance cgroup under it. > > It is not practical on k8s. > Because, before the persistent entity, the cgroup dir is stateless. > After, it is stateful. > Pls, don't continue keeping blind eyes on k8s. Can you please elaborate why it isn't practical for k8s? I don't know the details of k8s and what you wrote above is not a detailed enough technical argument. > > It does require reorganizing how things are organized from userspace POV but > > the end result is really desirable. We get entities accurately representing > > what needs to be tracked and control over the granularity of accounting and > > control (e.g. folks who don't care about telling apart the current > > instance's usage can simply not enable controllers at the persistent entity > > level). > > Pls.s also think about why k8s refuse to use cgroup2. This attitude really bothers me. You aren't spelling it out fully but instead of engaging in the technical argument at hand, you're putting forth conforming upstream to the current k8s's assumptions and behaviors as a requirement and then insisting that it's upstream's fault that k8s is staying with cgroup1. This is not an acceptable form of argument and it would be irresponsible to grant any kind of weight to this line of reasoning. k8s may seem like the world to you but it is one of many use cases of the upstream kernel. We all should pay attention to the use cases and technical arguments to determine how we chart our way forward, but being k8s or whoever else clearly isn't a waiver to make this kind of unilateral demand. It's okay to emphasize the gravity of the specific use case at hand but please realize that it's one of the many factors that should be considered and sometimes one which can and should get trumped by others. Thanks.
Hello, On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote: > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote: > > > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote: > > > We have the exact same problem for any resources which span multiple > > > instances of a service including page cache, tmpfs instances and any other > > > thing which can persist longer than procss life time. My current opinion is > > > > To expand a bit more on this point, once we start including page cache and > > tmpfs, we now get entangled with memory reclaim which then brings in IO and > > not-yet-but-eventually CPU usage. > > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead? Introducing a new layer in cgroup2 doesn't mean that any specific resource controller is enabled, so there is no runtime overhead difference. In terms of logical complexity, introducing a localized layer seems a lot more straightforward than building a whole separate tree. Note that the same applies to cgroup1 where collapsed controller tree is represented by simply not creating those layers in that particular controller tree. No matter how we cut the problem here, if we want to track these persistent resources, we have to create a cgroup to host them somewhere. The discussion we're having is mostly around where to put them. With your proposal, it can be anywhere and you draw out an example where the persistent cgroups form their own separate tree. What I'm saying is that the logical place to put it is where the current resource consumption is and we just need to put the persistent entity as the parent of the instances. Flexibility, just like anything else, isn't free. Here, if we extrapolate this approach, the cost is evidently hefty in that it doesn't generically work with the basic resource control structure. > > Once you start splitting the tree like > > you're suggesting here, all those will break down and now we have to worry > > about how to split resource accounting and control for the same entities > > across two split branches of the tree, which doesn't really make any sense. > > The k8s has already been broken thanks to the memcg accounting on bpf memory. > If you ignored it, I paste it below. > [0]"1. The memory usage is not consistent between the first generation and > new generations." > > This issue will persist even if you introduce a new layer. Please watch your tone. Again, this isn't a problem specific to k8s. We have the same problem with e.g. persistent tmpfs. One idea which I'm not against is allowing specific resources to be charged to an ancestor. We gotta think carefully about how such charges should be granted / denied but an approach like that jives well with the existing hierarchical control structure and because introducing a persistent layer does too, the combination of the two works well. > > So, we *really* don't wanna paint ourselves into that kind of a corner. This > > is a dead-end. Please ditch it. > > It makes non-sensen to ditch it. > Because, the hierarchy I described in the commit log is *one* use case > of the selectable memcg, but not *the only one* use case of it. If you > dislike that hierarchy, I will remove it to avoid misleading you. But if you drop that, what'd be the rationale for adding what you're proposing? Why would we want bpf memory charges to be attached any part of the hierarchy? > Even if you introduce a new layer, you still need the selectable memcg. 
> For example, to avoid the issue I described in [0], you still need to > charge to the parent cgroup instead of the current cgroup. As I wrote above, we've been discussing the above. Again, I'd be a lot more amenable to such approach because it fits with how everything is structured. > That's why I described in the commit log that the selectable memcg is flexible. Hopefully, my point on this is clear by now. Thanks.
On Sat, Aug 20, 2022 at 1:06 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Fri, Aug 19, 2022 at 09:09:25AM +0800, Yafang Shao wrote: > > On Fri, Aug 19, 2022 at 6:33 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > On Thu, Aug 18, 2022 at 12:20:33PM -1000, Tejun Heo wrote: > > > > We have the exact same problem for any resources which span multiple > > > > instances of a service including page cache, tmpfs instances and any other > > > > thing which can persist longer than procss life time. My current opinion is > > > > > > To expand a bit more on this point, once we start including page cache and > > > tmpfs, we now get entangled with memory reclaim which then brings in IO and > > > not-yet-but-eventually CPU usage. > > > > Introduce-a-new-layer vs introduce-a-new-cgroup, which one is more overhead? > > Introducing a new layer in cgroup2 doesn't mean that any specific resource > controller is enabled, so there is no runtime overhead difference. In terms > of logical complexity, introducing a localized layer seems a lot more > straightforward than building a whole separate tree. > > Note that the same applies to cgroup1 where collapsed controller tree is > represented by simply not creating those layers in that particular > controller tree. > No, we have observed in our production environment that multiple layers of cpuacct cause an obvious performance hit due to cache misses. > No matter how we cut the problem here, if we want to track these persistent > resources, we have to create a cgroup to host them somewhere. The discussion > we're having is mostly around where to put them. With your proposal, it can > be anywhere and you draw out an example where the persistent cgroups form > their own separate tree. What I'm saying is that the logical place to put it > is where the current resource consumption is and we just need to put the > persistent entity as the parent of the instances. > > Flexibility, just like anything else, isn't free. Here, if we extrapolate > this approach, the cost is evidently hefty in that it doesn't generically > work with the basic resource control structure. > > > > Once you start splitting the tree like > > > you're suggesting here, all those will break down and now we have to worry > > > about how to split resource accounting and control for the same entities > > > across two split branches of the tree, which doesn't really make any sense. > > > > The k8s has already been broken thanks to the memcg accounting on bpf memory. > > If you ignored it, I paste it below. > > [0]"1. The memory usage is not consistent between the first generation and > > new generations." > > > > This issue will persist even if you introduce a new layer. > > Please watch your tone. > Hm? I apologize if my words offended you. But could you please take a serious look at the patchset before giving a NACK? You didn't even want to know the background before you sent your NACK. > Again, this isn't a problem specific to k8s. We have the same problem with > e.g. persistent tmpfs. One idea which I'm not against is allowing specific > resources to be charged to an ancestor. We gotta think carefully about how > such charges should be granted / denied but an approach like that jives well > with the existing hierarchical control structure and because introducing a > persistent layer does too, the combination of the two works well. > > > > So, we *really* don't wanna paint ourselves into that kind of a corner. This > > > is a dead-end. Please ditch it. > > > > It makes non-sensen to ditch it. 
> > Because, the hierarchy I described in the commit log is *one* use case > > of the selectable memcg, but not *the only one* use case of it. If you > > dislike that hierarchy, I will remove it to avoid misleading you. > > But if you drop that, what'd be the rationale for adding what you're > proposing? Why would we want bpf memory charges to be attached any part of > the hierarchy? > I have explained it to you, but unfortunately you ignored it again. I don't mind explaining it again.

Parent-memcg
    \
     Child-memcg (k8s pod)

The user can charge the memory to the parent directly without charging it into the k8s pod. Then memory.stat is consistent across different generations. > > Even if you introduce a new layer, you still need the selectable memcg. > > For example, to avoid the issue I described in [0], you still need to > > charge to the parent cgroup instead of the current cgroup. > > As I wrote above, we've been discussing the above. Again, I'd be a lot more > amenable to such approach because it fits with how everything is structured. > > > That's why I described in the commit log that the selectable memcg is flexible. > > Hopefully, my point on this is clear by now. > Unfortunately, you didn't want to get my point.
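For context on what "charging to the parent directly" would change, a paraphrased sketch (not verbatim kernel code) of how bpf map allocations are charged on current kernels: the map remembers the memcg of the task that created it, and later allocations are charged to that memcg.

#ifdef CONFIG_MEMCG_KMEM
static void bpf_map_save_memcg(struct bpf_map *map)
{
    /* today: the creating task's memcg, i.e. the k8s pod's memcg;
     * a "selectable memcg" would let userspace pick an ancestor here */
    map->memcg = get_mem_cgroup_from_mm(current->mm);
}

void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                           gfp_t flags, int node)
{
    struct mem_cgroup *old_memcg;
    void *ptr;

    /* charge this allocation to the memcg saved at map creation time */
    old_memcg = set_active_memcg(map->memcg);
    ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
    set_active_memcg(old_memcg);
    return ptr;
}
#endif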
(Sorry, this is a resend. I messed up the header in the first posting.) Hello, This thread started on a bpf-specific memory tracking change proposal and went south, but a lot of people who would be interested are already cc'd, so I'm hijacking it to discuss what to do w/ persistent memory usage tracking. Cc'ing Mina and Yosry who were involved in the discussions on the similar problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and experience than me, and Lennart for his thoughts from systemd side. The root problem is that there are resources (almost solely memory currently) that outlive a given instance of a, to use systemd-lingo, service. Page cache is the most common case. Let's say there's system.slice/hello.service. When it runs for the first time, page cache backing its binary will be charged to hello.service. However, when it restarts after e.g. a config change, when the initial hello.service cgroup gets destroyed, we reparent the page cache charges to the parent system.slice and when the second instance starts, its binary will stay charged to system.slice. Over time, some may get reclaimed and refaulted into the new hello.service but that's not guaranteed and most hot pages likely won't. The same problem exists for any memory which is not freed synchronously when the current instance exits. While this isn't a problem for many cases, it's not difficult to imagine situations where the amount of memory which ends up getting pushed to the parent is significant, even a clear majority, with big page cache footprint, persistent tmpfs instances and so on, creating issues with accounting accuracy and thus control. I think there are two broad issues to discuss here: [1] Can this be solved by layering the instance cgroups under a persistent entity cgroup? So, instead of system.slice/hello.service, the application runs inside something like system.slice/hello.service/hello.service.instance and the service-level cgroup hello.service is not destroyed as long as it is something worth tracking on the system. The benefits are a. While requiring changing how userland organizes cgroup hierarchy, it is a straight-forward extension of the current architecture and doesn't require any conceptual or structural changes. All the accounting and control schemes work exactly the same as before. The only difference is that we now have a persistent entity representing each service as we want to track their persistent resource usages. b. Per-instance tracking and control is optional. To me, it seems that the persistent resource usages would be more meaningful than per-instance and tracking down to the persistent usages shouldn't add noticeable runtime overheads while keeping per-instance process management niceties and allowing use cases to opt in to per-instance resource tracking and control as needed. The complications are: a. It requires changing cgroup hierarchy in a very visible way. b. What should be the lifetime rules for persistent cgroups? Do we keep them around forever or maybe they can be created on the first use and kept around until the service is removed from the system? When the persistent cgroup is removed, do we need to make sure that the remaining resource usages are low enough? Note that this problem exists for any approach that tries to track persistent usages no matter how it's done. c. Do we need to worry about nesting overhead? 
Given that there's no reason to enable controllers w/o persistent states for the instance level and the nesting overhead is pretty low for memcg, this doesn't seem like a problem to me. If this becomes a problem, we just need to fix it. A couple alternatives discussed are: a. Userspace keeps reusing the same cgroup for different instances of the same service. This simplifies some aspects while making others more complicated. e.g. Determining the current instance's CPU or IO usages now requires the monitoring software to remember what they were when this instance started and calculate the deltas. Also, if some use cases want to distinguish persistent vs. instance usages (more on this later), this isn't gonna work. That said, this definitely is attractive in that it minimizes overt user visible changes. b. Memory is disassociated rather than just reparented on cgroup destruction and gets re-charged to the next first user. This is attractive in that it doesn't require any userspace changes; however, I'm not sure how this would work for non-pageable memory usages such as bpf maps. How would we detect the next first usage? [2] Whether and how to solve first and second+ instance charge differences. If we take the layering approach, the first instance will get charged for all memory that it uses while the second+ instances likely won't get charged for a lot of persistent usages. I don't think there is a consensus on whether this needs to be solved and I don't have enough context to form a strong opinion. memcg folks are a lot better equipped to make this decision. Assuming this needs to be solved, here's a braindump to be taken with a big pinch of salt: I have a bit of a difficult time imagining a perfect solution given that whether a given page cache page is persistent or not would be really difficult to know (or maybe all page cache is persistent by default while anon is not). However, the problem still seems worthwhile to consider for big ticket items such as persistent tmpfs mounts and huge bpf maps as they can easily make the differences really big. If we want to solve this problem, here are options that I can think of: a. Let userspace put the charges where they belong using the current mechanisms. i.e. create persistent entities in the persistent parent cgroup while there's no current instance. Pro: It won't require any major kernel or interface changes. There still needs to be some tweaking such as allowing tmpfs pages to be always charged to the cgroup which created the instance (maybe as long as it's an ancestor of the faulting cgroup?) but nothing too invasive. Con: It may not be flexible enough. b. Let userspace specify which cgroup to charge for some constructs like tmpfs and bpf maps. The key problems with this approach are 1. How to grant/deny what can be charged where. We must ensure that a descendant can't move charges up or across the tree without the ancestors allowing it. 2. How to specify the cgroup to charge. While specifying the target cgroup directly might seem like an obvious solution, it has a couple rather serious problems. First, if the descendant is inside a cgroup namespace, it might not be able to see the target cgroup at all. Second, it's an interface which is likely to cause misunderstandings on how it can be used. It's too broad an interface. One solution that I can think of is leveraging the resource domain concept which is currently only used for threaded cgroups. 
All memory usages of threaded cgroups are charged to their resource domain cgroup which hosts the processes for those threads. The persistent usages have a similar pattern, so maybe the service level cgroup can declare that it's the encompassing resource domain and the instance cgroup can say whether it's gonna charge e.g. the tmpfs instance to its own or the encompassing resource domain. This has the benefit that the user only needs to indicate its intention without worrying about how cgroups are composed and what their IDs are. It just indicates whether the given resource is persistent and if the cgroup hierarchy is set up for that, it gets charged that way and if not it can be just charged to itself. This is a shower thought but, if we allow nesting such domains (and maybe name them), we can use it for shared resources too so that co-services are put inside a shared slice and shared resources are pushed to the slice level. This became pretty long. I obviously have a pretty strong bias towards solving this within the current basic architecture but other than that most of these decisions are best made by memcg folks. We can hopefully build some consensus on the issue. Thanks.
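For reference, the existing threaded-cgroup behavior that the resource domain analogy builds on can be demonstrated with plain cgroupfs operations (paths are examples; the proposed persistent/shared declaration itself is not an existing interface):

#include <stdio.h>
#include <sys/stat.h>

/* sketch of the existing mechanism: memory used by threads placed in the
 * threaded child is charged to the resource domain cgroup above it */
int main(void)
{
    mkdir("/sys/fs/cgroup/svc", 0755);          /* becomes the resource domain */
    mkdir("/sys/fs/cgroup/svc/worker", 0755);   /* will be made threaded */

    FILE *f = fopen("/sys/fs/cgroup/svc/worker/cgroup.type", "w");
    if (f) {
        fputs("threaded", f);                   /* memory charges go to "svc" */
        fclose(f);
    }
    return 0;
}

The proposal would work analogously: the instance cgroup only states that a given resource (a tmpfs instance, a bpf map) is persistent, and the charges land in the encompassing service-level domain if the hierarchy is set up that way.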
Ccing Mina. On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote: > > (Sorry, this is a resend. I messed up the header in the first posting.) > > Hello, > > This thread started on a bpf-specific memory tracking change proposal and > went south, but a lot of people who would be interested are already cc'd, so > I'm hijacking it to discuss what to do w/ persistent memory usage tracking. > > Cc'ing Mina and Yosry who were involved in the discussions on the similar > problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and > experience than me, and Lennart for his thoughts from systemd side. > > The root problem is that there are resources (almost solely memory > currently) that outlive a given instance of a, to use systemd-lingo, > service. Page cache is the most common case. > > Let's say there's system.slice/hello.service. When it runs for the first > time, page cache backing its binary will be charged to hello.service. > However, when it restarts after e.g. a config change, when the initial > hello.service cgroup gets destroyed, we reparent the page cache charges to > the parent system.slice and when the second instance starts, its binary will > stay charged to system.slice. Over time, some may get reclaimed and > refaulted into the new hello.service but that's not guaranteed and most hot > pages likely won't. > > The same problem exists for any memory which is not freed synchronously when > the current instance exits. While this isn't a problem for many cases, it's > not difficult to imagine situations where the amount of memory which ends up > getting pushed to the parent is significant, even clear majority, with big > page cache footprint, persistent tmpfs instances and so on, creating issues > with accounting accuracy and thus control. > > I think there are two broad issues to discuss here: > > [1] Can this be solved by layering the instance cgroups under persistent > entity cgroup? > > So, instead of systemd.slice/hello.service, the application runs inside > something like systemd.slice/hello.service/hello.service.instance and the > service-level cgroup hello.service is not destroyed as long as it is > something worth tracking on the system. > > The benefits are > > a. While requiring changing how userland organizes cgroup hiearchy, it is a > straight-forward extension of the current architecture and doesn't > require any conceptual or structural changes. All the accounting and > control schemes work exactly the same as before. The only difference is > that we now have a persistent entity representing each service as we want > to track their persistent resource usages. > > b. Per-instance tracking and control is optional. To me, it seems that the > persistent resource usages would be more meaningful than per-instance and > tracking down to the persistent usages shouldn't add noticeable runtime > overheads while keeping per-instance process management niceties and > allowing use cases to opt-in for per-instance resource tracking and > control as needed. > > The complications are: > > a. It requires changing cgroup hierarchy in a very visible way. > > b. What should be the lifetime rules for persistent cgroups? Do we keep them > around forever or maybe they can be created on the first use and kept > around until the service is removed from the system? When the persistent > cgroup is removed, do we need to make sure that the remaining resource > usages are low enough? 
Note that this problem exists for any approach > that tries to track persistent usages no matter how it's done. > > c. Do we need to worry about nesting overhead? Given that there's no reason > to enable controllers w/o persisten states for the instance level and the > nesting overhead is pretty low for memcg, this doesn't seem like a > problem to me. If this becomes a problem, we just need to fix it. > > A couple alternatives discussed are: > > a. Userspace keeps reusing the same cgroup for different instances of the > same service. This simplifies some aspects while making others more > complicated. e.g. Determining the current instance's CPU or IO usages now > require the monitoring software remembering what they were when this > instance started and calculating the deltas. Also, if some use cases want > to distinguish persistent vs. instance usages (more on this later), this > isn't gonna work. That said, this definitely is attractive in that it > miminizes overt user visible changes. > > b. Memory is disassociated rather than just reparented on cgroup destruction > and get re-charged to the next first user. This is attractive in that it > doesn't require any userspace changes; however, I'm not sure how this > would work for non-pageable memory usages such as bpf maps. How would we > detect the next first usage? > > > [2] Whether and how to solve first and second+ instance charge differences. > > If we take the layering approach, the first instance will get charged for > all memory that it uses while the second+ instances likely won't get charged > for a lot of persistent usages. I don't think there is a consensus on > whether this needs to be solved and I don't have enough context to form a > strong opinion. memcg folks are a lot better equipped to make this decision. > > Assuming this needs to be solved, here's a braindump to be taken with a big > pinch of salt: > > I have a bit of difficult time imagining a perfect solution given that > whether a given page cache page is persistent or not would be really > difficult to know (or maybe all page cache is persistent by default while > anon is not). However, the problem still seems worthwhile to consider for > big ticket items such as persistent tmpfs mounts and huge bpf maps as they > can easily make the differences really big. > > If we want to solve this problem, here are options that I can think of: > > a. Let userspace put the charges where they belong using the current > mechanisms. ie. Create persistent entities in the persistent parent > cgroup while there's no current instance. > > Pro: It won't require any major kernel or interface changes. There still > need to be some tweaking such as allowing tmpfs pages to be always > charged to the cgroup which created the instance (maybe as long as it's > an ancestor of the faulting cgroup?) but nothing too invasive. > > Con: It may not be flexible enough. > > b. Let userspace specify which cgroup to charge for some of constructs like > tmpfs and bpf maps. The key problems with this approach are > > 1. How to grant/deny what can be charged where. We must ensure that a > descendant can't move charges up or across the tree without the > ancestors allowing it. > > 2. How to specify the cgroup to charge. While specifying the target > cgroup directly might seem like an obvious solution, it has a couple > rather serious problems. First, if the descendant is inside a cgroup > namespace, it might be able to see the target cgroup at all. 
Second, > it's an interface which is likely to cause misunderstandings on how it > can be used. It's too broad an interface. > > One solution that I can think of is leveraging the resource domain > concept which is currently only used for threaded cgroups. All memory > usages of threaded cgroups are charged to their resource domain cgroup > which hosts the processes for those threads. The persistent usages have a > similar pattern, so maybe the service level cgroup can declare that it's > the encompassing resource domain and the instance cgroup can say whether > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > resource domain. > > This has the benefit that the user only needs to indicate its intention > without worrying about how cgroups are composed and what their IDs are. > It just indicates whether the given resource is persistent and if the > cgroup hierarchy is set up for that, it gets charged that way and if not > it can be just charged to itself. > > This is a shower thought but, if we allow nesting such domains (and maybe > name them), we can use it for shared resources too so that co-services > are put inside a shared slice and shared resources are pushed to the > slice level. > > This became pretty long. I obviously have a pretty strong bias towards > solving this within the current basic architecture but other than that most > of these decisions are best made by memcg folks. We can hopefully build some > consensus on the issue. > > Thanks. > > -- > tejun
On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote: > > (Sorry, this is a resend. I messed up the header in the first posting.) > > Hello, > > This thread started on a bpf-specific memory tracking change proposal and > went south, but a lot of people who would be interested are already cc'd, so > I'm hijacking it to discuss what to do w/ persistent memory usage tracking. > > Cc'ing Mina and Yosry who were involved in the discussions on the similar > problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and > experience than me, and Lennart for his thoughts from systemd side. > > The root problem is that there are resources (almost solely memory > currently) that outlive a given instance of a, to use systemd-lingo, > service. Page cache is the most common case. > > Let's say there's system.slice/hello.service. When it runs for the first > time, page cache backing its binary will be charged to hello.service. > However, when it restarts after e.g. a config change, when the initial > hello.service cgroup gets destroyed, we reparent the page cache charges to > the parent system.slice and when the second instance starts, its binary will > stay charged to system.slice. Over time, some may get reclaimed and > refaulted into the new hello.service but that's not guaranteed and most hot > pages likely won't. > > The same problem exists for any memory which is not freed synchronously when > the current instance exits. While this isn't a problem for many cases, it's > not difficult to imagine situations where the amount of memory which ends up > getting pushed to the parent is significant, even clear majority, with big > page cache footprint, persistent tmpfs instances and so on, creating issues > with accounting accuracy and thus control. > > I think there are two broad issues to discuss here: > > [1] Can this be solved by layering the instance cgroups under persistent > entity cgroup? > > So, instead of systemd.slice/hello.service, the application runs inside > something like systemd.slice/hello.service/hello.service.instance and the > service-level cgroup hello.service is not destroyed as long as it is > something worth tracking on the system. > > The benefits are > > a. While requiring changing how userland organizes cgroup hiearchy, it is a > straight-forward extension of the current architecture and doesn't > require any conceptual or structural changes. All the accounting and > control schemes work exactly the same as before. The only difference is > that we now have a persistent entity representing each service as we want > to track their persistent resource usages. > > b. Per-instance tracking and control is optional. To me, it seems that the > persistent resource usages would be more meaningful than per-instance and > tracking down to the persistent usages shouldn't add noticeable runtime > overheads while keeping per-instance process management niceties and > allowing use cases to opt-in for per-instance resource tracking and > control as needed. > > The complications are: > > a. It requires changing cgroup hierarchy in a very visible way. > > b. What should be the lifetime rules for persistent cgroups? Do we keep them > around forever or maybe they can be created on the first use and kept > around until the service is removed from the system? When the persistent > cgroup is removed, do we need to make sure that the remaining resource > usages are low enough? 
Note that this problem exists for any approach > that tries to track persistent usages no matter how it's done. > > c. Do we need to worry about nesting overhead? Given that there's no reason > to enable controllers w/o persisten states for the instance level and the > nesting overhead is pretty low for memcg, this doesn't seem like a > problem to me. If this becomes a problem, we just need to fix it. > > A couple alternatives discussed are: > > a. Userspace keeps reusing the same cgroup for different instances of the > same service. This simplifies some aspects while making others more > complicated. e.g. Determining the current instance's CPU or IO usages now > require the monitoring software remembering what they were when this > instance started and calculating the deltas. Also, if some use cases want > to distinguish persistent vs. instance usages (more on this later), this > isn't gonna work. That said, this definitely is attractive in that it > miminizes overt user visible changes. > > b. Memory is disassociated rather than just reparented on cgroup destruction > and get re-charged to the next first user. This is attractive in that it > doesn't require any userspace changes; however, I'm not sure how this > would work for non-pageable memory usages such as bpf maps. How would we > detect the next first usage? > > > [2] Whether and how to solve first and second+ instance charge differences. > > If we take the layering approach, the first instance will get charged for > all memory that it uses while the second+ instances likely won't get charged > for a lot of persistent usages. I don't think there is a consensus on > whether this needs to be solved and I don't have enough context to form a > strong opinion. memcg folks are a lot better equipped to make this decision. > The problem of first and second+ instance charges is something I've ran into when dealing with tmpfs charges, and something I believe Yosry ran into with regards to bpf as well, so AFAIU we think it's something worth addressing, but I'd also like to hear from other memcg folks if this is something that needs to be solved for them as well. For us this is something we want to fix because the memory usage of these services is hard to predict and limit since the memory usage of the first charger may be much greater than second+. > Assuming this needs to be solved, here's a braindump to be taken with a big > pinch of salt: > > I have a bit of difficult time imagining a perfect solution given that > whether a given page cache page is persistent or not would be really > difficult to know (or maybe all page cache is persistent by default while > anon is not). However, the problem still seems worthwhile to consider for > big ticket items such as persistent tmpfs mounts and huge bpf maps as they > can easily make the differences really big. > > If we want to solve this problem, here are options that I can think of: > > a. Let userspace put the charges where they belong using the current > mechanisms. ie. Create persistent entities in the persistent parent > cgroup while there's no current instance. > > Pro: It won't require any major kernel or interface changes. There still > need to be some tweaking such as allowing tmpfs pages to be always > charged to the cgroup which created the instance (maybe as long as it's > an ancestor of the faulting cgroup?) but nothing too invasive. > > Con: It may not be flexible enough. 
> I believe this works for us, provided memory is charged to the persistent parent entity directly and is not charged to the child instance. If memory is charged to the child instance first and then reparented, I think we still have the same problem. Because the first instance is charged for this memory but the second+ instances are not, making predicting the memory usage of such use cases difficult. > b. Let userspace specify which cgroup to charge for some of constructs like > tmpfs and bpf maps. The key problems with this approach are > > 1. How to grant/deny what can be charged where. We must ensure that a > descendant can't move charges up or across the tree without the > ancestors allowing it. > > 2. How to specify the cgroup to charge. While specifying the target > cgroup directly might seem like an obvious solution, it has a couple > rather serious problems. First, if the descendant is inside a cgroup > namespace, it might be able to see the target cgroup at all. Second, > it's an interface which is likely to cause misunderstandings on how it > can be used. It's too broad an interface. > This is pretty much the solution I sent out for review about a year ago and yes, it suffers from the issues you've brought up: https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ > One solution that I can think of is leveraging the resource domain > concept which is currently only used for threaded cgroups. All memory > usages of threaded cgroups are charged to their resource domain cgroup > which hosts the processes for those threads. The persistent usages have a > similar pattern, so maybe the service level cgroup can declare that it's > the encompassing resource domain and the instance cgroup can say whether > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > resource domain. > I think this sounds excellent and addresses our use cases. Basically the tmpfs/bpf memory would get charged to the encompassing resource domain cgroup rather than the instance cgroup, making the memory usage of the first and second+ instances consistent and predictable. Would love to hear from other memcg folks what they would think of such an approach. I would also love to hear what kind of interface you have in mind. Perhaps a cgroup tunable that says whether it's going to charge the tmpfs/bpf instance to itself or to the encompassing resource domain? > This has the benefit that the user only needs to indicate its intention > without worrying about how cgroups are composed and what their IDs are. > It just indicates whether the given resource is persistent and if the > cgroup hierarchy is set up for that, it gets charged that way and if not > it can be just charged to itself. > > This is a shower thought but, if we allow nesting such domains (and maybe > name them), we can use it for shared resources too so that co-services > are put inside a shared slice and shared resources are pushed to the > slice level. > > This became pretty long. I obviously have a pretty strong bias towards > solving this within the current basic architecture but other than that most > of these decisions are best made by memcg folks. We can hopefully build some > consensus on the issue. > > Thanks. > > -- > tejun >
On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote: > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote: > > b. Let userspace specify which cgroup to charge for some of constructs like > > tmpfs and bpf maps. The key problems with this approach are > > > > 1. How to grant/deny what can be charged where. We must ensure that a > > descendant can't move charges up or across the tree without the > > ancestors allowing it. > > > > 2. How to specify the cgroup to charge. While specifying the target > > cgroup directly might seem like an obvious solution, it has a couple > > rather serious problems. First, if the descendant is inside a cgroup > > namespace, it might be able to see the target cgroup at all. Second, > > it's an interface which is likely to cause misunderstandings on how it > > can be used. It's too broad an interface. > > > > This is pretty much the solution I sent out for review about a year > ago and yes, it suffers from the issues you've brought up: > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ > > > > One solution that I can think of is leveraging the resource domain > > concept which is currently only used for threaded cgroups. All memory > > usages of threaded cgroups are charged to their resource domain cgroup > > which hosts the processes for those threads. The persistent usages have a > > similar pattern, so maybe the service level cgroup can declare that it's > > the encompassing resource domain and the instance cgroup can say whether > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > > resource domain. > > > > I think this sounds excellent and addresses our use cases. Basically > the tmpfs/bpf memory would get charged to the encompassing resource > domain cgroup rather than the instance cgroup, making the memory usage > of the first and second+ instances consistent and predictable. > > Would love to hear from other memcg folks what they would think of > such an approach. I would also love to hear what kind of interface you > have in mind. Perhaps a cgroup tunable that says whether it's going to > charge the tmpfs/bpf instance to itself or to the encompassing > resource domain? I like this too. It makes shared charging predictable, with a coherent resource hierarchy (congruent OOM, CPU, IO domains), and without the need for cgroup paths in tmpfs mounts or similar. As far as who is declaring what goes, though: if the instance groups can declare arbitrary files/objects persistent or shared, they'd be able to abuse this and sneak private memory past local limits and burden the wider persistent/shared domain with it. I'm thinking it might make more sense for the service level to declare which objects are persistent and shared across instances. If that's the case, we may not need a two-component interface. Just the ability for an intermediate cgroup to say: "This object's future memory is to be charged to me, not the instantiating cgroup." Can we require a process in the intermediate cgroup to set up the file or object, and use madvise/fadvise to say "charge me", before any instances are launched?
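To make the "charge me" idea concrete, a purely hypothetical sketch (no such fadvise flag exists today; the flag name is invented for illustration): a setup process running in the intermediate/service cgroup creates the shared file and marks it so that future page charges go to that cgroup rather than to whichever instance later faults the pages in.

#include <fcntl.h>
#include <unistd.h>

/* POSIX_FADV_CHARGE_ME does not exist; invented for illustration only */
#define POSIX_FADV_CHARGE_ME 1024

int mark_persistent(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;
    /* hypothetical: "future memory of this file is charged to my cgroup" */
    posix_fadvise(fd, 0, 0, POSIX_FADV_CHARGE_ME);
    return fd;
}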
On Mon, Aug 22, 2022 at 2:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote: > > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote: > > > b. Let userspace specify which cgroup to charge for some of constructs like > > > tmpfs and bpf maps. The key problems with this approach are > > > > > > 1. How to grant/deny what can be charged where. We must ensure that a > > > descendant can't move charges up or across the tree without the > > > ancestors allowing it. > > > > > > 2. How to specify the cgroup to charge. While specifying the target > > > cgroup directly might seem like an obvious solution, it has a couple > > > rather serious problems. First, if the descendant is inside a cgroup > > > namespace, it might be able to see the target cgroup at all. Second, > > > it's an interface which is likely to cause misunderstandings on how it > > > can be used. It's too broad an interface. > > > > > > > This is pretty much the solution I sent out for review about a year > > ago and yes, it suffers from the issues you've brought up: > > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ > > > > > > > One solution that I can think of is leveraging the resource domain > > > concept which is currently only used for threaded cgroups. All memory > > > usages of threaded cgroups are charged to their resource domain cgroup > > > which hosts the processes for those threads. The persistent usages have a > > > similar pattern, so maybe the service level cgroup can declare that it's > > > the encompassing resource domain and the instance cgroup can say whether > > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > > > resource domain. > > > > > > > I think this sounds excellent and addresses our use cases. Basically > > the tmpfs/bpf memory would get charged to the encompassing resource > > domain cgroup rather than the instance cgroup, making the memory usage > > of the first and second+ instances consistent and predictable. > > > > Would love to hear from other memcg folks what they would think of > > such an approach. I would also love to hear what kind of interface you > > have in mind. Perhaps a cgroup tunable that says whether it's going to > > charge the tmpfs/bpf instance to itself or to the encompassing > > resource domain? > > I like this too. It makes shared charging predictable, with a coherent > resource hierarchy (congruent OOM, CPU, IO domains), and without the > need for cgroup paths in tmpfs mounts or similar. > > As far as who is declaring what goes, though: if the instance groups > can declare arbitrary files/objects persistent or shared, they'd be > able to abuse this and sneak private memory past local limits and > burden the wider persistent/shared domain with it. > > I'm thinking it might make more sense for the service level to declare > which objects are persistent and shared across instances. > > If that's the case, we may not need a two-component interface. Just > the ability for an intermediate cgroup to say: "This object's future > memory is to be charged to me, not the instantiating cgroup." > > Can we require a process in the intermediate cgroup to set up the file > or object, and use madvise/fadvise to say "charge me", before any > instances are launched? I think doing this on a file granularity makes it logistically hard to use, no? The service needs to create a file in the shared domain and all its instances need to re-use this exact same file. 
Our kubernetes use case from [1] shares a mount between subtasks rather than specific files. This allows subtasks to create files at will in the mount with the memory charged to the shared domain. I imagine this is more convenient than a shared file. Our other use case, which I hope to address here as well, is a service-client relationship from [1] where the service would like to charge per-client memory back to the client itself. In this case the service or client can create a mount from the shared domain and pass it to the service at which point the service is free to create/remove files in this mount as it sees fit. Would you be open to a per-mount interface rather than a per-file fadvise interface? Yosry, would a proposal like so be extensible to address the bpf charging issues? [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
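A hypothetical sketch of what the per-mount variant could look like from userspace (the "memcg=" option is illustrative of the direction of the linked proposal, not an existing upstream tmpfs option): the shared domain sets up the mount, and instances create files in it that get charged to the named cgroup.

#include <sys/mount.h>

int mount_shared_tmpfs(void)
{
    /* "memcg=" is hypothetical here; upstream tmpfs has no such option */
    return mount("tmpfs", "/mnt/shared", "tmpfs", 0,
                 "size=1G,memcg=/sys/fs/cgroup/system.slice/hello.service");
}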
On Mon, Aug 22, 2022 at 05:19:50PM -0400, Johannes Weiner wrote: > On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote: > > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote: > > > b. Let userspace specify which cgroup to charge for some of constructs like > > > tmpfs and bpf maps. The key problems with this approach are > > > > > > 1. How to grant/deny what can be charged where. We must ensure that a > > > descendant can't move charges up or across the tree without the > > > ancestors allowing it. > > > > > > 2. How to specify the cgroup to charge. While specifying the target > > > cgroup directly might seem like an obvious solution, it has a couple > > > rather serious problems. First, if the descendant is inside a cgroup > > > namespace, it might be able to see the target cgroup at all. Second, > > > it's an interface which is likely to cause misunderstandings on how it > > > can be used. It's too broad an interface. > > > > > > > This is pretty much the solution I sent out for review about a year > > ago and yes, it suffers from the issues you've brought up: > > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/ > > > > > > > One solution that I can think of is leveraging the resource domain > > > concept which is currently only used for threaded cgroups. All memory > > > usages of threaded cgroups are charged to their resource domain cgroup > > > which hosts the processes for those threads. The persistent usages have a > > > similar pattern, so maybe the service level cgroup can declare that it's > > > the encompassing resource domain and the instance cgroup can say whether > > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > > > resource domain. > > > > > > > I think this sounds excellent and addresses our use cases. Basically > > the tmpfs/bpf memory would get charged to the encompassing resource > > domain cgroup rather than the instance cgroup, making the memory usage > > of the first and second+ instances consistent and predictable. > > > > Would love to hear from other memcg folks what they would think of > > such an approach. I would also love to hear what kind of interface you > > have in mind. Perhaps a cgroup tunable that says whether it's going to > > charge the tmpfs/bpf instance to itself or to the encompassing > > resource domain? > > I like this too. It makes shared charging predictable, with a coherent > resource hierarchy (congruent OOM, CPU, IO domains), and without the > need for cgroup paths in tmpfs mounts or similar. > > As far as who is declaring what goes, though: if the instance groups > can declare arbitrary files/objects persistent or shared, they'd be > able to abuse this and sneak private memory past local limits and > burden the wider persistent/shared domain with it. > > I'm thinking it might make more sense for the service level to declare > which objects are persistent and shared across instances. I like this idea. > > If that's the case, we may not need a two-component interface. Just > the ability for an intermediate cgroup to say: "This object's future > memory is to be charged to me, not the instantiating cgroup." > > Can we require a process in the intermediate cgroup to set up the file > or object, and use madvise/fadvise to say "charge me", before any > instances are launched? We need to think how to make this interface convenient to use. First, these persistent resources are likely created by some agent software, not the main workload. 
So the requirement to call madvise() from the actual cgroup might not be easily achievable. So _maybe_ something like writing a fd into cgroup.memory.resources.

Second, it would be really useful to export the current configuration to userspace. E.g. a user should be able to query to which cgroup a given bpf map "belongs" and which bpf maps belong to a given cgroup. Otherwise it will create a problem for userspace programs which manage cgroups (e.g. systemd): they should be able to restore the current configuration from the kernel state, without "remembering" what has been configured before. Thanks!
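To illustrate the shape of the interface being floated here: cgroup.memory.resources does not exist and its semantics are purely speculative, but an agent-side sketch might look like the following. bpf_obj_get() is the real libbpf call for pinned objects; the pinned path and cgroup path are assumptions.

  #include <stdio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <bpf/bpf.h>

  int main(void)
  {
      /* Agent-side: grab an fd for the persistent resource ... */
      int map_fd = bpf_obj_get("/sys/fs/bpf/my_map");   /* pinned map, path made up */
      /* ... and a handle on the hypothetical interface file. */
      int cg_fd = open("/sys/fs/cgroup/hello.service/cgroup.memory.resources",
                       O_WRONLY);                        /* file does not exist today */

      if (map_fd < 0 || cg_fd < 0)
          return 1;

      /* Tell the service-level cgroup to own this object's charges. */
      dprintf(cg_fd, "%d\n", map_fd);

      close(cg_fd);
      close(map_fd);
      return 0;
  }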
Hello, On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote: > > > > One solution that I can think of is leveraging the resource domain > > > > concept which is currently only used for threaded cgroups. All memory > > > > usages of threaded cgroups are charged to their resource domain cgroup > > > > which hosts the processes for those threads. The persistent usages have a > > > > similar pattern, so maybe the service level cgroup can declare that it's > > > > the encompassing resource domain and the instance cgroup can say whether > > > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > > > > resource domain. > > > > > > > > > > I think this sounds excellent and addresses our use cases. Basically > > > the tmpfs/bpf memory would get charged to the encompassing resource > > > domain cgroup rather than the instance cgroup, making the memory usage > > > of the first and second+ instances consistent and predictable. > > > > > > Would love to hear from other memcg folks what they would think of > > > such an approach. I would also love to hear what kind of interface you > > > have in mind. Perhaps a cgroup tunable that says whether it's going to > > > charge the tmpfs/bpf instance to itself or to the encompassing > > > resource domain? > > > > I like this too. It makes shared charging predictable, with a coherent > > resource hierarchy (congruent OOM, CPU, IO domains), and without the > > need for cgroup paths in tmpfs mounts or similar. > > > > As far as who is declaring what goes, though: if the instance groups > > can declare arbitrary files/objects persistent or shared, they'd be > > able to abuse this and sneak private memory past local limits and > > burden the wider persistent/shared domain with it. My thought was that the persistent cgroup and instance cgroups should belong to the same trust domain and system level control should be applied at the resource domain level. The application may decide to shift between persistent and per-instance however it wants to and may even configure resource control at that level but all that's for its own accounting accuracy and benefit. > > I'm thinking it might make more sense for the service level to declare > > which objects are persistent and shared across instances. > > I like this idea. > > > If that's the case, we may not need a two-component interface. Just > > the ability for an intermediate cgroup to say: "This object's future > > memory is to be charged to me, not the instantiating cgroup." > > > > Can we require a process in the intermediate cgroup to set up the file > > or object, and use madvise/fadvise to say "charge me", before any > > instances are launched? > > We need to think how to make this interface convenient to use. > First, these persistent resources are likely created by some agent software, > not the main workload. So the requirement to call madvise() from the > actual cgroup might not be easily achievable. So one worry that I have for this is that it requires the application itself to be aware of cgroup topologies and restructure itself so that allocation of those resources is factored out into something else. Maybe that's not a huge problem but it may limit its applicability quite a bit. If we can express all the resource constraints and structures in the cgroup side and configured by the management agent, the application can simply e.g. madvise whatever memory region or flag bpf maps as "these are persistent" and the rest can be handled by the system.
If the agent set up the environment for that, it gets accounted accordingly; otherwise, it'd behave as if those tagging didn't exist. Asking the application to set up all its resources in separate steps, that might require significant restructuring and knowledge of how the hierarchy is setup in many cases. > So _maybe_ something like writing a fd into cgroup.memory.resources. > > Second, it would be really useful to export the current configuration > to userspace. E.g. a user should be able to query to which cgroup the given > bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise > it will create a problem for userspace programs which manage cgroups > (e.g. systemd): they should be able to restore the current configuration > from the kernel state, without "remembering" what has been configured > before. This too can be achieved by separating out cgroup setup and tagging specific resources. Agent and cgroup know what each cgroup is supposed to do as they already do now and each resource is tagged whether they're persistent or not, so everything is always known without the agent and the application having to explicitly share the information. Thanks.
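As a purely illustrative sketch of the "application just tags what is persistent" idea: neither MADV_PERSISTENT nor BPF_F_PERSISTENT exists, the values below are placeholders, and the only point is that the application states intent while the agent-configured hierarchy decides where the charge actually lands.

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sys/mman.h>
  #include <bpf/bpf.h>

  /* Placeholders only: no such advice value or map flag exists today. */
  #define MADV_PERSISTENT  240
  #define BPF_F_PERSISTENT (1U << 15)

  int main(void)
  {
      /* "This region's memory belongs to the persistent domain." */
      void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf != MAP_FAILED)
          madvise(buf, 1 << 20, MADV_PERSISTENT);   /* hypothetical hint */

      /* "Charge this map to the persistent domain, if one is configured." */
      LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_PERSISTENT);
      int map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "flow_map",
                                  sizeof(__u32), sizeof(__u64), 65536, &opts);

      printf("map_fd=%d (expected to fail until such a flag exists)\n", map_fd);
      return 0;
  }

If the hierarchy has no persistent domain set up, the hints would simply be no-ops, which is the "behave as if those tagging didn't exist" behavior described above.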
On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote: > > (Sorry, this is a resend. I messed up the header in the first posting.) > > Hello, > > This thread started on a bpf-specific memory tracking change proposal and > went south, but a lot of people who would be interested are already cc'd, so > I'm hijacking it to discuss what to do w/ persistent memory usage tracking. > > Cc'ing Mina and Yosry who were involved in the discussions on the similar > problem re. tmpfs, Dan Schatzberg who has a lot more prod knowledge and > experience than me, and Lennart for his thoughts from systemd side. > > The root problem is that there are resources (almost solely memory > currently) that outlive a given instance of a, to use systemd-lingo, > service. Page cache is the most common case. > > Let's say there's system.slice/hello.service. When it runs for the first > time, page cache backing its binary will be charged to hello.service. > However, when it restarts after e.g. a config change, when the initial > hello.service cgroup gets destroyed, we reparent the page cache charges to > the parent system.slice and when the second instance starts, its binary will > stay charged to system.slice. Over time, some may get reclaimed and > refaulted into the new hello.service but that's not guaranteed and most hot > pages likely won't. > > The same problem exists for any memory which is not freed synchronously when > the current instance exits. While this isn't a problem for many cases, it's > not difficult to imagine situations where the amount of memory which ends up > getting pushed to the parent is significant, even clear majority, with big > page cache footprint, persistent tmpfs instances and so on, creating issues > with accounting accuracy and thus control. > > I think there are two broad issues to discuss here: > > [1] Can this be solved by layering the instance cgroups under persistent > entity cgroup? > Hi, Below is some background of kubernetes. In kubernetes, a pod is organized as follows, pod |- Container |- Container IOW, it is a two-layer unit, or a two-layer instance. The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver. Once the old pod is destroyed (that can happen when the user wants to update their service), the new pod will have a different UUID. That said, different instances will have different cgroup dir. If we want to introduce a persistent entity cgroup, we have to make it a three-layer unit. persistent-entity |- pod |- Container |- Container There will be some issues, 1. The kuber-apiserver must maintain the persistent-entity on each host. It needs a great refactor and the compatibility is also a problem per my discussion with kubernetes experts. 2. How to do the monitor? If there's only one pod under this persistent-entity, we can easily get the memory size of shared resources by: Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod) But what if it has N pods and N is dynamically changed ? 3. What if it has more than one shared resource? For example, pod-foo has two shared resources A and B, pod-bar has two shared resources A and C, and another pod has two shared resources B and C. How to deploy them? Pls, note that we can introduce multiple-layer persistent-entity, but which one should be the parent ? So from my perspective, it is almost impossible. 
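On the monitoring question above: if the layering is in place, the arithmetic can be done from the cgroup files the kernel already exposes. A minimal sketch, assuming cgroup2 (memory.current is hierarchical, so the persistent entity's figure already includes any live pod nested under it; cgroup1's memory.usage_in_bytes with use_hierarchy is roughly analogous) and a made-up persistent-entity/pod-<uuid> layout:

  #include <stdio.h>

  static long long read_ll(const char *path)
  {
      long long val = -1;
      FILE *f = fopen(path, "r");

      if (f) {
          if (fscanf(f, "%lld", &val) != 1)
              val = -1;
          fclose(f);
      }
      return val;
  }

  int main(void)
  {
      /* Hierarchical usage of the persistent entity includes its pods. */
      long long entity = read_ll("/sys/fs/cgroup/persistent-entity/memory.current");
      long long pod = read_ll("/sys/fs/cgroup/persistent-entity/pod-1234/memory.current");

      if (entity < 0 || pod < 0)
          return 1;

      /* What remains is the reparented / persistent share.  With N live
       * pods, subtract each pod's memory.current in turn. */
      printf("shared/persistent bytes: %lld\n", entity - pod);
      return 0;
  }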
> So, instead of systemd.slice/hello.service, the application runs inside > something like systemd.slice/hello.service/hello.service.instance and the > service-level cgroup hello.service is not destroyed as long as it is > something worth tracking on the system. > > The benefits are > > a. While requiring changing how userland organizes cgroup hiearchy, it is a > straight-forward extension of the current architecture and doesn't > require any conceptual or structural changes. All the accounting and > control schemes work exactly the same as before. The only difference is > that we now have a persistent entity representing each service as we want > to track their persistent resource usages. > > b. Per-instance tracking and control is optional. To me, it seems that the > persistent resource usages would be more meaningful than per-instance and > tracking down to the persistent usages shouldn't add noticeable runtime > overheads while keeping per-instance process management niceties and > allowing use cases to opt-in for per-instance resource tracking and > control as needed. > > The complications are: > > a. It requires changing cgroup hierarchy in a very visible way. > > b. What should be the lifetime rules for persistent cgroups? Do we keep them > around forever or maybe they can be created on the first use and kept > around until the service is removed from the system? When the persistent > cgroup is removed, do we need to make sure that the remaining resource > usages are low enough? Note that this problem exists for any approach > that tries to track persistent usages no matter how it's done. > > c. Do we need to worry about nesting overhead? Given that there's no reason > to enable controllers w/o persisten states for the instance level and the > nesting overhead is pretty low for memcg, this doesn't seem like a > problem to me. If this becomes a problem, we just need to fix it. > > A couple alternatives discussed are: > > a. Userspace keeps reusing the same cgroup for different instances of the > same service. This simplifies some aspects while making others more > complicated. e.g. Determining the current instance's CPU or IO usages now > require the monitoring software remembering what they were when this > instance started and calculating the deltas. Also, if some use cases want > to distinguish persistent vs. instance usages (more on this later), this > isn't gonna work. That said, this definitely is attractive in that it > miminizes overt user visible changes. > > b. Memory is disassociated rather than just reparented on cgroup destruction > and get re-charged to the next first user. This is attractive in that it > doesn't require any userspace changes; however, I'm not sure how this > would work for non-pageable memory usages such as bpf maps. How would we > detect the next first usage? > JFYI, There is a reuse path for the bpf map, see my previous RFC[1]. [1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/ > > [2] Whether and how to solve first and second+ instance charge differences. > > If we take the layering approach, the first instance will get charged for > all memory that it uses while the second+ instances likely won't get charged > for a lot of persistent usages. I don't think there is a consensus on > whether this needs to be solved and I don't have enough context to form a > strong opinion. memcg folks are a lot better equipped to make this decision. > Just sharing our practice. 
For many of our users, it means the memcg is unreliable (at least for the observability) when the memory usage is inconsistent. So they prefer to drop the page cache (by echoing memory.force_empty) when the container (pod) is destroyed, at the cost of taking time to reload these page caches next time. Reliability is more important than performance. > Assuming this needs to be solved, here's a braindump to be taken with a big > pinch of salt: > > I have a bit of difficult time imagining a perfect solution given that > whether a given page cache page is persistent or not would be really > difficult to know (or maybe all page cache is persistent by default while > anon is not). However, the problem still seems worthwhile to consider for > big ticket items such as persistent tmpfs mounts and huge bpf maps as they > can easily make the differences really big. > > If we want to solve this problem, here are options that I can think of: > > a. Let userspace put the charges where they belong using the current > mechanisms. ie. Create persistent entities in the persistent parent > cgroup while there's no current instance. > > Pro: It won't require any major kernel or interface changes. There still > need to be some tweaking such as allowing tmpfs pages to be always > charged to the cgroup which created the instance (maybe as long as it's > an ancestor of the faulting cgroup?) but nothing too invasive. > > Con: It may not be flexible enough. > > b. Let userspace specify which cgroup to charge for some of constructs like > tmpfs and bpf maps. The key problems with this approach are > > 1. How to grant/deny what can be charged where. We must ensure that a > descendant can't move charges up or across the tree without the > ancestors allowing it. > We can add restrictions to check which memcg can be selected (regarding the selectable memcg). But I think it may be too early to do the restrictions, as only the privileged user can set it. It is the sys admin's responsbility to select a proper memcg. That said, the selectable memcg is not going south. > 2. How to specify the cgroup to charge. While specifying the target > cgroup directly might seem like an obvious solution, it has a couple > rather serious problems. First, if the descendant is inside a cgroup > namespace, it might be able to see the target cgroup at all. It is not a problem. Just sharing our practice below. $ docker run -tid --privileged \ --mount type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \ --mount type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \ docker-image The bind-mount can make it work. > Second, > it's an interface which is likely to cause misunderstandings on how it > can be used. It's too broad an interface. > As I said above, we just need some restrictions or guidance if that is desired now. > One solution that I can think of is leveraging the resource domain > concept which is currently only used for threaded cgroups. All memory > usages of threaded cgroups are charged to their resource domain cgroup > which hosts the processes for those threads. The persistent usages have a > similar pattern, so maybe the service level cgroup can declare that it's > the encompassing resource domain and the instance cgroup can say whether > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > resource domain. > > This has the benefit that the user only needs to indicate its intention > without worrying about how cgroups are composed and what their IDs are. 
> It just indicates whether the given resource is persistent and if the > cgroup hierarchy is set up for that, it gets charged that way and if not > it can be just charged to itself. > > This is a shower thought but, if we allow nesting such domains (and maybe > name them), we can use it for shared resources too so that co-services > are put inside a shared slice and shared resources are pushed to the > slice level. > > This became pretty long. I obviously have a pretty strong bias towards > solving this within the current basic architecture but other than that most > of these decisions are best made by memcg folks. We can hopefully build some > consensus on the issue. > JFYI. Apologize in advance if my words offend you again.
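For reference, the memory.force_empty practice mentioned earlier in this message is the cgroup1 memcg knob: writing any value asks the kernel to reclaim as many of the group's charged pages as possible, and it is typically used right before removing the cgroup so the charges are not silently reparented. The pod path below is made up.

  #include <stdio.h>

  int main(void)
  {
      FILE *f = fopen("/sys/fs/cgroup/memory/kubepods/pod-1234/memory.force_empty", "w");

      if (!f)
          return 1;

      /* Drop the page cache charges before rmdir of the pod cgroup. */
      fputs("0\n", f);
      fclose(f);
      return 0;
  }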
Hello, On Tue, Aug 23, 2022 at 07:08:17PM +0800, Yafang Shao wrote: > On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote: > > [1] Can this be solved by layering the instance cgroups under persistent > > entity cgroup? > > Below is some background of kubernetes. > In kubernetes, a pod is organized as follows, > > pod > |- Container > |- Container > > IOW, it is a two-layer unit, or a two-layer instance. > The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver. > Once the old pod is destroyed (that can happen when the user wants to > update their service), the new pod will have a different UUID. > That said, different instances will have different cgroup dir. > > If we want to introduce a persistent entity cgroup, we have to make > it a three-layer unit. > > persistent-entity > |- pod > |- Container > |- Container > > There will be some issues, > 1. The kuber-apiserver must maintain the persistent-entity on each host. > It needs a great refactor and the compatibility is also a problem > per my discussion with kubernetes experts. This is gonna be true for anybody. The basic strategy here should be defining a clear boundary between system agent and applications so that individual applications, even when they build their own subhierarchy internally, aren't affected by system level hierarchy configuration changes. systemd is already like that with clear delegation boundary. Our (fb) container management is like that too. So, while this requires some reorganization from the agent side, things like this don't create huge backward compatbility issues involving applications. I have no idea about k8s, so the situation may differ but in the long term at least, it'd be a good idea to build in similar conceptual separation even if it stays with cgroup1. > 2. How to do the monitor? > If there's only one pod under this persistent-entity, we can > easily get the memory size of shared resources by: > Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod) > But what if it has N pods and N is dynamically changed ? There should only be one live pod instance inside the pod's persistent cgroup, so the calculation doesn't really change. > 3. What if it has more than one shared resource? > For example, pod-foo has two shared resources A and B, pod-bar > has two shared resources A and C, and another pod has two shared > resources B and C. > How to deploy them? > Pls, note that we can introduce multiple-layer persistent-entity, > but which one should be the parent ? > > So from my perspective, it is almost impossible. Yeah, this is a different problem and cgroup has never been good at tracking resources shared across multiple groups. There is always tension between overhead, complexity and accuracy and the tradeoff re. resource sharing has almost always been towards the former two. Naming specific resources and designating them as being shared and accounting them commonly somehow does make sense as an approach as we get to avoid the biggest headaches (e.g. how to split a page cache page?) and maybe it can even be claimed that a lot of use cases which may want cross-group sharing can be sufficiently served by such approach. That said, it still has to be balanced against other factors. For example, memory pressure caused by the shared resources should affect all participants in the sharing group in terms of both memory and IO, which means that they'll have to stay within a nested subtree structure. 
This does restrict overlapping partial sharing that you described above but the goal is finding a reasonable tradeoff, so something has to give. > > b. Memory is disassociated rather than just reparented on cgroup destruction > > and get re-charged to the next first user. This is attractive in that it > > doesn't require any userspace changes; however, I'm not sure how this > > would work for non-pageable memory usages such as bpf maps. How would we > > detect the next first usage? > > > > JFYI, There is a reuse path for the bpf map, see my previous RFC[1]. > [1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/ I'm not a big fan of explicit recharging. It's too hairy to use requiring mixing system level hierarchy configuration knoweldge with in-application resource handling. There should be clear isolation between the two. This is also what leads to namespace and visibility issues. Ideally, these should be handled by structuring the resource hierarchay correctly from the get-go. ... > > b. Let userspace specify which cgroup to charge for some of constructs like > > tmpfs and bpf maps. The key problems with this approach are > > > > 1. How to grant/deny what can be charged where. We must ensure that a > > descendant can't move charges up or across the tree without the > > ancestors allowing it. > > > > We can add restrictions to check which memcg can be selected > (regarding the selectable memcg). > But I think it may be too early to do the restrictions, as only the > privileged user can set it. > It is the sys admin's responsbility to select a proper memcg. > That said, the selectable memcg is not going south. I generally tend towards shaping interfaces more carefully. We can of course add a do-whatever-you-want interface and declare that it's for root only but even that becomes complicated with things like userns as we're finding out in different areas, but again, nothing is free and approaches like that often bring more longterm headaches than the problems they solve. > > 2. How to specify the cgroup to charge. While specifying the target > > cgroup directly might seem like an obvious solution, it has a couple > > rather serious problems. First, if the descendant is inside a cgroup > > namespace, it might be able to see the target cgroup at all. > > It is not a problem. Just sharing our practice below. > $ docker run -tid --privileged \ > --mount > type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \ > --mount > type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \ > docker-image > > The bind-mount can make it work. Not mount namespace. cgroup namespace which is used to isolate the cgroup subtree that's visible to each container. Please take a look at cgroup_namespaces(7). > > Second, > > it's an interface which is likely to cause misunderstandings on how it > > can be used. It's too broad an interface. > > > > As I said above, we just need some restrictions or guidance if that is > desired now. I'm really unlikely to go down that path because as I said before I believe that that's a road to long term headaches and already creates immediate problems in terms of how it'd interact with other resources. No matter what approach we may choose, it has to work with the existing resource hierarchy. Otherwise, we end up in a situation where we have to split accounting and control of other controllers too which makes little sense for non-memory controllers (not that it's great for memory in the first place). Thanks.
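A quick way to see the visibility issue Tejun is pointing at: after unshare(CLONE_NEWCGROUP) (root/CAP_SYS_ADMIN required), cgroup paths are reported relative to the namespace root, so a containerized task has no path with which it could name a cgroup outside its own subtree in an interface file. Sketch only:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>

  int main(void)
  {
      char line[256];
      FILE *f;

      if (unshare(CLONE_NEWCGROUP)) {
          perror("unshare");
          return 1;
      }

      f = fopen("/proc/self/cgroup", "r");
      if (!f)
          return 1;

      /* Typically prints "0::/": the caller's own cgroup is now the root
       * of everything it can see and name. */
      while (fgets(line, sizeof(line), f))
          fputs(line, stdout);

      fclose(f);
      return 0;
  }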
On Wed, Aug 24, 2022 at 1:12 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Tue, Aug 23, 2022 at 07:08:17PM +0800, Yafang Shao wrote: > > On Mon, Aug 22, 2022 at 7:29 PM Tejun Heo <tj@kernel.org> wrote: > > > [1] Can this be solved by layering the instance cgroups under persistent > > > entity cgroup? > > > > Below is some background of kubernetes. > > In kubernetes, a pod is organized as follows, > > > > pod > > |- Container > > |- Container > > > > IOW, it is a two-layer unit, or a two-layer instance. > > The cgroup dir of the pod is named with a UUID assigned by kubernetes-apiserver. > > Once the old pod is destroyed (that can happen when the user wants to > > update their service), the new pod will have a different UUID. > > That said, different instances will have different cgroup dir. > > > > If we want to introduce a persistent entity cgroup, we have to make > > it a three-layer unit. > > > > persistent-entity > > |- pod > > |- Container > > |- Container > > > > There will be some issues, > > 1. The kuber-apiserver must maintain the persistent-entity on each host. > > It needs a great refactor and the compatibility is also a problem > > per my discussion with kubernetes experts. > > This is gonna be true for anybody. The basic strategy here should be > defining a clear boundary between system agent and applications so that > individual applications, even when they build their own subhierarchy > internally, aren't affected by system level hierarchy configuration changes. > systemd is already like that with clear delegation boundary. Our (fb) > container management is like that too. So, while this requires some > reorganization from the agent side, things like this don't create huge > backward compatbility issues involving applications. I have no idea about > k8s, so the situation may differ but in the long term at least, it'd be a > good idea to build in similar conceptual separation even if it stays with > cgroup1. > > > 2. How to do the monitor? > > If there's only one pod under this persistent-entity, we can > > easily get the memory size of shared resources by: > > Sizeof(shared-resources) = Sizeof(persistent-entity) - Sizeof(pod) > > But what if it has N pods and N is dynamically changed ? > > There should only be one live pod instance inside the pod's persistent > cgroup, so the calculation doesn't really change. > > > 3. What if it has more than one shared resource? > > For example, pod-foo has two shared resources A and B, pod-bar > > has two shared resources A and C, and another pod has two shared > > resources B and C. > > How to deploy them? > > Pls, note that we can introduce multiple-layer persistent-entity, > > but which one should be the parent ? > > > > So from my perspective, it is almost impossible. > > Yeah, this is a different problem and cgroup has never been good at tracking > resources shared across multiple groups. There is always tension between > overhead, complexity and accuracy and the tradeoff re. resource sharing has > almost always been towards the former two. > > Naming specific resources and designating them as being shared and > accounting them commonly somehow does make sense as an approach as we get to > avoid the biggest headaches (e.g. how to split a page cache page?) and maybe > it can even be claimed that a lot of use cases which may want cross-group > sharing can be sufficiently served by such approach. > > That said, it still has to be balanced against other factors. 
For example, > memory pressure caused by the shared resources should affect all > participants in the sharing group in terms of both memory and IO, which > means that they'll have to stay within a nested subtree structure. This does > restrict overlapping partial sharing that you described above but the goal > is finding a reasonable tradeoff, so something has to give. > > > > b. Memory is disassociated rather than just reparented on cgroup destruction > > > and get re-charged to the next first user. This is attractive in that it > > > doesn't require any userspace changes; however, I'm not sure how this > > > would work for non-pageable memory usages such as bpf maps. How would we > > > detect the next first usage? > > > > > > > JFYI, There is a reuse path for the bpf map, see my previous RFC[1]. > > [1] https://lore.kernel.org/bpf/20220619155032.32515-1-laoar.shao@gmail.com/ > > I'm not a big fan of explicit recharging. It's too hairy to use requiring > mixing system level hierarchy configuration knoweldge with in-application > resource handling. There should be clear isolation between the two. This is > also what leads to namespace and visibility issues. Ideally, these should be > handled by structuring the resource hierarchay correctly from the get-go. > > ... > > > b. Let userspace specify which cgroup to charge for some of constructs like > > > tmpfs and bpf maps. The key problems with this approach are > > > > > > 1. How to grant/deny what can be charged where. We must ensure that a > > > descendant can't move charges up or across the tree without the > > > ancestors allowing it. > > > > > > > We can add restrictions to check which memcg can be selected > > (regarding the selectable memcg). > > But I think it may be too early to do the restrictions, as only the > > privileged user can set it. > > It is the sys admin's responsbility to select a proper memcg. > > That said, the selectable memcg is not going south. > > I generally tend towards shaping interfaces more carefully. We can of course > add a do-whatever-you-want interface and declare that it's for root only but > even that becomes complicated with things like userns as we're finding out > in different areas, but again, nothing is free and approaches like that > often bring more longterm headaches than the problems they solve. > > > > 2. How to specify the cgroup to charge. While specifying the target > > > cgroup directly might seem like an obvious solution, it has a couple > > > rather serious problems. First, if the descendant is inside a cgroup > > > namespace, it might be able to see the target cgroup at all. > > > > It is not a problem. Just sharing our practice below. > > $ docker run -tid --privileged \ > > --mount > > type=bind,source=/sys/fs/bpf,target=/sys/fs/bpf \ > > --mount > > type=bind,source=/sys/fs/cgroup/memory/bpf,target=/bpf-memcg \ > > docker-image > > > > The bind-mount can make it work. > > Not mount namespace. cgroup namespace which is used to isolate the cgroup > subtree that's visible to each container. Please take a look at > cgroup_namespaces(7). > IIUC, we can get the target cgroup with a relative path if cgroup namespace is enabled, for example ../bpf, right ? (If not, then I think we can extend it.) > > > Second, > > > it's an interface which is likely to cause misunderstandings on how it > > > can be used. It's too broad an interface. > > > > > > > As I said above, we just need some restrictions or guidance if that is > > desired now. 
> > I'm really unlikely to go down that path because as I said before I believe > that that's a road to long term headaches and already creates immediate > problems in terms of how it'd interact with other resources. No matter what > approach we may choose, it has to work with the existing resource hierarchy. Unfortunately, the bpf map has already broken and is still breaking the existing resource hierarchy. What I'm doing is improving or even fixing this breakage. The reason I say it is breaking the existing resource hierarchy is that the bpf map is improperly treated as a process while it is really a shared resource. A bpf map can be written by processes running in other memcgs, but the memory allocated by those writes won't be charged to the writer's memcg; it will be charged to the bpf map's memcg. That's why I try to convey to you that what I'm trying to fix is a bpf-specific issue. > Otherwise, we end up in a situation where we have to split accounting and > control of other controllers too which makes little sense for non-memory > controllers (not that it's great for memory in the first place). > If the resource hierarchy is what concerns you, then I think it can be addressed with the additional change below,

  if (cgroup_is_not_ancestor(cgrp))
          return -EINVAL;
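For what such a restriction might look like in kernel terms: cgroup_is_descendant() and task_dfl_cgroup() are existing helpers, while the surrounding "selectable memcg" plumbing (where the target cgroup comes from, reference counting, locking) is hypothetical and elided. A sketch only:

  #include <linux/cgroup.h>
  #include <linux/errno.h>
  #include <linux/sched.h>

  /* Reject a requested charge target unless it is the caller's own cgroup
   * or one of its ancestors, so charges can't be pushed across the tree. */
  static int check_charge_target(struct cgroup *target)
  {
      struct cgroup *cur = task_dfl_cgroup(current);

      if (!cgroup_is_descendant(cur, target))
          return -EINVAL;

      return 0;
  }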
On Mon, Aug 22, 2022 at 8:14 PM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote: > > > > > One solution that I can think of is leveraging the resource domain > > > > > concept which is currently only used for threaded cgroups. All memory > > > > > usages of threaded cgroups are charged to their resource domain cgroup > > > > > which hosts the processes for those threads. The persistent usages have a > > > > > similar pattern, so maybe the service level cgroup can declare that it's > > > > > the encompassing resource domain and the instance cgroup can say whether > > > > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing > > > > > resource domain. > > > > > > > > > > > > > I think this sounds excellent and addresses our use cases. Basically > > > > the tmpfs/bpf memory would get charged to the encompassing resource > > > > domain cgroup rather than the instance cgroup, making the memory usage > > > > of the first and second+ instances consistent and predictable. > > > > > > > > Would love to hear from other memcg folks what they would think of > > > > such an approach. I would also love to hear what kind of interface you > > > > have in mind. Perhaps a cgroup tunable that says whether it's going to > > > > charge the tmpfs/bpf instance to itself or to the encompassing > > > > resource domain? > > > > > > I like this too. It makes shared charging predictable, with a coherent > > > resource hierarchy (congruent OOM, CPU, IO domains), and without the > > > need for cgroup paths in tmpfs mounts or similar. > > > > > > As far as who is declaring what goes, though: if the instance groups > > > can declare arbitrary files/objects persistent or shared, they'd be > > > able to abuse this and sneak private memory past local limits and > > > burden the wider persistent/shared domain with it. > > My thought was that the persistent cgroup and instance cgroups should belong > to the same trust domain and system level control should be applied at the > resource domain level. The application may decide to shift between > persistent and per-instance however it wants to and may even configure > resource control at that level but all that's for its own accounting > accuracy and benefit. > > > > I'm thinking it might make more sense for the service level to declare > > > which objects are persistent and shared across instances. > > > > I like this idea. > > > > > If that's the case, we may not need a two-component interface. Just > > > the ability for an intermediate cgroup to say: "This object's future > > > memory is to be charged to me, not the instantiating cgroup." > > > > > > Can we require a process in the intermediate cgroup to set up the file > > > or object, and use madvise/fadvise to say "charge me", before any > > > instances are launched? > > > > We need to think how to make this interface convenient to use. > > First, these persistent resources are likely created by some agent software, > > not the main workload. So the requirement to call madvise() from the > > actual cgroup might be not easily achievable. > > So one worry that I have for this is that it requires the application itself > to be aware of cgroup topolgies and restructure itself so that allocation of > those resources are factored out into something else. Maybe that's not a > huge problem but it may limit its applicability quite a bit. > I agree with this point 100%. 
The interfaces being discussed here require existing applications restructuring themselves which I don't imagine will be very useful for us at least. > If we can express all the resource contraints and structures in the cgroup > side and configured by the management agent, the application can simply e.g. > madvise whatever memory region or flag bpf maps as "these are persistent" > and the rest can be handled by the system. If the agent set up the > environment for that, it gets accounted accordingly; otherwise, it'd behave > as if those tagging didn't exist. Asking the application to set up all its > resources in separate steps, that might require significant restructuring > and knowledge of how the hierarchy is setup in many cases. > I don't know if this level of granularity is needed with a madvise() or such. The kernel knows whether resources are persistent due to the nature of the resource. For example a shared tmpfs file is a resource that is persistent and not cleaned up after the process using it dies, but private memory is. madvise(PERSISTENT) on private memory would not make sense, and I don't think madvise(NOT_PERSISTENT) on tmpfs-backed memory region would make sense. Also, this requires adding madvise() hints in userspace code to leverage this. > > So _maybe_ something like writing a fd into cgroup.memory.resources. > > Sorry, I don't see this being useful - to us at least - if it is an fd-based interface. It needs to support marking entire tmpfs mounts as persistent. The reasoning is as Tejun alludes to: our container management agent generally sets up the container hierarchy for a job and also sets up the filesystem mounts the job requests. This is generally because the job doesn't run as root and doesn't bother with mount namespaces. Thus, our jobs are well-trained in receiving mounts set up for them from our container management agent. Jobs are _not_ well-trained in receiving an fd from the container management agent, and restructuring our jobs/services for such an interface will not be feasible I think. This applies to us but I imagine it is common practice for the container management agent to set up mounts for the jobs to use, rather than provide the job with an fd or collection of fds. > > Second, it would be really useful to export the current configuration > > to userspace. E.g. a user should be able to query to which cgroup the given > > bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise > > it will create a problem for userspace programs which manage cgroups > > (e.g. systemd): they should be able to restore the current configuration > > from the kernel state, without "remembering" what has been configured > > before. > > This too can be achieved by separating out cgroup setup and tagging specific > resources. Agent and cgroup know what each cgroup is supposed to do as they > already do now and each resource is tagged whether they're persistent or > not, so everything is always known without the agent and the application > having to explicitly share the information. > > Thanks. > > -- > tejun
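For concreteness, the agent-side half of the layering approach discussed in this thread is ordinary cgroup2 filesystem work; a minimal sketch with made-up paths, values and pid, assuming the parent slice already delegates the memory controller:

  #include <stdio.h>
  #include <sys/stat.h>

  static int write_str(const char *path, const char *val)
  {
      FILE *f = fopen(path, "w");

      if (!f)
          return -1;
      fputs(val, f);
      return fclose(f);
  }

  int main(void)
  {
      /* Persistent service-level cgroup: resource control lives here and
       * survives instance restarts. */
      mkdir("/sys/fs/cgroup/system.slice/hello.service", 0755);
      write_str("/sys/fs/cgroup/system.slice/hello.service/cgroup.subtree_control",
                "+memory");
      write_str("/sys/fs/cgroup/system.slice/hello.service/memory.max",
                "1073741824");

      /* Per-instance cgroup underneath; when it is removed, any remaining
       * charges reparent to hello.service rather than to system.slice. */
      mkdir("/sys/fs/cgroup/system.slice/hello.service/hello.service.instance-1", 0755);
      write_str("/sys/fs/cgroup/system.slice/hello.service/hello.service.instance-1/cgroup.procs",
                "12345");   /* pid of the instance, made up */
      return 0;
  }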
Hello, On Wed, Aug 24, 2022 at 12:02:04PM -0700, Mina Almasry wrote: > > If we can express all the resource constraints and structures in the cgroup > > side and configured by the management agent, the application can simply e.g. > > madvise whatever memory region or flag bpf maps as "these are persistent" > > and the rest can be handled by the system. If the agent set up the > > environment for that, it gets accounted accordingly; otherwise, it'd behave > > as if those tagging didn't exist. Asking the application to set up all its > > resources in separate steps, that might require significant restructuring > > and knowledge of how the hierarchy is setup in many cases. > > I don't know if this level of granularity is needed with a madvise() > or such. The kernel knows whether resources are persistent due to the > nature of the resource. For example a shared tmpfs file is a resource > that is persistent and not cleaned up after the process using it dies, > but private memory is. madvise(PERSISTENT) on private memory would not > make sense, and I don't think madvise(NOT_PERSISTENT) on tmpfs-backed > memory region would make sense. Also, this requires adding madvise() > hints in userspace code to leverage this. I haven't thought hard about what the hinting interface should be like. The default assumptions would be that page cache belongs to the persistent domain and anon belongs to the instance (mm folks, please correct me if I'm off the rails here), but I can imagine situations where that doesn't necessarily hold - like temp files which get unlinked on instance shutdown. In terms of hint granularity, more coarse-grained (e.g. file, mount, whatever) seems to make sense but again I haven't thought too hard on it. That said, as long as the default behavior is reasonable, I think adding some hinting calls in the application is acceptable. It doesn't require any structural changes and the additions would be for its own benefit of more accurate accounting and control. That makes sense to me. One unfortunate effect this will have is that we'll be actively putting resources into intermediate cgroups. This already happens today but if we support persistent domains, it's gonna be a lot more prevalent and we'll need to update e.g. iocost to support IOs coming out of intermediate cgroups. This kinda sucks because we don't even have knobs to control self vs. children distributions. Oh well... Thanks.
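The "temp files unlinked on instance shutdown" exception is easy to picture with O_TMPFILE (a real flag; the scenario here is just an illustration): the file's page cache dies with the last fd, so under a "page cache defaults to the persistent domain" rule this is exactly the kind of object an instance would want hinted back to itself. No such hint exists today.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[4096] = { 0 };
      /* Unlinked scratch file: it exists only as long as this fd does,
       * so its page cache is instance-lifetime, not service-lifetime. */
      int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);

      if (fd < 0)
          return 1;

      if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
          close(fd);
          return 1;
      }

      close(fd);  /* pages go away here; nothing persists across instances */
      return 0;
  }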