Message ID | 20220902023003.47124-1-laoar.shao@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | bpf: Introduce selectable memcg for bpf map | expand |
Hello, On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: ... > This patchset tries to resolve the above two issues by introducing a > selectable memcg to limit the bpf memory. Currently we only allow to > select its ancestor to avoid breaking the memcg hierarchy further. > Possible use cases of the selectable memcg as follows, As discussed in the following thread, there are clear downsides to an interface which requires the users to specify the cgroups directly. https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org So, I don't really think this is an interface we wanna go for. I was hoping to hear more from memcg folks in the above thread. Maybe ping them in that thread and continue there? Thanks.
On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > Hello, > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > ... > > This patchset tries to resolve the above two issues by introducing a > > selectable memcg to limit the bpf memory. Currently we only allow to > > select its ancestor to avoid breaking the memcg hierarchy further. > > Possible use cases of the selectable memcg as follows, > > As discussed in the following thread, there are clear downsides to an > interface which requires the users to specify the cgroups directly. > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > So, I don't really think this is an interface we wanna go for. I was hoping > to hear more from memcg folks in the above thread. Maybe ping them in that > thread and continue there? Ah, another thing. If the memcg accounting is breaking things right now, we can easily introduce a memcg disable flag for bpf memory. That should help alleviating the immediate breakage while we figure this out. Thanks.
On Wed, Sep 7, 2022 at 8:45 AM Tejun Heo <tj@kernel.org> wrote: > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > Hello, > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > ... > > > This patchset tries to resolve the above two issues by introducing a > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > Possible use cases of the selectable memcg as follows, > > > > As discussed in the following thread, there are clear downsides to an > > interface which requires the users to specify the cgroups directly. > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > to hear more from memcg folks in the above thread. Maybe ping them in that > > thread and continue there? > > Ah, another thing. If the memcg accounting is breaking things right now, we > can easily introduce a memcg disable flag for bpf memory. That should help > alleviating the immediate breakage while we figure this out. Hmm. We discussed this option already. We definitely don't want to introduce an uapi knob that will allow anyone to skip memcg accounting today and in the future.
Hello, On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote: > Hmm. We discussed this option already. We definitely don't want > to introduce an uapi knob that will allow anyone to skip memcg > accounting today and in the future. cgroup.memory boot parameter is how memcg provides last-resort workarounds for this sort of problems / regressions while they're being addressed. It's not a dynamically changeable or programmable thing. Just a boot time opt-out. That said, if you don't want it, you don't want it. Thanks.
On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote: > > Hello, > > On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote: > > Hmm. We discussed this option already. We definitely don't want > > to introduce an uapi knob that will allow anyone to skip memcg > > accounting today and in the future. > > cgroup.memory boot parameter is how memcg provides last-resort workarounds > for this sort of problems / regressions while they're being addressed. It's > not a dynamically changeable or programmable thing. Just a boot time > opt-out. That said, if you don't want it, you don't want it. ahh. boot param. Are you suggesting a global off switch ? Like nosocket and nokmem. That would be a different story. Need to think more about it. It could be ok.
On Wed, Sep 07, 2022 at 09:27:09AM -0700, Alexei Starovoitov wrote: > On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote: > > > > Hello, > > > > On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote: > > > Hmm. We discussed this option already. We definitely don't want > > > to introduce an uapi knob that will allow anyone to skip memcg > > > accounting today and in the future. > > > > cgroup.memory boot parameter is how memcg provides last-resort workarounds > > for this sort of problems / regressions while they're being addressed. It's > > not a dynamically changeable or programmable thing. Just a boot time > > opt-out. That said, if you don't want it, you don't want it. > > ahh. boot param. > Are you suggesting a global off switch ? Like nosocket and nokmem. > That would be a different story. > Need to think more about it. It could be ok. Yeah, nobpf or sth like that. An equivalent cgroup.memory parameter. Thanks.
On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > Hello, > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > ... > > This patchset tries to resolve the above two issues by introducing a > > selectable memcg to limit the bpf memory. Currently we only allow to > > select its ancestor to avoid breaking the memcg hierarchy further. > > Possible use cases of the selectable memcg as follows, > > As discussed in the following thread, there are clear downsides to an > interface which requires the users to specify the cgroups directly. > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > So, I don't really think this is an interface we wanna go for. I was hoping > to hear more from memcg folks in the above thread. Maybe ping them in that > thread and continue there? As I said previously, I don't like it, because it's an attempt to solve a non bpf-specific problem in a bpf-specific way. Yes, memory cgroups are not great for accounting of shared resources, it's well known. This patchset looks like an attempt to "fix" it specifically for bpf maps in a particular cgroup setup. Honestly, I don't think it's worth the added complexity. Especially because a similar behaviour can be achieved simple by placing the task which creates the map into the desired cgroup. Beatiful? Not. Neither is the proposed solution. Thanks!
On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > Hello, > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > ... > > > This patchset tries to resolve the above two issues by introducing a > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > Possible use cases of the selectable memcg as follows, > > > > As discussed in the following thread, there are clear downsides to an > > interface which requires the users to specify the cgroups directly. > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > to hear more from memcg folks in the above thread. Maybe ping them in that > > thread and continue there? > Hi Roman, > As I said previously, I don't like it, because it's an attempt to solve a non > bpf-specific problem in a bpf-specific way. > Why do you still insist that bpf_map->memcg is not a bpf-specific issue after so many discussions? Do you charge the bpf-map's memory the same way as you charge the page caches or slabs ? No, you don't. You charge it in a bpf-specific way. > Yes, memory cgroups are not great for accounting of shared resources, it's well > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > in a particular cgroup setup. Honestly, I don't think it's worth the added > complexity. Especially because a similar behaviour can be achieved simple > by placing the task which creates the map into the desired cgroup. Are you serious ? Have you ever read the cgroup doc? Which clearly describe the "No Internal Process Constraint".[1] Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > Beatiful? Not. Neither is the proposed solution. > Is it really hard to admit a fault?
On Wed, Sep 7, 2022 at 7:37 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > Hello, > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > ... > > > > This patchset tries to resolve the above two issues by introducing a > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > Possible use cases of the selectable memcg as follows, > > > > > > As discussed in the following thread, there are clear downsides to an > > > interface which requires the users to specify the cgroups directly. > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > thread and continue there? > > > > Hi Roman, > > > As I said previously, I don't like it, because it's an attempt to solve a non > > bpf-specific problem in a bpf-specific way. > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > issue after so many discussions? > Do you charge the bpf-map's memory the same way as you charge the page > caches or slabs ? > No, you don't. You charge it in a bpf-specific way. > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > complexity. Especially because a similar behaviour can be achieved simple > > by placing the task which creates the map into the desired cgroup. > > Are you serious ? > Have you ever read the cgroup doc? Which clearly describe the "No > Internal Process Constraint".[1] > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > Beatiful? Not. Neither is the proposed solution. > > > > Is it really hard to admit a fault? Yafang, This attitude won't get you anywhere. Selecting memcg by fd is no go. You need to work with the community to figure out a solution acceptable to maintainers of relevant subsystems.
On Thu, Sep 8, 2022 at 1:01 AM Tejun Heo <tj@kernel.org> wrote: > > On Wed, Sep 07, 2022 at 09:27:09AM -0700, Alexei Starovoitov wrote: > > On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote: > > > > > > Hello, > > > > > > On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote: > > > > Hmm. We discussed this option already. We definitely don't want > > > > to introduce an uapi knob that will allow anyone to skip memcg > > > > accounting today and in the future. > > > > > > cgroup.memory boot parameter is how memcg provides last-resort workarounds > > > for this sort of problems / regressions while they're being addressed. It's > > > not a dynamically changeable or programmable thing. Just a boot time > > > opt-out. That said, if you don't want it, you don't want it. > > > > ahh. boot param. > > Are you suggesting a global off switch ? Like nosocket and nokmem. > > That would be a different story. > > Need to think more about it. It could be ok. > > Yeah, nobpf or sth like that. An equivalent cgroup.memory parameter. > It may be a useful feature for some cases, but it can't help container users. The memcg works well to limit the non-pinned bpf-map, that's the reason why we, a container user, switch to memcg-based bpf charging. Our goal is to make it also work for pinned bpf-map. That said, your proposal may be a useful feature, but it should be another different patchset.
On Thu, Sep 8, 2022 at 10:43 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Sep 7, 2022 at 7:37 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > Hello, > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > ... > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > thread and continue there? > > > > > > > Hi Roman, > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > bpf-specific problem in a bpf-specific way. > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > issue after so many discussions? > > Do you charge the bpf-map's memory the same way as you charge the page > > caches or slabs ? > > No, you don't. You charge it in a bpf-specific way. > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > complexity. Especially because a similar behaviour can be achieved simple > > > by placing the task which creates the map into the desired cgroup. > > > > Are you serious ? > > Have you ever read the cgroup doc? Which clearly describe the "No > > Internal Process Constraint".[1] > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > Is it really hard to admit a fault? > > Yafang, > > This attitude won't get you anywhere. > Thanks for pointing it out. It is my fault. > Selecting memcg by fd is no go. > You need to work with the community to figure out a solution > acceptable to maintainers of relevant subsystems. Yes, I'm trying.
On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > Hello, > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > ... > > > > This patchset tries to resolve the above two issues by introducing a > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > Possible use cases of the selectable memcg as follows, > > > > > > As discussed in the following thread, there are clear downsides to an > > > interface which requires the users to specify the cgroups directly. > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > thread and continue there? > > > > Hi Roman, > > > As I said previously, I don't like it, because it's an attempt to solve a non > > bpf-specific problem in a bpf-specific way. > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > issue after so many discussions? > Do you charge the bpf-map's memory the same way as you charge the page > caches or slabs ? > No, you don't. You charge it in a bpf-specific way. The only difference is that we charge the cgroup of the processes who created a map, not a process who is doing a specific allocation. Your patchset doesn't change this. There are pros and cons with this approach, we've discussed it back to the times when bpf memcg accounting was developed. If you want to revisit this, it's maybe possible (given there is a really strong and likely new motivation appears), but I haven't seen any complaints yet except from you. > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > complexity. Especially because a similar behaviour can be achieved simple > > by placing the task which creates the map into the desired cgroup. > > Are you serious ? > Have you ever read the cgroup doc? Which clearly describe the "No > Internal Process Constraint".[1] > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. But you can place it into another leaf cgroup. You can delete this leaf cgroup and your memcg will get reparented. You can attach this process and create a bpf map to the parent cgroup before it gets child cgroups. You can revisit the idea of shared bpf maps and outlive specific cgroups. Lof of options. > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > Beatiful? Not. Neither is the proposed solution. > > > > Is it really hard to admit a fault? Yafang, you posted several versions and so far I haven't seen much of support or excitement from anyone (please, fix me if I'm wrong). It's not like I'm nacking a patchset with many acks, reviews and supporters. Still think you're solving an important problem in a reasonable way? It seems like not many are convinced yet. I'd recommend to focus on this instead of blaming me. Thanks!
On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > Hello, > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > ... > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > thread and continue there? > > > > > > > Hi Roman, > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > bpf-specific problem in a bpf-specific way. > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > issue after so many discussions? > > Do you charge the bpf-map's memory the same way as you charge the page > > caches or slabs ? > > No, you don't. You charge it in a bpf-specific way. > Hi Roman, Sorry for the late response. I've been on vacation in the past few days. > The only difference is that we charge the cgroup of the processes who > created a map, not a process who is doing a specific allocation. This means the bpf-map can be indepent of process, IOW, the memcg of bpf-map can be indepent of the memcg of the processes. This is the fundamental difference between bpf-map and page caches, then... > Your patchset doesn't change this. We can make this behavior reasonable by introducing an independent memcg, as what I did in the previous version. > There are pros and cons with this approach, we've discussed it back > to the times when bpf memcg accounting was developed. If you want > to revisit this, it's maybe possible (given there is a really strong and likely > new motivation appears), but I haven't seen any complaints yet except from you. > memcg-base bpf accounting is a new feature, which may not be used widely. > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > complexity. Especially because a similar behaviour can be achieved simple > > > by placing the task which creates the map into the desired cgroup. > > > > Are you serious ? > > Have you ever read the cgroup doc? Which clearly describe the "No > > Internal Process Constraint".[1] > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > But you can place it into another leaf cgroup. You can delete this leaf cgroup > and your memcg will get reparented. You can attach this process and create > a bpf map to the parent cgroup before it gets child cgroups. If the process doesn't exit after it created bpf-map, we have to migrate it around memcgs.... The complexity in deployment can introduce unexpected issues easily. > You can revisit the idea of shared bpf maps and outlive specific cgroups. > Lof of options. > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > Is it really hard to admit a fault? > > Yafang, you posted several versions and so far I haven't seen much of support > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm > nacking a patchset with many acks, reviews and supporters. > > Still think you're solving an important problem in a reasonable way? > It seems like not many are convinced yet. I'd recommend to focus on this instead > of blaming me. > The best way so far is to introduce specific memcg for specific resources. Because not only the process owns its memcg, but also specific resources own their memcgs, for example bpf-map, or socket. struct bpf_map { <<<< memcg owner struct memcg_cgroup *memcg; }; struct sock { <<<< memcg owner struct mem_cgroup *sk_memcg; }; These resources already have their own memcgs, so we should make this behavior formal. The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote: > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > > Hello, > > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > > ... > > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > > thread and continue there? > > > > > > > > > > Hi Roman, > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > > bpf-specific problem in a bpf-specific way. > > > > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > > issue after so many discussions? > > > Do you charge the bpf-map's memory the same way as you charge the page > > > caches or slabs ? > > > No, you don't. You charge it in a bpf-specific way. > > > > Hi Roman, > > Sorry for the late response. > I've been on vacation in the past few days. > > > The only difference is that we charge the cgroup of the processes who > > created a map, not a process who is doing a specific allocation. > > This means the bpf-map can be indepent of process, IOW, the memcg of > bpf-map can be indepent of the memcg of the processes. > This is the fundamental difference between bpf-map and page caches, then... > > > Your patchset doesn't change this. > > We can make this behavior reasonable by introducing an independent > memcg, as what I did in the previous version. > > > There are pros and cons with this approach, we've discussed it back > > to the times when bpf memcg accounting was developed. If you want > > to revisit this, it's maybe possible (given there is a really strong and likely > > new motivation appears), but I haven't seen any complaints yet except from you. > > > > memcg-base bpf accounting is a new feature, which may not be used widely. > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > > complexity. Especially because a similar behaviour can be achieved simple > > > > by placing the task which creates the map into the desired cgroup. > > > > > > Are you serious ? > > > Have you ever read the cgroup doc? Which clearly describe the "No > > > Internal Process Constraint".[1] > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup > > and your memcg will get reparented. You can attach this process and create > > a bpf map to the parent cgroup before it gets child cgroups. > > If the process doesn't exit after it created bpf-map, we have to > migrate it around memcgs.... > The complexity in deployment can introduce unexpected issues easily. > > > You can revisit the idea of shared bpf maps and outlive specific cgroups. > > Lof of options. > > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > > > > Is it really hard to admit a fault? > > > > Yafang, you posted several versions and so far I haven't seen much of support > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm > > nacking a patchset with many acks, reviews and supporters. > > > > Still think you're solving an important problem in a reasonable way? > > It seems like not many are convinced yet. I'd recommend to focus on this instead > > of blaming me. > > > > The best way so far is to introduce specific memcg for specific resources. > Because not only the process owns its memcg, but also specific > resources own their memcgs, for example bpf-map, or socket. > > struct bpf_map { <<<< memcg owner > struct memcg_cgroup *memcg; > }; > > struct sock { <<<< memcg owner > struct mem_cgroup *sk_memcg; > }; > > These resources already have their own memcgs, so we should make this > behavior formal. > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'. This is a fundamental change: cgroups were always hierarchical groups of processes/threads. You're basically suggesting to extend it to hierarchical groups of processes and some other objects (what's a good definition?). It's a huge change and it's scope is definetely larger than bpf and even memory cgroups. It will raise a lot of questions: e.g. what does it mean for other controllers (cpu, io, etc)? Which objects can have dedicated cgroups and which not? How the interface will look like? How the oom handling will work? Etc. The history showed that starting small with one controller and/or specific use case isn't working well for cgroups, because different resources and controllers are not living independently. So if you really want to go this way you need to present the whole picture and convince many people that it's worth it. I doubt this specific bpf map accounting thing can justify it. Personally I know some examples where such functionality could be useful, but I'm not yet convinced it's worth the effort and potential problems. Thanks!
On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote: > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > > > Hello, > > > > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > > > ... > > > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > > > thread and continue there? > > > > > > > > > > > > > Hi Roman, > > > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > > > bpf-specific problem in a bpf-specific way. > > > > > > > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > > > issue after so many discussions? > > > > Do you charge the bpf-map's memory the same way as you charge the page > > > > caches or slabs ? > > > > No, you don't. You charge it in a bpf-specific way. > > > > > > > Hi Roman, > > > > Sorry for the late response. > > I've been on vacation in the past few days. > > > > > The only difference is that we charge the cgroup of the processes who > > > created a map, not a process who is doing a specific allocation. > > > > This means the bpf-map can be indepent of process, IOW, the memcg of > > bpf-map can be indepent of the memcg of the processes. > > This is the fundamental difference between bpf-map and page caches, then... > > > > > Your patchset doesn't change this. > > > > We can make this behavior reasonable by introducing an independent > > memcg, as what I did in the previous version. > > > > > There are pros and cons with this approach, we've discussed it back > > > to the times when bpf memcg accounting was developed. If you want > > > to revisit this, it's maybe possible (given there is a really strong and likely > > > new motivation appears), but I haven't seen any complaints yet except from you. > > > > > > > memcg-base bpf accounting is a new feature, which may not be used widely. > > > > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > > > complexity. Especially because a similar behaviour can be achieved simple > > > > > by placing the task which creates the map into the desired cgroup. > > > > > > > > Are you serious ? > > > > Have you ever read the cgroup doc? Which clearly describe the "No > > > > Internal Process Constraint".[1] > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup > > > and your memcg will get reparented. You can attach this process and create > > > a bpf map to the parent cgroup before it gets child cgroups. > > > > If the process doesn't exit after it created bpf-map, we have to > > migrate it around memcgs.... > > The complexity in deployment can introduce unexpected issues easily. > > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups. > > > Lof of options. > > > > > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > > > > > > > Is it really hard to admit a fault? > > > > > > Yafang, you posted several versions and so far I haven't seen much of support > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm > > > nacking a patchset with many acks, reviews and supporters. > > > > > > Still think you're solving an important problem in a reasonable way? > > > It seems like not many are convinced yet. I'd recommend to focus on this instead > > > of blaming me. > > > > > > > The best way so far is to introduce specific memcg for specific resources. > > Because not only the process owns its memcg, but also specific > > resources own their memcgs, for example bpf-map, or socket. > > > > struct bpf_map { <<<< memcg owner > > struct memcg_cgroup *memcg; > > }; > > > > struct sock { <<<< memcg owner > > struct mem_cgroup *sk_memcg; > > }; > > > > These resources already have their own memcgs, so we should make this > > behavior formal. > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'. > > This is a fundamental change: cgroups were always hierarchical groups > of processes/threads. You're basically suggesting to extend it to > hierarchical groups of processes and some other objects (what's a good > definition?). Kind of, but not exactly. We can do it without breaking the cgroup hierarchy. Under current cgroup hierarchy, the user can only echo processes/threads into a cgroup, that won't be changed in the future. The specific resources are not exposed to the user, the user can only control these specific resources by controlling their associated processes/threads. For example, Memcg-A |---- Memcg-A1 |---- Memcg-A2 We can introduce a new file memory.owner into each memcg. Each bit of memory.owner represents a specific resources, memory.owner: | bit31 | bitN | ... | bit1 | bit0 | | | |------ bit0: bpf memory | |-------------- bit1: socket memory | |--------------------------- bitN: a specific resource There won't be too many specific resources which have to own their memcgs, so I think 32bits is enough. Memcg-A : memory.owner == 0x1 |---- Memcg-A1 : memory.owner == 0 |---- Memcg-A2 : memory.owner == 0x1 Then the bpf created by processes in Memcg-A1 will be charged into Memcg-A directly without charging into Memcg-A1. But the bpf created by processes in Memcg-A2 will be charged into Memcg-A2 as its memory.owner is 0x1. That said, these specific resources are not fully independent of process, while they are still associated with the processes which create them. Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we don't need to care about the possible migration issue. I think we may also apply it to shared page caches. For example, struct inode { struct mem_cgroup *memcg; <<<< add a new member }; We define struct inode as a memcg owner, and use scope-based charge to charge its pages into inode->memcg. And then put all memcgs which shared these resources under the same parent. The page caches of this inode will be charged into the parent directly. The shared page cache is more complicated than bpf memory, so I'm not quite sure if it can apply to shared page cache, but it can work well for bpf memory. Regarding the observability, we can introduce a specific item into memory.stat for this specific memory. For example a new item 'bpf' for bpf memory. That can be accounted/unaccounted for in the same way as we do in set_active_memcg(). for example, struct task_struct { struct mem_cgroup *active_memcg; int active_memcg_item; <<<< introduce a new member }; bpf_memory_alloc() { old_memcg = set_active_memcg(memcg); old_item = set_active_memcg_item(MEMCG_BPF); alloc(); set_active_memcg_item(old_item); set_active_memcg(old_memcg); } bpf_memory_free() { old = set_active_memcg_item(MEMCG_BPF); free(); set_active_memcg_item(old); } Then we can know the bpf memory size in each memcg, and possible system-wide bpf-memory usage if it can be supported in root memcg. (Currently kmem is not charged into root memcg) > It's a huge change and it's scope is definetely larger > than bpf and even memory cgroups. It will raise a lot of questions: > e.g. what does it mean for other controllers (cpu, io, etc)? > Which objects can have dedicated cgroups and which not? How the interface will > look like? How the oom handling will work? Etc. > With the above hierarchy, I think the oom handling can work as-is. > The history showed that starting small with one controller and/or specific > use case isn't working well for cgroups, because different resources and > controllers are not living independently. Agreed. That is why I still associate the specific resources with the process which creates them. If all the resources are still associated with the process, I think it can work well. > So if you really want to go this way > you need to present the whole picture and convince many people that it's worth > it. I doubt this specific bpf map accounting thing can justify it. > > Personally I know some examples where such functionality could be useful, > but I'm not yet convinced it's worth the effort and potential problems. > Thanks for your reply.
On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote: > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin > <roman.gushchin@linux.dev> wrote: > > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote: > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > > > > Hello, > > > > > > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > > > > ... > > > > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > > > > thread and continue there? > > > > > > > > > > > > > > > > Hi Roman, > > > > > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > > > > bpf-specific problem in a bpf-specific way. > > > > > > > > > > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > > > > issue after so many discussions? > > > > > Do you charge the bpf-map's memory the same way as you charge the page > > > > > caches or slabs ? > > > > > No, you don't. You charge it in a bpf-specific way. > > > > > > > > > > Hi Roman, > > > > > > Sorry for the late response. > > > I've been on vacation in the past few days. > > > > > > > The only difference is that we charge the cgroup of the processes who > > > > created a map, not a process who is doing a specific allocation. > > > > > > This means the bpf-map can be indepent of process, IOW, the memcg of > > > bpf-map can be indepent of the memcg of the processes. > > > This is the fundamental difference between bpf-map and page caches, then... > > > > > > > Your patchset doesn't change this. > > > > > > We can make this behavior reasonable by introducing an independent > > > memcg, as what I did in the previous version. > > > > > > > There are pros and cons with this approach, we've discussed it back > > > > to the times when bpf memcg accounting was developed. If you want > > > > to revisit this, it's maybe possible (given there is a really strong and likely > > > > new motivation appears), but I haven't seen any complaints yet except from you. > > > > > > > > > > memcg-base bpf accounting is a new feature, which may not be used widely. > > > > > > > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > > > > complexity. Especially because a similar behaviour can be achieved simple > > > > > > by placing the task which creates the map into the desired cgroup. > > > > > > > > > > Are you serious ? > > > > > Have you ever read the cgroup doc? Which clearly describe the "No > > > > > Internal Process Constraint".[1] > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > > > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup > > > > and your memcg will get reparented. You can attach this process and create > > > > a bpf map to the parent cgroup before it gets child cgroups. > > > > > > If the process doesn't exit after it created bpf-map, we have to > > > migrate it around memcgs.... > > > The complexity in deployment can introduce unexpected issues easily. > > > > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups. > > > > Lof of options. > > > > > > > > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > > > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > > > > > > > > > > Is it really hard to admit a fault? > > > > > > > > Yafang, you posted several versions and so far I haven't seen much of support > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm > > > > nacking a patchset with many acks, reviews and supporters. > > > > > > > > Still think you're solving an important problem in a reasonable way? > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead > > > > of blaming me. > > > > > > > > > > The best way so far is to introduce specific memcg for specific resources. > > > Because not only the process owns its memcg, but also specific > > > resources own their memcgs, for example bpf-map, or socket. > > > > > > struct bpf_map { <<<< memcg owner > > > struct memcg_cgroup *memcg; > > > }; > > > > > > struct sock { <<<< memcg owner > > > struct mem_cgroup *sk_memcg; > > > }; > > > > > > These resources already have their own memcgs, so we should make this > > > behavior formal. > > > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'. > > > > This is a fundamental change: cgroups were always hierarchical groups > > of processes/threads. You're basically suggesting to extend it to > > hierarchical groups of processes and some other objects (what's a good > > definition?). > > Kind of, but not exactly. > We can do it without breaking the cgroup hierarchy. Under current > cgroup hierarchy, the user can only echo processes/threads into a > cgroup, that won't be changed in the future. The specific resources > are not exposed to the user, the user can only control these specific > resources by controlling their associated processes/threads. > For example, > > Memcg-A > |---- Memcg-A1 > |---- Memcg-A2 > > We can introduce a new file memory.owner into each memcg. Each bit of > memory.owner represents a specific resources, > > memory.owner: | bit31 | bitN | ... | bit1 | bit0 | > | | > |------ bit0: bpf memory > | > |-------------- bit1: socket memory > | > |--------------------------- > bitN: a specific resource > > There won't be too many specific resources which have to own their > memcgs, so I think 32bits is enough. > > Memcg-A : memory.owner == 0x1 > |---- Memcg-A1 : memory.owner == 0 > |---- Memcg-A2 : memory.owner == 0x1 > > Then the bpf created by processes in Memcg-A1 will be charged into > Memcg-A directly without charging into Memcg-A1. > But the bpf created by processes in Memcg-A2 will be charged into > Memcg-A2 as its memory.owner is 0x1. > That said, these specific resources are not fully independent of > process, while they are still associated with the processes which > create them. > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we > don't need to care about the possible migration issue. > > I think we may also apply it to shared page caches. For example, > struct inode { > struct mem_cgroup *memcg; <<<< add a new member > }; > > We define struct inode as a memcg owner, and use scope-based charge to > charge its pages into inode->memcg. > And then put all memcgs which shared these resources under the same > parent. The page caches of this inode will be charged into the parent > directly. Ok, so it's something like premature selective reparenting. > The shared page cache is more complicated than bpf memory, so I'm not > quite sure if it can apply to shared page cache, but it can work well > for bpf memory. Yeah, this is the problem. It feels like it's a problem very specific to bpf maps and an exact way you use them. I don't think you can successfully advocate for changes of these calibre without a more generic problem. I might be wrong. > > Regarding the observability, we can introduce a specific item into > memory.stat for this specific memory. For example a new item 'bpf' for > bpf memory. > That can be accounted/unaccounted for in the same way as we do in > set_active_memcg(). for example, > > struct task_struct { > struct mem_cgroup *active_memcg; > int active_memcg_item; <<<< > introduce a new member > }; > > bpf_memory_alloc() > { > old_memcg = set_active_memcg(memcg); > old_item = set_active_memcg_item(MEMCG_BPF); I thought about something like this but for a different purpose: to track the amount of memory consumed by bpf. > alloc(); > set_active_memcg_item(old_item); > set_active_memcg(old_memcg); > } > > bpf_memory_free() > { > old = set_active_memcg_item(MEMCG_BPF); > free(); > set_active_memcg_item(old); > } But the problem is that we shoud very carefully mark all allocations and releases, which is very error-prone. Interfaces which don't require annotating releases are generally better, but require additional memory. Thanks!
On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote: > > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin > > <roman.gushchin@linux.dev> wrote: > > > > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote: > > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > > > > > Hello, > > > > > > > > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > > > > > ... > > > > > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > > > > > thread and continue there? > > > > > > > > > > > > > > > > > > > Hi Roman, > > > > > > > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > > > > > bpf-specific problem in a bpf-specific way. > > > > > > > > > > > > > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > > > > > issue after so many discussions? > > > > > > Do you charge the bpf-map's memory the same way as you charge the page > > > > > > caches or slabs ? > > > > > > No, you don't. You charge it in a bpf-specific way. > > > > > > > > > > > > > Hi Roman, > > > > > > > > Sorry for the late response. > > > > I've been on vacation in the past few days. > > > > > > > > > The only difference is that we charge the cgroup of the processes who > > > > > created a map, not a process who is doing a specific allocation. > > > > > > > > This means the bpf-map can be indepent of process, IOW, the memcg of > > > > bpf-map can be indepent of the memcg of the processes. > > > > This is the fundamental difference between bpf-map and page caches, then... > > > > > > > > > Your patchset doesn't change this. > > > > > > > > We can make this behavior reasonable by introducing an independent > > > > memcg, as what I did in the previous version. > > > > > > > > > There are pros and cons with this approach, we've discussed it back > > > > > to the times when bpf memcg accounting was developed. If you want > > > > > to revisit this, it's maybe possible (given there is a really strong and likely > > > > > new motivation appears), but I haven't seen any complaints yet except from you. > > > > > > > > > > > > > memcg-base bpf accounting is a new feature, which may not be used widely. > > > > > > > > > > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > > > > > complexity. Especially because a similar behaviour can be achieved simple > > > > > > > by placing the task which creates the map into the desired cgroup. > > > > > > > > > > > > Are you serious ? > > > > > > Have you ever read the cgroup doc? Which clearly describe the "No > > > > > > Internal Process Constraint".[1] > > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > > > > > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup > > > > > and your memcg will get reparented. You can attach this process and create > > > > > a bpf map to the parent cgroup before it gets child cgroups. > > > > > > > > If the process doesn't exit after it created bpf-map, we have to > > > > migrate it around memcgs.... > > > > The complexity in deployment can introduce unexpected issues easily. > > > > > > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups. > > > > > Lof of options. > > > > > > > > > > > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > > > > > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > > > > > > > > > > > > > Is it really hard to admit a fault? > > > > > > > > > > Yafang, you posted several versions and so far I haven't seen much of support > > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm > > > > > nacking a patchset with many acks, reviews and supporters. > > > > > > > > > > Still think you're solving an important problem in a reasonable way? > > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead > > > > > of blaming me. > > > > > > > > > > > > > The best way so far is to introduce specific memcg for specific resources. > > > > Because not only the process owns its memcg, but also specific > > > > resources own their memcgs, for example bpf-map, or socket. > > > > > > > > struct bpf_map { <<<< memcg owner > > > > struct memcg_cgroup *memcg; > > > > }; > > > > > > > > struct sock { <<<< memcg owner > > > > struct mem_cgroup *sk_memcg; > > > > }; > > > > > > > > These resources already have their own memcgs, so we should make this > > > > behavior formal. > > > > > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'. > > > > > > This is a fundamental change: cgroups were always hierarchical groups > > > of processes/threads. You're basically suggesting to extend it to > > > hierarchical groups of processes and some other objects (what's a good > > > definition?). > > > > Kind of, but not exactly. > > We can do it without breaking the cgroup hierarchy. Under current > > cgroup hierarchy, the user can only echo processes/threads into a > > cgroup, that won't be changed in the future. The specific resources > > are not exposed to the user, the user can only control these specific > > resources by controlling their associated processes/threads. > > For example, > > > > Memcg-A > > |---- Memcg-A1 > > |---- Memcg-A2 > > > > We can introduce a new file memory.owner into each memcg. Each bit of > > memory.owner represents a specific resources, > > > > memory.owner: | bit31 | bitN | ... | bit1 | bit0 | > > | | > > |------ bit0: bpf memory > > | > > |-------------- bit1: socket memory > > | > > |--------------------------- > > bitN: a specific resource > > > > There won't be too many specific resources which have to own their > > memcgs, so I think 32bits is enough. > > > > Memcg-A : memory.owner == 0x1 > > |---- Memcg-A1 : memory.owner == 0 > > |---- Memcg-A2 : memory.owner == 0x1 > > > > Then the bpf created by processes in Memcg-A1 will be charged into > > Memcg-A directly without charging into Memcg-A1. > > But the bpf created by processes in Memcg-A2 will be charged into > > Memcg-A2 as its memory.owner is 0x1. > > That said, these specific resources are not fully independent of > > process, while they are still associated with the processes which > > create them. > > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we > > don't need to care about the possible migration issue. > > > > I think we may also apply it to shared page caches. For example, > > struct inode { > > struct mem_cgroup *memcg; <<<< add a new member > > }; > > > > We define struct inode as a memcg owner, and use scope-based charge to > > charge its pages into inode->memcg. > > And then put all memcgs which shared these resources under the same > > parent. The page caches of this inode will be charged into the parent > > directly. > > Ok, so it's something like premature selective reparenting. > Right. I think it may be a good way to handle the resources which may outlive the process. > > The shared page cache is more complicated than bpf memory, so I'm not > > quite sure if it can apply to shared page cache, but it can work well > > for bpf memory. > > Yeah, this is the problem. It feels like it's a problem very specific > to bpf maps and an exact way you use them. I don't think you can successfully > advocate for changes of these calibre without a more generic problem. I might > be wrong. > What is your concern about this method? Are there any potential issues? > > > > Regarding the observability, we can introduce a specific item into > > memory.stat for this specific memory. For example a new item 'bpf' for > > bpf memory. > > That can be accounted/unaccounted for in the same way as we do in > > set_active_memcg(). for example, > > > > struct task_struct { > > struct mem_cgroup *active_memcg; > > int active_memcg_item; <<<< > > introduce a new member > > }; > > > > bpf_memory_alloc() > > { > > old_memcg = set_active_memcg(memcg); > > old_item = set_active_memcg_item(MEMCG_BPF); > > I thought about something like this but for a different purpose: > to track the amount of memory consumed by bpf. > Right, we can use it to track bpf memory consumption. > > alloc(); > > set_active_memcg_item(old_item); > > set_active_memcg(old_memcg); > > } > > > > bpf_memory_free() > > { > > old = set_active_memcg_item(MEMCG_BPF); > > free(); > > set_active_memcg_item(old); > > } > > But the problem is that we shoud very carefully mark all allocations and > releases, which is very error-prone. Interfaces which don't require annotating > releases are generally better, but require additional memory. > If we don't annotate the releases, we have to add something into the struct page, which may not be worth it. It is clear how the bpf memory is allocated and freed, so I think we can start it with bpf memory. If in the future we can figure out a lightweight way to avoid annotating the releases, then we can remove the annotations in the bpf memory releases.
On Tue, Sep 20, 2022 at 08:42:36PM +0800, Yafang Shao wrote: > On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin > <roman.gushchin@linux.dev> wrote: > > > > On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote: > > > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin > > > <roman.gushchin@linux.dev> wrote: > > > > > > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote: > > > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > > > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > > > > > > ... > > > > > > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > > > > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > > > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > > > > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > > > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > > > > > > thread and continue there? > > > > > > > > > > > > > > > > > > > > > > Hi Roman, > > > > > > > > > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > > > > > > bpf-specific problem in a bpf-specific way. > > > > > > > > > > > > > > > > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > > > > > > issue after so many discussions? > > > > > > > Do you charge the bpf-map's memory the same way as you charge the page > > > > > > > caches or slabs ? > > > > > > > No, you don't. You charge it in a bpf-specific way. > > > > > > > > > > > > > > > > Hi Roman, > > > > > > > > > > Sorry for the late response. > > > > > I've been on vacation in the past few days. > > > > > > > > > > > The only difference is that we charge the cgroup of the processes who > > > > > > created a map, not a process who is doing a specific allocation. > > > > > > > > > > This means the bpf-map can be indepent of process, IOW, the memcg of > > > > > bpf-map can be indepent of the memcg of the processes. > > > > > This is the fundamental difference between bpf-map and page caches, then... > > > > > > > > > > > Your patchset doesn't change this. > > > > > > > > > > We can make this behavior reasonable by introducing an independent > > > > > memcg, as what I did in the previous version. > > > > > > > > > > > There are pros and cons with this approach, we've discussed it back > > > > > > to the times when bpf memcg accounting was developed. If you want > > > > > > to revisit this, it's maybe possible (given there is a really strong and likely > > > > > > new motivation appears), but I haven't seen any complaints yet except from you. > > > > > > > > > > > > > > > > memcg-base bpf accounting is a new feature, which may not be used widely. > > > > > > > > > > > > > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > > > > > > complexity. Especially because a similar behaviour can be achieved simple > > > > > > > > by placing the task which creates the map into the desired cgroup. > > > > > > > > > > > > > > Are you serious ? > > > > > > > Have you ever read the cgroup doc? Which clearly describe the "No > > > > > > > Internal Process Constraint".[1] > > > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > > > > > > > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup > > > > > > and your memcg will get reparented. You can attach this process and create > > > > > > a bpf map to the parent cgroup before it gets child cgroups. > > > > > > > > > > If the process doesn't exit after it created bpf-map, we have to > > > > > migrate it around memcgs.... > > > > > The complexity in deployment can introduce unexpected issues easily. > > > > > > > > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups. > > > > > > Lof of options. > > > > > > > > > > > > > > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > > > > > > > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > > > > > > > > > > > > > > > > Is it really hard to admit a fault? > > > > > > > > > > > > Yafang, you posted several versions and so far I haven't seen much of support > > > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm > > > > > > nacking a patchset with many acks, reviews and supporters. > > > > > > > > > > > > Still think you're solving an important problem in a reasonable way? > > > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead > > > > > > of blaming me. > > > > > > > > > > > > > > > > The best way so far is to introduce specific memcg for specific resources. > > > > > Because not only the process owns its memcg, but also specific > > > > > resources own their memcgs, for example bpf-map, or socket. > > > > > > > > > > struct bpf_map { <<<< memcg owner > > > > > struct memcg_cgroup *memcg; > > > > > }; > > > > > > > > > > struct sock { <<<< memcg owner > > > > > struct mem_cgroup *sk_memcg; > > > > > }; > > > > > > > > > > These resources already have their own memcgs, so we should make this > > > > > behavior formal. > > > > > > > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'. > > > > > > > > This is a fundamental change: cgroups were always hierarchical groups > > > > of processes/threads. You're basically suggesting to extend it to > > > > hierarchical groups of processes and some other objects (what's a good > > > > definition?). > > > > > > Kind of, but not exactly. > > > We can do it without breaking the cgroup hierarchy. Under current > > > cgroup hierarchy, the user can only echo processes/threads into a > > > cgroup, that won't be changed in the future. The specific resources > > > are not exposed to the user, the user can only control these specific > > > resources by controlling their associated processes/threads. > > > For example, > > > > > > Memcg-A > > > |---- Memcg-A1 > > > |---- Memcg-A2 > > > > > > We can introduce a new file memory.owner into each memcg. Each bit of > > > memory.owner represents a specific resources, > > > > > > memory.owner: | bit31 | bitN | ... | bit1 | bit0 | > > > | | > > > |------ bit0: bpf memory > > > | > > > |-------------- bit1: socket memory > > > | > > > |--------------------------- > > > bitN: a specific resource > > > > > > There won't be too many specific resources which have to own their > > > memcgs, so I think 32bits is enough. > > > > > > Memcg-A : memory.owner == 0x1 > > > |---- Memcg-A1 : memory.owner == 0 > > > |---- Memcg-A2 : memory.owner == 0x1 > > > > > > Then the bpf created by processes in Memcg-A1 will be charged into > > > Memcg-A directly without charging into Memcg-A1. > > > But the bpf created by processes in Memcg-A2 will be charged into > > > Memcg-A2 as its memory.owner is 0x1. > > > That said, these specific resources are not fully independent of > > > process, while they are still associated with the processes which > > > create them. > > > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we > > > don't need to care about the possible migration issue. > > > > > > I think we may also apply it to shared page caches. For example, > > > struct inode { > > > struct mem_cgroup *memcg; <<<< add a new member > > > }; > > > > > > We define struct inode as a memcg owner, and use scope-based charge to > > > charge its pages into inode->memcg. > > > And then put all memcgs which shared these resources under the same > > > parent. The page caches of this inode will be charged into the parent > > > directly. > > > > Ok, so it's something like premature selective reparenting. > > > > Right. I think it may be a good way to handle the resources which may > outlive the process. > > > > The shared page cache is more complicated than bpf memory, so I'm not > > > quite sure if it can apply to shared page cache, but it can work well > > > for bpf memory. > > > > Yeah, this is the problem. It feels like it's a problem very specific > > to bpf maps and an exact way you use them. I don't think you can successfully > > advocate for changes of these calibre without a more generic problem. I might > > be wrong. > > > > What is your concern about this method? Are there any potential issues? The issue is simple: nobody wants to support a new non-trivial cgroup interface to solve a specific bpf accounting issue in one particular setup. Any new interface will become an API and has to be supported for many many years, so it has to be generic and future-proof. If you want to go this direction, please, show that it solves a _generic_ problem, not limited to a specific way how you use bpf maps in your specific setup. Accounting of a bpf map shared by many cgroups, which should outlive the original memory cgroups... Idk, maybe it's how many users are using bpf maps, but I don't hear it yet. There were some patches from Google folks about the tmpfs accounting, _maybe_ it's something to look at in order to get an idea about a more generic problem and solution. Thanks!
On Wed, Sep 21, 2022 at 7:15 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > On Tue, Sep 20, 2022 at 08:42:36PM +0800, Yafang Shao wrote: > > On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin > > <roman.gushchin@linux.dev> wrote: > > > > > > On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote: > > > > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin > > > > <roman.gushchin@linux.dev> wrote: > > > > > > > > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote: > > > > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote: > > > > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > > > > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote: > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote: > > > > > > > > > > ... > > > > > > > > > > > This patchset tries to resolve the above two issues by introducing a > > > > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to > > > > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further. > > > > > > > > > > > Possible use cases of the selectable memcg as follows, > > > > > > > > > > > > > > > > > > > > As discussed in the following thread, there are clear downsides to an > > > > > > > > > > interface which requires the users to specify the cgroups directly. > > > > > > > > > > > > > > > > > > > > https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org > > > > > > > > > > > > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping > > > > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that > > > > > > > > > > thread and continue there? > > > > > > > > > > > > > > > > > > > > > > > > > Hi Roman, > > > > > > > > > > > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non > > > > > > > > > bpf-specific problem in a bpf-specific way. > > > > > > > > > > > > > > > > > > > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific > > > > > > > > issue after so many discussions? > > > > > > > > Do you charge the bpf-map's memory the same way as you charge the page > > > > > > > > caches or slabs ? > > > > > > > > No, you don't. You charge it in a bpf-specific way. > > > > > > > > > > > > > > > > > > > Hi Roman, > > > > > > > > > > > > Sorry for the late response. > > > > > > I've been on vacation in the past few days. > > > > > > > > > > > > > The only difference is that we charge the cgroup of the processes who > > > > > > > created a map, not a process who is doing a specific allocation. > > > > > > > > > > > > This means the bpf-map can be indepent of process, IOW, the memcg of > > > > > > bpf-map can be indepent of the memcg of the processes. > > > > > > This is the fundamental difference between bpf-map and page caches, then... > > > > > > > > > > > > > Your patchset doesn't change this. > > > > > > > > > > > > We can make this behavior reasonable by introducing an independent > > > > > > memcg, as what I did in the previous version. > > > > > > > > > > > > > There are pros and cons with this approach, we've discussed it back > > > > > > > to the times when bpf memcg accounting was developed. If you want > > > > > > > to revisit this, it's maybe possible (given there is a really strong and likely > > > > > > > new motivation appears), but I haven't seen any complaints yet except from you. > > > > > > > > > > > > > > > > > > > memcg-base bpf accounting is a new feature, which may not be used widely. > > > > > > > > > > > > > > > > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well > > > > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps > > > > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added > > > > > > > > > complexity. Especially because a similar behaviour can be achieved simple > > > > > > > > > by placing the task which creates the map into the desired cgroup. > > > > > > > > > > > > > > > > Are you serious ? > > > > > > > > Have you ever read the cgroup doc? Which clearly describe the "No > > > > > > > > Internal Process Constraint".[1] > > > > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg. > > > > > > > > > > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup > > > > > > > and your memcg will get reparented. You can attach this process and create > > > > > > > a bpf map to the parent cgroup before it gets child cgroups. > > > > > > > > > > > > If the process doesn't exit after it created bpf-map, we have to > > > > > > migrate it around memcgs.... > > > > > > The complexity in deployment can introduce unexpected issues easily. > > > > > > > > > > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups. > > > > > > > Lof of options. > > > > > > > > > > > > > > > > > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt > > > > > > > > > > > > > > > > > Beatiful? Not. Neither is the proposed solution. > > > > > > > > > > > > > > > > > > > > > > > > > Is it really hard to admit a fault? > > > > > > > > > > > > > > Yafang, you posted several versions and so far I haven't seen much of support > > > > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm > > > > > > > nacking a patchset with many acks, reviews and supporters. > > > > > > > > > > > > > > Still think you're solving an important problem in a reasonable way? > > > > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead > > > > > > > of blaming me. > > > > > > > > > > > > > > > > > > > The best way so far is to introduce specific memcg for specific resources. > > > > > > Because not only the process owns its memcg, but also specific > > > > > > resources own their memcgs, for example bpf-map, or socket. > > > > > > > > > > > > struct bpf_map { <<<< memcg owner > > > > > > struct memcg_cgroup *memcg; > > > > > > }; > > > > > > > > > > > > struct sock { <<<< memcg owner > > > > > > struct mem_cgroup *sk_memcg; > > > > > > }; > > > > > > > > > > > > These resources already have their own memcgs, so we should make this > > > > > > behavior formal. > > > > > > > > > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'. > > > > > > > > > > This is a fundamental change: cgroups were always hierarchical groups > > > > > of processes/threads. You're basically suggesting to extend it to > > > > > hierarchical groups of processes and some other objects (what's a good > > > > > definition?). > > > > > > > > Kind of, but not exactly. > > > > We can do it without breaking the cgroup hierarchy. Under current > > > > cgroup hierarchy, the user can only echo processes/threads into a > > > > cgroup, that won't be changed in the future. The specific resources > > > > are not exposed to the user, the user can only control these specific > > > > resources by controlling their associated processes/threads. > > > > For example, > > > > > > > > Memcg-A > > > > |---- Memcg-A1 > > > > |---- Memcg-A2 > > > > > > > > We can introduce a new file memory.owner into each memcg. Each bit of > > > > memory.owner represents a specific resources, > > > > > > > > memory.owner: | bit31 | bitN | ... | bit1 | bit0 | > > > > | | > > > > |------ bit0: bpf memory > > > > | > > > > |-------------- bit1: socket memory > > > > | > > > > |--------------------------- > > > > bitN: a specific resource > > > > > > > > There won't be too many specific resources which have to own their > > > > memcgs, so I think 32bits is enough. > > > > > > > > Memcg-A : memory.owner == 0x1 > > > > |---- Memcg-A1 : memory.owner == 0 > > > > |---- Memcg-A2 : memory.owner == 0x1 > > > > > > > > Then the bpf created by processes in Memcg-A1 will be charged into > > > > Memcg-A directly without charging into Memcg-A1. > > > > But the bpf created by processes in Memcg-A2 will be charged into > > > > Memcg-A2 as its memory.owner is 0x1. > > > > That said, these specific resources are not fully independent of > > > > process, while they are still associated with the processes which > > > > create them. > > > > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we > > > > don't need to care about the possible migration issue. > > > > > > > > I think we may also apply it to shared page caches. For example, > > > > struct inode { > > > > struct mem_cgroup *memcg; <<<< add a new member > > > > }; > > > > > > > > We define struct inode as a memcg owner, and use scope-based charge to > > > > charge its pages into inode->memcg. > > > > And then put all memcgs which shared these resources under the same > > > > parent. The page caches of this inode will be charged into the parent > > > > directly. > > > > > > Ok, so it's something like premature selective reparenting. > > > > > > > Right. I think it may be a good way to handle the resources which may > > outlive the process. > > > > > > The shared page cache is more complicated than bpf memory, so I'm not > > > > quite sure if it can apply to shared page cache, but it can work well > > > > for bpf memory. > > > > > > Yeah, this is the problem. It feels like it's a problem very specific > > > to bpf maps and an exact way you use them. I don't think you can successfully > > > advocate for changes of these calibre without a more generic problem. I might > > > be wrong. > > > > > > > What is your concern about this method? Are there any potential issues? > > The issue is simple: nobody wants to support a new non-trivial cgroup interface > to solve a specific bpf accounting issue in one particular setup. Any new > interface will become an API and has to be supported for many many years, > so it has to be generic and future-proof. > > If you want to go this direction, please, show that it solves a _generic_ > problem, not limited to a specific way how you use bpf maps in your specific > setup. Accounting of a bpf map shared by many cgroups, which should outlive > the original memory cgroups... Idk, maybe it's how many users are using bpf > maps, but I don't hear it yet. > > There were some patches from Google folks about the tmpfs accounting, _maybe_ > it's something to look at in order to get an idea about a more generic problem > and solution. > Hmm... It seems that we are in a dilemma now. We can't fix it in memcg way, because the issue we are fixing it a bpf-specific issue. But we can't fix it in a bpf-specific way neither...