[bpf-next,v3,00/13] bpf: Introduce selectable memcg for bpf map

Message ID 20220902023003.47124-1-laoar.shao@gmail.com (mailing list archive)

Message

Yafang Shao Sept. 2, 2022, 2:29 a.m. UTC
In our production environment, we may load, run, and pin bpf programs and
maps in containers. For example, some of our networking bpf programs and
maps are loaded and pinned by a process running in a container in our
k8s environment. This container also runs other user applications which
watch the networking configurations on remote servers and apply them on
the local host, log error events, monitor the traffic, and perform some
other tasks. Sometimes we need to update these user applications to a new
release, and in this update process we destroy the old container and then
start a new generation. In order not to interrupt the bpf programs during
the update, we pin the bpf programs and maps in bpffs. That is the
background and use case in our production environment.

After switching to memcg-based bpf memory accounting to limit the bpf
memory, some unexpected issues surfaced:
1. The memory usage is not consistent between the first generation and
new generations.
2. After the first generation is destroyed, the bpf memory can't be
limited if the bpf maps are not preallocated, because they will be
reparented.

Besides, there is another issue: the bpf-map's memcg breaks the memcg
hierarchy, because a bpf-map has its own memcg. A bpf map can be written
to by tasks running in other memcgs; when a writer in another memcg
writes to a shared bpf map, the memory allocated by that write isn't
charged to the writer's memcg but to the bpf-map's own memcg instead.
IOW, the bpf-map is improperly treated as a task, while it is actually a
shared resource. This patchset doesn't resolve this issue. I will post
another RFC once I find a workable solution to address it.
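
For reference, this is roughly how the per-map charging works today (a
simplified sketch modeled on bpf_map_kmalloc_node() in kernel/bpf/syscall.c):
every allocation for a map is charged to the memcg saved at map creation
time, no matter which task triggers the allocation,

    void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                               gfp_t flags, int node)
    {
            struct mem_cgroup *old_memcg;
            void *ptr;

            /* charge to the memcg saved at map creation, not the caller's */
            old_memcg = set_active_memcg(map->memcg);
            ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
            set_active_memcg(old_memcg);

            return ptr;
    }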

This patchset tries to resolve the above two issues by introducing a
selectable memcg to limit the bpf memory. Currently we only allow
selecting an ancestor of the current memcg, to avoid breaking the memcg
hierarchy further. Possible use cases of the selectable memcg are as
follows:
- Select the root memcg as the bpf-map's memcg
  Then the bpf-map's memory won't be throttled by the current memcg's limit.
- Put the current memcg under a fixed memcg dir and select the fixed memcg
  as the bpf-map's memcg
  The hierarchy is as follows:

      Parent-memcg (A fixed dir, i.e. /sys/fs/cgroup/memory/bpf)
         \
        Current-memcg (Container dir, i.e. /sys/fs/cgroup/memory/bpf/foo)

  At map creation time, the bpf-map's memory will be charged into the
  parent directly without being charged into the current memcg, and thus
  the current memcg's usage will be consistent among different
  generations. To limit the bpf-map's memory usage, we can set the limit
  in the parent memcg (a conceptual sketch of this follows below).
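
  A conceptual sketch only, not the actual patch: the 'selected' argument
  below is purely illustrative, and the real interface is added by the
  last patch of this series. The idea is that the memcg saved at map
  creation becomes the user-chosen ancestor instead of the creator's memcg:

      static void bpf_map_save_memcg(struct bpf_map *map,
                                     struct mem_cgroup *selected)
      {
              if (selected)   /* selectable memcg: charge the chosen ancestor */
                      map->memcg = selected;
              else            /* current behavior: charge the creator's memcg */
                      map->memcg = get_mem_cgroup_from_mm(current->mm);
      }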

Currently it only supports bpf maps; we can extend it to bpf progs as
well.

Observability can also be supported in a next step, for example showing
the bpf map's memcg in 'bpftool map show', or even showing which maps are
charged to a specific memcg in 'bpftool cgroup show'. Furthermore, we may
also show an accurate memory size of a bpf map instead of an estimated
memory size in 'bpftool map show' in the future.

v2->v3:
- use css_tryget() instead of css_tryget_online() (Shakeel)
- add comment for get_obj_cgroup_from_cgroup() (Shakeel)
- add new memcg helper task_under_memcg_hierarchy()
- add restriction to allow selecting ancestor only to avoid breaking the
  memcg hierarchy further, per discussion with Tejun 

v1->v2:
- cgroup1 is also supported after
  commit f3a2aebdd6fb ("cgroup: enable cgroup_get_from_file() on cgroup1")
  So update the commit log.
- remove incorrect fix to mem_cgroup_put  (Shakeel,Roman,Muchun) 
- use cgroup_put() in bpf_map_save_memcg() (Shakeel)
- add detailed commit log for get_obj_cgroup_from_cgroup (Shakeel) 

RFC->v1:
- get rid of bpf_map container wrapper (Alexei)
- add the new field into the end of struct (Alexei)
- get rid of BPF_F_SELECTABLE_MEMCG (Alexei)
- save memcg in bpf_map_init_from_attr
- introduce bpf_ringbuf_pages_{alloc,free} and keep them inside
  kernel/bpf/ringbuf.c  (Andrii)

Yafang Shao (13):
  cgroup: Update the comment on cgroup_get_from_fd
  bpf: Introduce new helper bpf_map_put_memcg()
  bpf: Define bpf_map_{get,put}_memcg for !CONFIG_MEMCG_KMEM
  bpf: Call bpf_map_init_from_attr() immediately after map creation
  bpf: Save memcg in bpf_map_init_from_attr()
  bpf: Use scoped-based charge in bpf_map_area_alloc
  bpf: Introduce new helpers bpf_ringbuf_pages_{alloc,free}
  bpf: Use bpf_map_kzalloc in arraymap
  bpf: Use bpf_map_kvcalloc in bpf_local_storage
  mm, memcg: Add new helper get_obj_cgroup_from_cgroup
  mm, memcg: Add new helper task_under_memcg_hierarchy
  bpf: Add return value for bpf_map_init_from_attr
  bpf: Introduce selectable memcg for bpf map

 include/linux/bpf.h            |  40 +++++++++++-
 include/linux/memcontrol.h     |  25 ++++++++
 include/uapi/linux/bpf.h       |   1 +
 kernel/bpf/arraymap.c          |  34 +++++-----
 kernel/bpf/bloom_filter.c      |  11 +++-
 kernel/bpf/bpf_local_storage.c |  17 +++--
 kernel/bpf/bpf_struct_ops.c    |  19 +++---
 kernel/bpf/cpumap.c            |  17 +++--
 kernel/bpf/devmap.c            |  30 +++++----
 kernel/bpf/hashtab.c           |  26 +++++---
 kernel/bpf/local_storage.c     |  11 +++-
 kernel/bpf/lpm_trie.c          |  12 +++-
 kernel/bpf/offload.c           |  12 ++--
 kernel/bpf/queue_stack_maps.c  |  11 +++-
 kernel/bpf/reuseport_array.c   |  11 +++-
 kernel/bpf/ringbuf.c           | 104 ++++++++++++++++++++----------
 kernel/bpf/stackmap.c          |  13 ++--
 kernel/bpf/syscall.c           | 140 ++++++++++++++++++++++++++++-------------
 kernel/cgroup/cgroup.c         |   2 +-
 mm/memcontrol.c                |  48 ++++++++++++++
 net/core/sock_map.c            |  30 +++++----
 net/xdp/xskmap.c               |  12 +++-
 tools/include/uapi/linux/bpf.h |   1 +
 tools/lib/bpf/bpf.c            |   3 +-
 tools/lib/bpf/bpf.h            |   3 +-
 tools/lib/bpf/gen_loader.c     |   2 +-
 tools/lib/bpf/libbpf.c         |   2 +
 tools/lib/bpf/skel_internal.h  |   2 +-
 28 files changed, 462 insertions(+), 177 deletions(-)

Comments

Tejun Heo Sept. 7, 2022, 3:43 p.m. UTC | #1
Hello,

On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
...
> This patchset tries to resolve the above two issues by introducing a
> selectable memcg to limit the bpf memory. Currently we only allow to
> select its ancestor to avoid breaking the memcg hierarchy further. 
> Possible use cases of the selectable memcg as follows,

As discussed in the following thread, there are clear downsides to an
interface which requires the users to specify the cgroups directly.

 https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org

So, I don't really think this is an interface we wanna go for. I was hoping
to hear more from memcg folks in the above thread. Maybe ping them in that
thread and continue there?

Thanks.
Tejun Heo Sept. 7, 2022, 3:45 p.m. UTC | #2
On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> ...
> > This patchset tries to resolve the above two issues by introducing a
> > selectable memcg to limit the bpf memory. Currently we only allow to
> > select its ancestor to avoid breaking the memcg hierarchy further. 
> > Possible use cases of the selectable memcg as follows,
> 
> As discussed in the following thread, there are clear downsides to an
> interface which requires the users to specify the cgroups directly.
> 
>  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> 
> So, I don't really think this is an interface we wanna go for. I was hoping
> to hear more from memcg folks in the above thread. Maybe ping them in that
> thread and continue there?

Ah, another thing. If the memcg accounting is breaking things right now, we
can easily introduce a memcg disable flag for bpf memory. That should help
alleviate the immediate breakage while we figure this out.

Thanks.
Alexei Starovoitov Sept. 7, 2022, 4:13 p.m. UTC | #3
On Wed, Sep 7, 2022 at 8:45 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > ...
> > > This patchset tries to resolve the above two issues by introducing a
> > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > Possible use cases of the selectable memcg as follows,
> >
> > As discussed in the following thread, there are clear downsides to an
> > interface which requires the users to specify the cgroups directly.
> >
> >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> >
> > So, I don't really think this is an interface we wanna go for. I was hoping
> > to hear more from memcg folks in the above thread. Maybe ping them in that
> > thread and continue there?
>
> Ah, another thing. If the memcg accounting is breaking things right now, we
> can easily introduce a memcg disable flag for bpf memory. That should help
> alleviating the immediate breakage while we figure this out.

Hmm. We discussed this option already. We definitely don't want
to introduce an uapi knob that will allow anyone to skip memcg
accounting today and in the future.
Tejun Heo Sept. 7, 2022, 4:18 p.m. UTC | #4
Hello,

On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> Hmm. We discussed this option already. We definitely don't want
> to introduce an uapi knob that will allow anyone to skip memcg
> accounting today and in the future.

The cgroup.memory boot parameter is how memcg provides last-resort
workarounds for this sort of problem / regression while it's being
addressed. It's not a dynamically changeable or programmable thing. Just a
boot time opt-out. That said, if you don't want it, you don't want it.

Thanks.
Alexei Starovoitov Sept. 7, 2022, 4:27 p.m. UTC | #5
On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> > Hmm. We discussed this option already. We definitely don't want
> > to introduce an uapi knob that will allow anyone to skip memcg
> > accounting today and in the future.
>
> cgroup.memory boot parameter is how memcg provides last-resort workarounds
> for this sort of problems / regressions while they're being addressed. It's
> not a dynamically changeable or programmable thing. Just a boot time
> opt-out. That said, if you don't want it, you don't want it.

ahh. boot param.
Are you suggesting a global off switch ? Like nosocket and nokmem.
That would be a different story.
Need to think more about it. It could be ok.
Tejun Heo Sept. 7, 2022, 5:01 p.m. UTC | #6
On Wed, Sep 07, 2022 at 09:27:09AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> > > Hmm. We discussed this option already. We definitely don't want
> > > to introduce an uapi knob that will allow anyone to skip memcg
> > > accounting today and in the future.
> >
> > cgroup.memory boot parameter is how memcg provides last-resort workarounds
> > for this sort of problems / regressions while they're being addressed. It's
> > not a dynamically changeable or programmable thing. Just a boot time
> > opt-out. That said, if you don't want it, you don't want it.
> 
> ahh. boot param.
> Are you suggesting a global off switch ? Like nosocket and nokmem.
> That would be a different story.
> Need to think more about it. It could be ok.

Yeah, nobpf or sth like that. An equivalent cgroup.memory parameter.

Thanks.
Roman Gushchin Sept. 7, 2022, 10:28 p.m. UTC | #7
On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> ...
> > This patchset tries to resolve the above two issues by introducing a
> > selectable memcg to limit the bpf memory. Currently we only allow to
> > select its ancestor to avoid breaking the memcg hierarchy further. 
> > Possible use cases of the selectable memcg as follows,
> 
> As discussed in the following thread, there are clear downsides to an
> interface which requires the users to specify the cgroups directly.
> 
>  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> 
> So, I don't really think this is an interface we wanna go for. I was hoping
> to hear more from memcg folks in the above thread. Maybe ping them in that
> thread and continue there?

As I said previously, I don't like it, because it's an attempt to solve a non
bpf-specific problem in a bpf-specific way.

Yes, memory cgroups are not great for accounting of shared resources, it's well
known. This patchset looks like an attempt to "fix" it specifically for bpf maps
in a particular cgroup setup. Honestly, I don't think it's worth the added
complexity, especially because a similar behaviour can be achieved simply
by placing the task which creates the map into the desired cgroup.
Beautiful? No. Neither is the proposed solution.

Thanks!
Yafang Shao Sept. 8, 2022, 2:37 a.m. UTC | #8
On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > ...
> > > This patchset tries to resolve the above two issues by introducing a
> > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > Possible use cases of the selectable memcg as follows,
> >
> > As discussed in the following thread, there are clear downsides to an
> > interface which requires the users to specify the cgroups directly.
> >
> >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> >
> > So, I don't really think this is an interface we wanna go for. I was hoping
> > to hear more from memcg folks in the above thread. Maybe ping them in that
> > thread and continue there?
>

Hi Roman,

> As I said previously, I don't like it, because it's an attempt to solve a non
> bpf-specific problem in a bpf-specific way.
>

Why do you still insist that bpf_map->memcg is not a bpf-specific
issue after so many discussions?
Do you charge the bpf-map's memory the same way as you charge the page
caches or slabs ?
No, you don't. You charge it in a bpf-specific way.

> Yes, memory cgroups are not great for accounting of shared resources, it's well
> known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> in a particular cgroup setup. Honestly, I don't think it's worth the added
> complexity. Especially because a similar behaviour can be achieved simple
> by placing the task which creates the map into the desired cgroup.

Are you serious?
Have you ever read the cgroup doc? It clearly describes the "No
Internal Process Constraint". [1]
Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.

[1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt

> Beatiful? Not. Neither is the proposed solution.
>

Is it really hard to admit a fault?
Alexei Starovoitov Sept. 8, 2022, 2:43 a.m. UTC | #9
On Wed, Sep 7, 2022 at 7:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > ...
> > > > This patchset tries to resolve the above two issues by introducing a
> > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > Possible use cases of the selectable memcg as follows,
> > >
> > > As discussed in the following thread, there are clear downsides to an
> > > interface which requires the users to specify the cgroups directly.
> > >
> > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > >
> > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > thread and continue there?
> >
>
> Hi Roman,
>
> > As I said previously, I don't like it, because it's an attempt to solve a non
> > bpf-specific problem in a bpf-specific way.
> >
>
> Why do you still insist that bpf_map->memcg is not a bpf-specific
> issue after so many discussions?
> Do you charge the bpf-map's memory the same way as you charge the page
> caches or slabs ?
> No, you don't. You charge it in a bpf-specific way.
>
> > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > complexity. Especially because a similar behaviour can be achieved simple
> > by placing the task which creates the map into the desired cgroup.
>
> Are you serious ?
> Have you ever read the cgroup doc? Which clearly describe the "No
> Internal Process Constraint".[1]
> Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
>
> [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
>
> > Beatiful? Not. Neither is the proposed solution.
> >
>
> Is it really hard to admit a fault?

Yafang,

This attitude won't get you anywhere.

Selecting memcg by fd is no go.
You need to work with the community to figure out a solution
acceptable to maintainers of relevant subsystems.
Yafang Shao Sept. 8, 2022, 2:44 a.m. UTC | #10
On Thu, Sep 8, 2022 at 1:01 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Sep 07, 2022 at 09:27:09AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 7, 2022 at 9:18 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Sep 07, 2022 at 09:13:09AM -0700, Alexei Starovoitov wrote:
> > > > Hmm. We discussed this option already. We definitely don't want
> > > > to introduce an uapi knob that will allow anyone to skip memcg
> > > > accounting today and in the future.
> > >
> > > cgroup.memory boot parameter is how memcg provides last-resort workarounds
> > > for this sort of problems / regressions while they're being addressed. It's
> > > not a dynamically changeable or programmable thing. Just a boot time
> > > opt-out. That said, if you don't want it, you don't want it.
> >
> > ahh. boot param.
> > Are you suggesting a global off switch ? Like nosocket and nokmem.
> > That would be a different story.
> > Need to think more about it. It could be ok.
>
> Yeah, nobpf or sth like that. An equivalent cgroup.memory parameter.
>

It may be a useful feature for some cases, but it can't help container users.
The memcg works well for limiting non-pinned bpf maps; that's the
reason why we, as container users, switched to memcg-based bpf charging.
Our goal is to make it also work for pinned bpf maps.

That said, your proposal may be a useful feature, but it should be
a separate patchset.
Yafang Shao Sept. 8, 2022, 2:48 a.m. UTC | #11
On Thu, Sep 8, 2022 at 10:43 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Sep 7, 2022 at 7:37 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > ...
> > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > Possible use cases of the selectable memcg as follows,
> > > >
> > > > As discussed in the following thread, there are clear downsides to an
> > > > interface which requires the users to specify the cgroups directly.
> > > >
> > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > >
> > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > thread and continue there?
> > >
> >
> > Hi Roman,
> >
> > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > bpf-specific problem in a bpf-specific way.
> > >
> >
> > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > issue after so many discussions?
> > Do you charge the bpf-map's memory the same way as you charge the page
> > caches or slabs ?
> > No, you don't. You charge it in a bpf-specific way.
> >
> > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > complexity. Especially because a similar behaviour can be achieved simple
> > > by placing the task which creates the map into the desired cgroup.
> >
> > Are you serious ?
> > Have you ever read the cgroup doc? Which clearly describe the "No
> > Internal Process Constraint".[1]
> > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> >
> > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> >
> > > Beatiful? Not. Neither is the proposed solution.
> > >
> >
> > Is it really hard to admit a fault?
>
> Yafang,
>
> This attitude won't get you anywhere.
>

Thanks for pointing it out. It is my fault.

> Selecting memcg by fd is no go.
> You need to work with the community to figure out a solution
> acceptable to maintainers of relevant subsystems.

Yes, I'm trying.
Roman Gushchin Sept. 8, 2022, 4:13 p.m. UTC | #12
On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > ...
> > > > This patchset tries to resolve the above two issues by introducing a
> > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > Possible use cases of the selectable memcg as follows,
> > >
> > > As discussed in the following thread, there are clear downsides to an
> > > interface which requires the users to specify the cgroups directly.
> > >
> > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > >
> > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > thread and continue there?
> >
> 
> Hi Roman,
> 
> > As I said previously, I don't like it, because it's an attempt to solve a non
> > bpf-specific problem in a bpf-specific way.
> >
> 
> Why do you still insist that bpf_map->memcg is not a bpf-specific
> issue after so many discussions?
> Do you charge the bpf-map's memory the same way as you charge the page
> caches or slabs ?
> No, you don't. You charge it in a bpf-specific way.

The only difference is that we charge the cgroup of the process which
created the map, not of the process which is doing a specific allocation.
Your patchset doesn't change this.
There are pros and cons with this approach; we discussed them back
when bpf memcg accounting was developed. If you want to revisit this,
it may be possible (given a really strong and likely new motivation
appears), but I haven't seen any complaints yet except from you.

> 
> > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > complexity. Especially because a similar behaviour can be achieved simple
> > by placing the task which creates the map into the desired cgroup.
> 
> Are you serious ?
> Have you ever read the cgroup doc? Which clearly describe the "No
> Internal Process Constraint".[1]
> Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.

But you can place it into another leaf cgroup. You can delete this leaf cgroup
and your memcg will get reparented. You can attach this process to the parent
cgroup and create a bpf map there before it gets child cgroups.
You can revisit the idea of shared bpf maps that outlive specific cgroups.
Lots of options.

> 
> [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> 
> > Beatiful? Not. Neither is the proposed solution.
> >
> 
> Is it really hard to admit a fault?

Yafang, you posted several versions and so far I haven't seen much support
or excitement from anyone (please correct me if I'm wrong). It's not like I'm
nacking a patchset with many acks, reviews and supporters.

Still think you're solving an important problem in a reasonable way?
It seems like not many are convinced yet. I'd recommend focusing on this
instead of blaming me.

Thanks!
Yafang Shao Sept. 13, 2022, 6:15 a.m. UTC | #13
On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > ...
> > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > Possible use cases of the selectable memcg as follows,
> > > >
> > > > As discussed in the following thread, there are clear downsides to an
> > > > interface which requires the users to specify the cgroups directly.
> > > >
> > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > >
> > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > thread and continue there?
> > >
> >
> > Hi Roman,
> >
> > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > bpf-specific problem in a bpf-specific way.
> > >
> >
> > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > issue after so many discussions?
> > Do you charge the bpf-map's memory the same way as you charge the page
> > caches or slabs ?
> > No, you don't. You charge it in a bpf-specific way.
>

Hi Roman,

Sorry for the late response.
I've been on vacation in the past few days.

> The only difference is that we charge the cgroup of the processes who
> created a map, not a process who is doing a specific allocation.

This means the bpf-map can be independent of a process, IOW the memcg of
a bpf-map can be independent of the memcg of the processes.
This is the fundamental difference between bpf-maps and page caches, then...

> Your patchset doesn't change this.

We can make this behavior reasonable by introducing an independent
memcg, as I did in the previous version.

> There are pros and cons with this approach, we've discussed it back
> to the times when bpf memcg accounting was developed. If you want
> to revisit this, it's maybe possible (given there is a really strong and likely
> new motivation appears), but I haven't seen any complaints yet except from you.
>

memcg-based bpf accounting is a new feature, which may not be used widely.

> >
> > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > complexity. Especially because a similar behaviour can be achieved simple
> > > by placing the task which creates the map into the desired cgroup.
> >
> > Are you serious ?
> > Have you ever read the cgroup doc? Which clearly describe the "No
> > Internal Process Constraint".[1]
> > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
>
> But you can place it into another leaf cgroup. You can delete this leaf cgroup
> and your memcg will get reparented. You can attach this process and create
> a bpf map to the parent cgroup before it gets child cgroups.

If the process doesn't exit after it creates the bpf-map, we have to
migrate it between memcgs....
The complexity in deployment can easily introduce unexpected issues.

> You can revisit the idea of shared bpf maps and outlive specific cgroups.
> Lof of options.
>
> >
> > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> >
> > > Beatiful? Not. Neither is the proposed solution.
> > >
> >
> > Is it really hard to admit a fault?
>
> Yafang, you posted several versions and so far I haven't seen much of support
> or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> nacking a patchset with many acks, reviews and supporters.
>
> Still think you're solving an important problem in a reasonable way?
> It seems like not many are convinced yet. I'd recommend to focus on this instead
> of blaming me.
>

The best way so far is to introduce a specific memcg for specific resources,
because not only does a process own its memcg, but specific
resources, for example a bpf-map or a socket, also own their memcgs.

struct bpf_map {                                 <<<< memcg owner
    struct mem_cgroup *memcg;
};

struct sock {                                       <<<< memcg owner
    struct mem_cgroup *sk_memcg;
};

These resources already have their own memcgs, so we should make this
behavior formal.

The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
Roman Gushchin Sept. 16, 2022, 4:53 p.m. UTC | #14
On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > Hello,
> > > > >
> > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > ...
> > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > Possible use cases of the selectable memcg as follows,
> > > > >
> > > > > As discussed in the following thread, there are clear downsides to an
> > > > > interface which requires the users to specify the cgroups directly.
> > > > >
> > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > >
> > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > thread and continue there?
> > > >
> > >
> > > Hi Roman,
> > >
> > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > bpf-specific problem in a bpf-specific way.
> > > >
> > >
> > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > issue after so many discussions?
> > > Do you charge the bpf-map's memory the same way as you charge the page
> > > caches or slabs ?
> > > No, you don't. You charge it in a bpf-specific way.
> >
> 
> Hi Roman,
> 
> Sorry for the late response.
> I've been on vacation in the past few days.
> 
> > The only difference is that we charge the cgroup of the processes who
> > created a map, not a process who is doing a specific allocation.
> 
> This means the bpf-map can be indepent of process, IOW, the memcg of
> bpf-map can be indepent of the memcg of the processes.
> This is the fundamental difference between bpf-map and page caches, then...
> 
> > Your patchset doesn't change this.
> 
> We can make this behavior reasonable by introducing an independent
> memcg, as what I did in the previous version.
> 
> > There are pros and cons with this approach, we've discussed it back
> > to the times when bpf memcg accounting was developed. If you want
> > to revisit this, it's maybe possible (given there is a really strong and likely
> > new motivation appears), but I haven't seen any complaints yet except from you.
> >
> 
> memcg-base bpf accounting is a new feature, which may not be used widely.
> 
> > >
> > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > by placing the task which creates the map into the desired cgroup.
> > >
> > > Are you serious ?
> > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > Internal Process Constraint".[1]
> > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> >
> > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > and your memcg will get reparented. You can attach this process and create
> > a bpf map to the parent cgroup before it gets child cgroups.
> 
> If the process doesn't exit after it created bpf-map, we have to
> migrate it around memcgs....
> The complexity in deployment can introduce unexpected issues easily.
> 
> > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > Lof of options.
> >
> > >
> > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > >
> > > > Beatiful? Not. Neither is the proposed solution.
> > > >
> > >
> > > Is it really hard to admit a fault?
> >
> > Yafang, you posted several versions and so far I haven't seen much of support
> > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > nacking a patchset with many acks, reviews and supporters.
> >
> > Still think you're solving an important problem in a reasonable way?
> > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > of blaming me.
> >
> 
> The best way so far is to introduce specific memcg for specific resources.
> Because not only the process owns its memcg, but also specific
> resources own their memcgs, for example bpf-map, or socket.
> 
> struct bpf_map {                                 <<<< memcg owner
>     struct memcg_cgroup *memcg;
> };
> 
> struct sock {                                       <<<< memcg owner
>     struct mem_cgroup *sk_memcg;
> };
> 
> These resources already have their own memcgs, so we should make this
> behavior formal.
> 
> The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.

This is a fundamental change: cgroups have always been hierarchical groups
of processes/threads. You're basically suggesting extending them to
hierarchical groups of processes and some other objects (what's a good
definition?). It's a huge change and its scope is definitely larger
than bpf and even memory cgroups. It will raise a lot of questions,
e.g. what does it mean for other controllers (cpu, io, etc)?
Which objects can have dedicated cgroups and which can not? What will the
interface look like? How will the oom handling work? Etc.

History has shown that starting small with one controller and/or a specific
use case doesn't work well for cgroups, because different resources and
controllers do not live independently. So if you really want to go this way,
you need to present the whole picture and convince many people that it's worth
it. I doubt this specific bpf map accounting thing can justify it.

Personally I know some examples where such functionality could be useful,
but I'm not yet convinced it's worth the effort and potential problems.

Thanks!
Yafang Shao Sept. 18, 2022, 3:44 a.m. UTC | #15
On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >
> > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > Hello,
> > > > > >
> > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > ...
> > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > >
> > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > interface which requires the users to specify the cgroups directly.
> > > > > >
> > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > >
> > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > thread and continue there?
> > > > >
> > > >
> > > > Hi Roman,
> > > >
> > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > bpf-specific problem in a bpf-specific way.
> > > > >
> > > >
> > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > issue after so many discussions?
> > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > caches or slabs ?
> > > > No, you don't. You charge it in a bpf-specific way.
> > >
> >
> > Hi Roman,
> >
> > Sorry for the late response.
> > I've been on vacation in the past few days.
> >
> > > The only difference is that we charge the cgroup of the processes who
> > > created a map, not a process who is doing a specific allocation.
> >
> > This means the bpf-map can be indepent of process, IOW, the memcg of
> > bpf-map can be indepent of the memcg of the processes.
> > This is the fundamental difference between bpf-map and page caches, then...
> >
> > > Your patchset doesn't change this.
> >
> > We can make this behavior reasonable by introducing an independent
> > memcg, as what I did in the previous version.
> >
> > > There are pros and cons with this approach, we've discussed it back
> > > to the times when bpf memcg accounting was developed. If you want
> > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > new motivation appears), but I haven't seen any complaints yet except from you.
> > >
> >
> > memcg-base bpf accounting is a new feature, which may not be used widely.
> >
> > > >
> > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > by placing the task which creates the map into the desired cgroup.
> > > >
> > > > Are you serious ?
> > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > Internal Process Constraint".[1]
> > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > >
> > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > and your memcg will get reparented. You can attach this process and create
> > > a bpf map to the parent cgroup before it gets child cgroups.
> >
> > If the process doesn't exit after it created bpf-map, we have to
> > migrate it around memcgs....
> > The complexity in deployment can introduce unexpected issues easily.
> >
> > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > Lof of options.
> > >
> > > >
> > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > >
> > > > > Beatiful? Not. Neither is the proposed solution.
> > > > >
> > > >
> > > > Is it really hard to admit a fault?
> > >
> > > Yafang, you posted several versions and so far I haven't seen much of support
> > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > nacking a patchset with many acks, reviews and supporters.
> > >
> > > Still think you're solving an important problem in a reasonable way?
> > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > of blaming me.
> > >
> >
> > The best way so far is to introduce specific memcg for specific resources.
> > Because not only the process owns its memcg, but also specific
> > resources own their memcgs, for example bpf-map, or socket.
> >
> > struct bpf_map {                                 <<<< memcg owner
> >     struct memcg_cgroup *memcg;
> > };
> >
> > struct sock {                                       <<<< memcg owner
> >     struct mem_cgroup *sk_memcg;
> > };
> >
> > These resources already have their own memcgs, so we should make this
> > behavior formal.
> >
> > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
>
> This is a fundamental change: cgroups were always hierarchical groups
> of processes/threads. You're basically suggesting to extend it to
> hierarchical groups of processes and some other objects (what's a good
> definition?).

Kind of, but not exactly.
We can do it without breaking the cgroup hierarchy. Under the current
cgroup hierarchy, the user can only echo processes/threads into a
cgroup, and that won't change in the future. The specific resources
are not exposed to the user; the user can only control these specific
resources by controlling their associated processes/threads.
For example,

                Memcg-A
                       |---- Memcg-A1
                       |---- Memcg-A2

We can introduce a new file memory.owner into each memcg. Each bit of
memory.owner represents a specific resource,

 memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
                           |            |      |
                           |            |      +------ bit0: bpf memory
                           |            +------------- bit1: socket memory
                           +-------------------------- bitN: a specific resource

There won't be too many specific resources which have to own their
memcgs, so I think 32 bits are enough.

                Memcg-A : memory.owner == 0x1
                       |---- Memcg-A1 : memory.owner == 0
                       |---- Memcg-A2 : memory.owner == 0x1

Then the bpf memory created by processes in Memcg-A1 will be charged into
Memcg-A directly without being charged into Memcg-A1,
but the bpf memory created by processes in Memcg-A2 will be charged into
Memcg-A2, as its memory.owner is 0x1.
That said, these specific resources are not fully independent of
processes; they are still associated with the processes which
create them.
Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
don't need to care about the possible migration issue.
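
A rough sketch of how such a memory.owner bit could be consulted at
charge time (purely illustrative; neither memory.owner nor the
owner_mask field below exists in the kernel today):

    #define MEMCG_OWNER_BPF    (1U << 0)    /* bit0: bpf memory */
    #define MEMCG_OWNER_SOCK   (1U << 1)    /* bit1: socket memory */

    /* Walk up from the creator's memcg to the nearest memcg which owns
     * this resource type, e.g. Memcg-A1 (owner == 0) -> Memcg-A (owner == 0x1).
     */
    struct mem_cgroup *memcg_resource_owner(struct mem_cgroup *memcg,
                                            unsigned int owner_bit)
    {
            while (memcg && !(memcg->owner_mask & owner_bit))
                    memcg = parent_mem_cgroup(memcg);

            return memcg;
    }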

I think we may also apply it to shared page caches.  For example,
      struct inode {
          struct mem_cgroup *memcg;          <<<< add a new member
      };

We define struct inode as a memcg owner and use scope-based charging to
charge its pages into inode->memcg,
and then put all memcgs which share these resources under the same
parent. The page cache of this inode will be charged into the parent
directly.
Shared page cache is more complicated than bpf memory, so I'm not
quite sure whether this can apply to it, but it can work well
for bpf memory.

Regarding observability, we can introduce a specific item into
memory.stat for this specific memory, for example a new item 'bpf' for
bpf memory.
That can be accounted/unaccounted in the same way as we do with
set_active_memcg(), for example:

    struct task_struct {
        struct mem_cgroup  *active_memcg;
        int                 active_memcg_item;   <<<< introduce a new member
    };

    bpf_memory_alloc()
    {
             old_memcg = set_active_memcg(memcg);
             old_item = set_active_memcg_item(MEMCG_BPF);
             alloc();
             set_active_memcg_item(old_item);
             set_active_memcg(old_memcg);
    }

    bpf_memory_free()
    {
             old = set_active_memcg_item(MEMCG_BPF);
             free();
             set_active_memcg_item(old);
    }

Then we can know the bpf memory size in each memcg, and possibly the
system-wide bpf memory usage if it can be supported in the root memcg
(currently kmem is not charged to the root memcg).

> It's a huge change and it's scope is definetely larger
> than bpf and even memory cgroups. It will raise a lot of questions:
> e.g. what does it mean for other controllers (cpu, io, etc)?
> Which objects can have dedicated cgroups and which not? How the interface will
> look like? How the oom handling will work? Etc.
>

With the above hierarchy, I think the oom handling can work as-is.

> The history showed that starting small with one controller and/or specific
> use case isn't working well for cgroups, because different resources and
> controllers are not living independently.

Agreed. That is why I still associate the specific resources with the
processes which create them.
If all the resources are still associated with processes, I think it
can work well.

> So if you really want to go this way
> you need to present the whole picture and convince many people that it's worth
> it. I doubt this specific bpf map accounting thing can justify it.
>
> Personally I know some examples where such functionality could be useful,
> but I'm not yet convinced it's worth the effort and potential problems.
>

Thanks for your reply.
Roman Gushchin Sept. 20, 2022, 2:40 a.m. UTC | #16
On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > >
> > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > ...
> > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > >
> > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > >
> > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > >
> > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > thread and continue there?
> > > > > >
> > > > >
> > > > > Hi Roman,
> > > > >
> > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > bpf-specific problem in a bpf-specific way.
> > > > > >
> > > > >
> > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > issue after so many discussions?
> > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > caches or slabs ?
> > > > > No, you don't. You charge it in a bpf-specific way.
> > > >
> > >
> > > Hi Roman,
> > >
> > > Sorry for the late response.
> > > I've been on vacation in the past few days.
> > >
> > > > The only difference is that we charge the cgroup of the processes who
> > > > created a map, not a process who is doing a specific allocation.
> > >
> > > This means the bpf-map can be indepent of process, IOW, the memcg of
> > > bpf-map can be indepent of the memcg of the processes.
> > > This is the fundamental difference between bpf-map and page caches, then...
> > >
> > > > Your patchset doesn't change this.
> > >
> > > We can make this behavior reasonable by introducing an independent
> > > memcg, as what I did in the previous version.
> > >
> > > > There are pros and cons with this approach, we've discussed it back
> > > > to the times when bpf memcg accounting was developed. If you want
> > > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > > new motivation appears), but I haven't seen any complaints yet except from you.
> > > >
> > >
> > > memcg-base bpf accounting is a new feature, which may not be used widely.
> > >
> > > > >
> > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > > by placing the task which creates the map into the desired cgroup.
> > > > >
> > > > > Are you serious ?
> > > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > > Internal Process Constraint".[1]
> > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > >
> > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > and your memcg will get reparented. You can attach this process and create
> > > > a bpf map to the parent cgroup before it gets child cgroups.
> > >
> > > If the process doesn't exit after it created bpf-map, we have to
> > > migrate it around memcgs....
> > > The complexity in deployment can introduce unexpected issues easily.
> > >
> > > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > > Lof of options.
> > > >
> > > > >
> > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > >
> > > > > > Beatiful? Not. Neither is the proposed solution.
> > > > > >
> > > > >
> > > > > Is it really hard to admit a fault?
> > > >
> > > > Yafang, you posted several versions and so far I haven't seen much of support
> > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > > nacking a patchset with many acks, reviews and supporters.
> > > >
> > > > Still think you're solving an important problem in a reasonable way?
> > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > of blaming me.
> > > >
> > >
> > > The best way so far is to introduce specific memcg for specific resources.
> > > Because not only the process owns its memcg, but also specific
> > > resources own their memcgs, for example bpf-map, or socket.
> > >
> > > struct bpf_map {                                 <<<< memcg owner
> > >     struct memcg_cgroup *memcg;
> > > };
> > >
> > > struct sock {                                       <<<< memcg owner
> > >     struct mem_cgroup *sk_memcg;
> > > };
> > >
> > > These resources already have their own memcgs, so we should make this
> > > behavior formal.
> > >
> > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> >
> > This is a fundamental change: cgroups were always hierarchical groups
> > of processes/threads. You're basically suggesting to extend it to
> > hierarchical groups of processes and some other objects (what's a good
> > definition?).
> 
> Kind of, but not exactly.
> We can do it without breaking the cgroup hierarchy. Under current
> cgroup hierarchy, the user can only echo processes/threads into a
> cgroup, that won't be changed in the future. The specific resources
> are not exposed to the user, the user can only control these specific
> resources by controlling their associated processes/threads.
> For example,
> 
>                 Memcg-A
>                        |---- Memcg-A1
>                        |---- Memcg-A2
> 
> We can introduce a new file memory.owner into each memcg. Each bit of
> memory.owner represents a specific resources,
> 
>  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
>                                          |               |
> |------ bit0: bpf memory
>                                          |
> |-------------- bit1: socket memory
>                                          |
>                                          |---------------------------
> bitN: a specific resource
> 
> There won't be too many specific resources which have to own their
> memcgs, so I think 32bits is enough.
> 
>                 Memcg-A : memory.owner == 0x1
>                        |---- Memcg-A1 : memory.owner == 0
>                        |---- Memcg-A2 : memory.owner == 0x1
> 
> Then the bpf created by processes in Memcg-A1 will be charged into
> Memcg-A directly without charging into Memcg-A1.
> But the bpf created by processes in Memcg-A2 will be charged into
> Memcg-A2 as its memory.owner is 0x1.
> That said, these specific resources are not fully independent of
> process, while they are still associated with the processes which
> create them.
> Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> don't need to care about the possible migration issue.
> 
> I think we may also apply it to shared page caches.  For example,
>       struct inode {
>           struct mem_cgroup *memcg;          <<<< add a new member
>       };
> 
> We define struct inode as a memcg owner, and use scope-based charge to
> charge its pages into inode->memcg.
> And then put all memcgs which shared these resources under the same
> parent. The page caches of this inode will be charged into the parent
> directly.

Ok, so it's something like premature selective reparenting.

> The shared page cache is more complicated than bpf memory, so I'm not
> quite sure if it can apply to shared page cache, but it can work well
> for bpf memory.

Yeah, this is the problem. It feels like it's a problem very specific
to bpf maps and the exact way you use them. I don't think you can successfully
advocate for changes of this calibre without a more generic problem. I might
be wrong.

> 
> Regarding the observability, we can introduce a specific item into
> memory.stat for this specific memory. For example a new item 'bpf' for
> bpf memory.
> That can be accounted/unaccounted for in the same way as we do in
> set_active_memcg(). for example,
> 
>     struct task_struct {
>         struct mem_cgroup  *active_memcg;
>         int                             active_memcg_item;   <<<<
> introduce a new member
>     };
> 
>     bpf_memory_alloc()
>     {
>              old_memcg = set_active_memcg(memcg);
>              old_item = set_active_memcg_item(MEMCG_BPF);

I thought about something like this but for a different purpose:
to track the amount of memory consumed by bpf.

>              alloc();
>              set_active_memcg_item(old_item);
>              set_active_memcg(old_memcg);
>     }
> 
>     bpf_memory_free()
>     {
>              old = set_active_memcg_item(MEMCG_BPF);
>              free();
>              set_active_memcg_item(old);
>     }

But the problem is that we should very carefully mark all allocations and
releases, which is very error-prone. Interfaces which don't require annotating
releases are generally better, but require additional memory.

Thanks!
Yafang Shao Sept. 20, 2022, 12:42 p.m. UTC | #17
On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > > ...
> > > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > > >
> > > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > > >
> > > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > > >
> > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > > thread and continue there?
> > > > > > >
> > > > > >
> > > > > > Hi Roman,
> > > > > >
> > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > > bpf-specific problem in a bpf-specific way.
> > > > > > >
> > > > > >
> > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > > issue after so many discussions?
> > > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > > caches or slabs ?
> > > > > > No, you don't. You charge it in a bpf-specific way.
> > > > >
> > > >
> > > > Hi Roman,
> > > >
> > > > Sorry for the late response.
> > > > I've been on vacation in the past few days.
> > > >
> > > > > The only difference is that we charge the cgroup of the processes who
> > > > > created a map, not a process who is doing a specific allocation.
> > > >
> > > > This means the bpf-map can be indepent of process, IOW, the memcg of
> > > > bpf-map can be indepent of the memcg of the processes.
> > > > This is the fundamental difference between bpf-map and page caches, then...
> > > >
> > > > > Your patchset doesn't change this.
> > > >
> > > > We can make this behavior reasonable by introducing an independent
> > > > memcg, as what I did in the previous version.
> > > >
> > > > > There are pros and cons with this approach, we've discussed it back
> > > > > to the times when bpf memcg accounting was developed. If you want
> > > > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > > > new motivation appears), but I haven't seen any complaints yet except from you.
> > > > >
> > > >
> > > > memcg-base bpf accounting is a new feature, which may not be used widely.
> > > >
> > > > > >
> > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > > > by placing the task which creates the map into the desired cgroup.
> > > > > >
> > > > > > Are you serious ?
> > > > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > > > Internal Process Constraint".[1]
> > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > > >
> > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > > and your memcg will get reparented. You can attach this process and create
> > > > > a bpf map to the parent cgroup before it gets child cgroups.
> > > >
> > > > If the process doesn't exit after it created bpf-map, we have to
> > > > migrate it around memcgs....
> > > > The complexity in deployment can introduce unexpected issues easily.
> > > >
> > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > > > Lof of options.
> > > > >
> > > > > >
> > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > > >
> > > > > > > Beatiful? Not. Neither is the proposed solution.
> > > > > > >
> > > > > >
> > > > > > Is it really hard to admit a fault?
> > > > >
> > > > > Yafang, you posted several versions and so far I haven't seen much of support
> > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > > > nacking a patchset with many acks, reviews and supporters.
> > > > >
> > > > > Still think you're solving an important problem in a reasonable way?
> > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > > of blaming me.
> > > > >
> > > >
> > > > The best way so far is to introduce specific memcg for specific resources.
> > > > Because not only the process owns its memcg, but also specific
> > > > resources own their memcgs, for example bpf-map, or socket.
> > > >
> > > > struct bpf_map {                                 <<<< memcg owner
> > > >     struct memcg_cgroup *memcg;
> > > > };
> > > >
> > > > struct sock {                                       <<<< memcg owner
> > > >     struct mem_cgroup *sk_memcg;
> > > > };
> > > >
> > > > These resources already have their own memcgs, so we should make this
> > > > behavior formal.
> > > >
> > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> > >
> > > This is a fundamental change: cgroups were always hierarchical groups
> > > of processes/threads. You're basically suggesting to extend it to
> > > hierarchical groups of processes and some other objects (what's a good
> > > definition?).
> >
> > Kind of, but not exactly.
> > We can do it without breaking the cgroup hierarchy. Under current
> > cgroup hierarchy, the user can only echo processes/threads into a
> > cgroup, that won't be changed in the future. The specific resources
> > are not exposed to the user, the user can only control these specific
> > resources by controlling their associated processes/threads.
> > For example,
> >
> >                 Memcg-A
> >                        |---- Memcg-A1
> >                        |---- Memcg-A2
> >
> > We can introduce a new file memory.owner into each memcg. Each bit of
> > memory.owner represents a specific resources,
> >
> >  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
> >                    bit0: bpf memory
> >                    bit1: socket memory
> >                    bitN: a specific resource
> >
> > There won't be too many specific resources which have to own their
> > memcgs, so I think 32bits is enough.
> >
> >                 Memcg-A : memory.owner == 0x1
> >                        |---- Memcg-A1 : memory.owner == 0
> >                        |---- Memcg-A2 : memory.owner == 0x1
> >
> > Then the bpf created by processes in Memcg-A1 will be charged into
> > Memcg-A directly without charging into Memcg-A1.
> > But the bpf created by processes in Memcg-A2 will be charged into
> > Memcg-A2 as its memory.owner is 0x1.
> > That said, these specific resources are not fully independent of
> > process, while they are still associated with the processes which
> > create them.
> > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> > don't need to care about the possible migration issue.
> >
> > I think we may also apply it to shared page caches.  For example,
> >       struct inode {
> >           struct mem_cgroup *memcg;          <<<< add a new member
> >       };
> >
> > We define struct inode as a memcg owner, and use scope-based charge to
> > charge its pages into inode->memcg.
> > And then put all memcgs which shared these resources under the same
> > parent. The page caches of this inode will be charged into the parent
> > directly.
>
> Ok, so it's something like premature selective reparenting.
>

Right. I think it may be a good way to handle the resources which may
outlive the process.

> > The shared page cache is more complicated than bpf memory, so I'm not
> > quite sure if it can apply to shared page cache, but it can work well
> > for bpf memory.
>
> Yeah, this is the problem. It feels like it's a problem very specific
> to bpf maps and an exact way you use them. I don't think you can successfully
> advocate for changes of these calibre without a more generic problem. I might
> be wrong.
>

What is your concern about this method? Are there any potential issues?

> >
> > Regarding the observability, we can introduce a specific item into
> > memory.stat for this specific memory. For example a new item 'bpf' for
> > bpf memory.
> > That can be accounted/unaccounted for in the same way as we do in
> > set_active_memcg(). for example,
> >
> >     struct task_struct {
> >         struct mem_cgroup  *active_memcg;
> >         int                 active_memcg_item;   <<<< introduce a new member
> >     };
> >
> >     bpf_memory_alloc()
> >     {
> >              old_memcg = set_active_memcg(memcg);
> >              old_item = set_active_memcg_item(MEMCG_BPF);
>
> I thought about something like this but for a different purpose:
> to track the amount of memory consumed by bpf.
>

Right, we can use it to track bpf memory consumption.

> >              alloc();
> >              set_active_memcg_item(old_item);
> >              set_active_memcg(old_memcg);
> >     }
> >
> >     bpf_memory_free()
> >     {
> >              old = set_active_memcg_item(MEMCG_BPF);
> >              free();
> >              set_active_memcg_item(old);
> >     }
>
> But the problem is that we shoud very carefully mark all allocations and
> releases, which is very error-prone. Interfaces which don't require annotating
> releases are generally better, but require additional memory.
>

If we don't annotate the releases, we have to add something into
struct page, which may not be worth it.
It is clear how the bpf memory is allocated and freed, so I think we
can start with bpf memory.
If in the future we can figure out a lightweight way to avoid
annotating the releases, then we can remove the annotations from the
bpf memory release paths.
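
As a rough model of that alternative (all names here are made up for
illustration): record a tag next to each allocation, the moral
equivalent of putting something into struct page, so that the release
path needs no annotation at all, at the cost of extra memory per
object.

    #include <stdio.h>
    #include <stdlib.h>

    enum memcg_item { MEMCG_OTHER, MEMCG_BPF, MEMCG_NR_ITEMS };

    static long item_bytes[MEMCG_NR_ITEMS];     /* stand-in for memory.stat */

    struct tagged_hdr {                 /* per-allocation tag (extra memory) */
        enum memcg_item item;
        size_t size;
    };

    static void *tagged_alloc(size_t size, enum memcg_item item)
    {
        struct tagged_hdr *hdr = malloc(sizeof(*hdr) + size);

        if (!hdr)
            return NULL;
        hdr->item = item;
        hdr->size = size;
        item_bytes[item] += size;       /* charge at allocation time */
        return hdr + 1;
    }

    static void tagged_free(void *ptr)  /* no annotation needed here */
    {
        struct tagged_hdr *hdr = (struct tagged_hdr *)ptr - 1;

        item_bytes[hdr->item] -= hdr->size;
        free(hdr);
    }

    int main(void)
    {
        void *map = tagged_alloc(4096, MEMCG_BPF);

        if (!map)
            return 1;
        printf("bpf: %ld bytes\n", item_bytes[MEMCG_BPF]);
        tagged_free(map);
        printf("bpf: %ld bytes\n", item_bytes[MEMCG_BPF]);
        return 0;
    }

It trades a little extra memory per object for not having to annotate
every release site, which is exactly the trade-off being weighed here.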
Roman Gushchin Sept. 20, 2022, 11:15 p.m. UTC | #18
On Tue, Sep 20, 2022 at 08:42:36PM +0800, Yafang Shao wrote:
> On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> > > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> > > <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > >
> > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > > > ...
> > > > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > > > >
> > > > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > > > >
> > > > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > > > >
> > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > > > thread and continue there?
> > > > > > > >
> > > > > > >
> > > > > > > Hi Roman,
> > > > > > >
> > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > > > bpf-specific problem in a bpf-specific way.
> > > > > > > >
> > > > > > >
> > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > > > issue after so many discussions?
> > > > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > > > caches or slabs ?
> > > > > > > No, you don't. You charge it in a bpf-specific way.
> > > > > >
> > > > >
> > > > > Hi Roman,
> > > > >
> > > > > Sorry for the late response.
> > > > > I've been on vacation in the past few days.
> > > > >
> > > > > > The only difference is that we charge the cgroup of the processes who
> > > > > > created a map, not a process who is doing a specific allocation.
> > > > >
> > > > > This means the bpf-map can be indepent of process, IOW, the memcg of
> > > > > bpf-map can be indepent of the memcg of the processes.
> > > > > This is the fundamental difference between bpf-map and page caches, then...
> > > > >
> > > > > > Your patchset doesn't change this.
> > > > >
> > > > > We can make this behavior reasonable by introducing an independent
> > > > > memcg, as what I did in the previous version.
> > > > >
> > > > > > There are pros and cons with this approach, we've discussed it back
> > > > > > to the times when bpf memcg accounting was developed. If you want
> > > > > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > > > > new motivation appears), but I haven't seen any complaints yet except from you.
> > > > > >
> > > > >
> > > > > memcg-base bpf accounting is a new feature, which may not be used widely.
> > > > >
> > > > > > >
> > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > > > > by placing the task which creates the map into the desired cgroup.
> > > > > > >
> > > > > > > Are you serious ?
> > > > > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > > > > Internal Process Constraint".[1]
> > > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > > > >
> > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > > > and your memcg will get reparented. You can attach this process and create
> > > > > > a bpf map to the parent cgroup before it gets child cgroups.
> > > > >
> > > > > If the process doesn't exit after it created bpf-map, we have to
> > > > > migrate it around memcgs....
> > > > > The complexity in deployment can introduce unexpected issues easily.
> > > > >
> > > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > > > > Lof of options.
> > > > > >
> > > > > > >
> > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > > > >
> > > > > > > > Beatiful? Not. Neither is the proposed solution.
> > > > > > > >
> > > > > > >
> > > > > > > Is it really hard to admit a fault?
> > > > > >
> > > > > > Yafang, you posted several versions and so far I haven't seen much of support
> > > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > > > > nacking a patchset with many acks, reviews and supporters.
> > > > > >
> > > > > > Still think you're solving an important problem in a reasonable way?
> > > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > > > of blaming me.
> > > > > >
> > > > >
> > > > > The best way so far is to introduce specific memcg for specific resources.
> > > > > Because not only the process owns its memcg, but also specific
> > > > > resources own their memcgs, for example bpf-map, or socket.
> > > > >
> > > > > struct bpf_map {                                 <<<< memcg owner
> > > > >     struct memcg_cgroup *memcg;
> > > > > };
> > > > >
> > > > > struct sock {                                       <<<< memcg owner
> > > > >     struct mem_cgroup *sk_memcg;
> > > > > };
> > > > >
> > > > > These resources already have their own memcgs, so we should make this
> > > > > behavior formal.
> > > > >
> > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> > > >
> > > > This is a fundamental change: cgroups were always hierarchical groups
> > > > of processes/threads. You're basically suggesting to extend it to
> > > > hierarchical groups of processes and some other objects (what's a good
> > > > definition?).
> > >
> > > Kind of, but not exactly.
> > > We can do it without breaking the cgroup hierarchy. Under current
> > > cgroup hierarchy, the user can only echo processes/threads into a
> > > cgroup, that won't be changed in the future. The specific resources
> > > are not exposed to the user, the user can only control these specific
> > > resources by controlling their associated processes/threads.
> > > For example,
> > >
> > >                 Memcg-A
> > >                        |---- Memcg-A1
> > >                        |---- Memcg-A2
> > >
> > > We can introduce a new file memory.owner into each memcg. Each bit of
> > > memory.owner represents a specific resources,
> > >
> > >  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
> > >                    bit0: bpf memory
> > >                    bit1: socket memory
> > >                    bitN: a specific resource
> > >
> > > There won't be too many specific resources which have to own their
> > > memcgs, so I think 32bits is enough.
> > >
> > >                 Memcg-A : memory.owner == 0x1
> > >                        |---- Memcg-A1 : memory.owner == 0
> > >                        |---- Memcg-A2 : memory.owner == 0x1
> > >
> > > Then the bpf created by processes in Memcg-A1 will be charged into
> > > Memcg-A directly without charging into Memcg-A1.
> > > But the bpf created by processes in Memcg-A2 will be charged into
> > > Memcg-A2 as its memory.owner is 0x1.
> > > That said, these specific resources are not fully independent of
> > > process, while they are still associated with the processes which
> > > create them.
> > > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> > > don't need to care about the possible migration issue.
> > >
> > > I think we may also apply it to shared page caches.  For example,
> > >       struct inode {
> > >           struct mem_cgroup *memcg;          <<<< add a new member
> > >       };
> > >
> > > We define struct inode as a memcg owner, and use scope-based charge to
> > > charge its pages into inode->memcg.
> > > And then put all memcgs which shared these resources under the same
> > > parent. The page caches of this inode will be charged into the parent
> > > directly.
> >
> > Ok, so it's something like premature selective reparenting.
> >
> 
> Right. I think it  may be a good way to handle the resources which may
> outlive the process.
> 
> > > The shared page cache is more complicated than bpf memory, so I'm not
> > > quite sure if it can apply to shared page cache, but it can work well
> > > for bpf memory.
> >
> > Yeah, this is the problem. It feels like it's a problem very specific
> > to bpf maps and an exact way you use them. I don't think you can successfully
> > advocate for changes of these calibre without a more generic problem. I might
> > be wrong.
> >
> 
> What is your concern about this method? Are there any potential issues?

The issue is simple: nobody wants to support a new non-trivial cgroup interface
to solve a specific bpf accounting issue in one particular setup. Any new
interface will become an API and has to be supported for many many years,
so it has to be generic and future-proof.

If you want to go in this direction, please show that it solves a _generic_
problem, not one limited to the specific way you use bpf maps in your
setup. Accounting of a bpf map shared by many cgroups, which should outlive
the original memory cgroups... Idk, maybe that's how many users are using bpf
maps, but I haven't heard it yet.

There were some patches from Google folks about the tmpfs accounting, _maybe_
it's something to look at in order to get an idea about a more generic problem
and solution.

Thanks!
Yafang Shao Sept. 21, 2022, 9:36 a.m. UTC | #19
On Wed, Sep 21, 2022 at 7:15 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Tue, Sep 20, 2022 at 08:42:36PM +0800, Yafang Shao wrote:
> > On Tue, Sep 20, 2022 at 10:40 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Sun, Sep 18, 2022 at 11:44:48AM +0800, Yafang Shao wrote:
> > > > On Sat, Sep 17, 2022 at 12:53 AM Roman Gushchin
> > > > <roman.gushchin@linux.dev> wrote:
> > > > >
> > > > > On Tue, Sep 13, 2022 at 02:15:20PM +0800, Yafang Shao wrote:
> > > > > > On Fri, Sep 9, 2022 at 12:13 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > >
> > > > > > > On Thu, Sep 08, 2022 at 10:37:02AM +0800, Yafang Shao wrote:
> > > > > > > > On Thu, Sep 8, 2022 at 6:29 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Sep 07, 2022 at 05:43:31AM -1000, Tejun Heo wrote:
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > On Fri, Sep 02, 2022 at 02:29:50AM +0000, Yafang Shao wrote:
> > > > > > > > > > ...
> > > > > > > > > > > This patchset tries to resolve the above two issues by introducing a
> > > > > > > > > > > selectable memcg to limit the bpf memory. Currently we only allow to
> > > > > > > > > > > select its ancestor to avoid breaking the memcg hierarchy further.
> > > > > > > > > > > Possible use cases of the selectable memcg as follows,
> > > > > > > > > >
> > > > > > > > > > As discussed in the following thread, there are clear downsides to an
> > > > > > > > > > interface which requires the users to specify the cgroups directly.
> > > > > > > > > >
> > > > > > > > > >  https://lkml.kernel.org/r/YwNold0GMOappUxc@slm.duckdns.org
> > > > > > > > > >
> > > > > > > > > > So, I don't really think this is an interface we wanna go for. I was hoping
> > > > > > > > > > to hear more from memcg folks in the above thread. Maybe ping them in that
> > > > > > > > > > thread and continue there?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Hi Roman,
> > > > > > > >
> > > > > > > > > As I said previously, I don't like it, because it's an attempt to solve a non
> > > > > > > > > bpf-specific problem in a bpf-specific way.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Why do you still insist that bpf_map->memcg is not a bpf-specific
> > > > > > > > issue after so many discussions?
> > > > > > > > Do you charge the bpf-map's memory the same way as you charge the page
> > > > > > > > caches or slabs ?
> > > > > > > > No, you don't. You charge it in a bpf-specific way.
> > > > > > >
> > > > > >
> > > > > > Hi Roman,
> > > > > >
> > > > > > Sorry for the late response.
> > > > > > I've been on vacation in the past few days.
> > > > > >
> > > > > > > The only difference is that we charge the cgroup of the processes who
> > > > > > > created a map, not a process who is doing a specific allocation.
> > > > > >
> > > > > > This means the bpf-map can be indepent of process, IOW, the memcg of
> > > > > > bpf-map can be indepent of the memcg of the processes.
> > > > > > This is the fundamental difference between bpf-map and page caches, then...
> > > > > >
> > > > > > > Your patchset doesn't change this.
> > > > > >
> > > > > > We can make this behavior reasonable by introducing an independent
> > > > > > memcg, as what I did in the previous version.
> > > > > >
> > > > > > > There are pros and cons with this approach, we've discussed it back
> > > > > > > to the times when bpf memcg accounting was developed. If you want
> > > > > > > to revisit this, it's maybe possible (given there is a really strong and likely
> > > > > > > new motivation appears), but I haven't seen any complaints yet except from you.
> > > > > > >
> > > > > >
> > > > > > memcg-base bpf accounting is a new feature, which may not be used widely.
> > > > > >
> > > > > > > >
> > > > > > > > > Yes, memory cgroups are not great for accounting of shared resources, it's well
> > > > > > > > > known. This patchset looks like an attempt to "fix" it specifically for bpf maps
> > > > > > > > > in a particular cgroup setup. Honestly, I don't think it's worth the added
> > > > > > > > > complexity. Especially because a similar behaviour can be achieved simple
> > > > > > > > > by placing the task which creates the map into the desired cgroup.
> > > > > > > >
> > > > > > > > Are you serious ?
> > > > > > > > Have you ever read the cgroup doc? Which clearly describe the "No
> > > > > > > > Internal Process Constraint".[1]
> > > > > > > > Obviously you can't place the task in the desired cgroup, i.e. the parent memcg.
> > > > > > >
> > > > > > > But you can place it into another leaf cgroup. You can delete this leaf cgroup
> > > > > > > and your memcg will get reparented. You can attach this process and create
> > > > > > > a bpf map to the parent cgroup before it gets child cgroups.
> > > > > >
> > > > > > If the process doesn't exit after it created bpf-map, we have to
> > > > > > migrate it around memcgs....
> > > > > > The complexity in deployment can introduce unexpected issues easily.
> > > > > >
> > > > > > > You can revisit the idea of shared bpf maps and outlive specific cgroups.
> > > > > > > Lof of options.
> > > > > > >
> > > > > > > >
> > > > > > > > [1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > > > > > >
> > > > > > > > > Beatiful? Not. Neither is the proposed solution.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Is it really hard to admit a fault?
> > > > > > >
> > > > > > > Yafang, you posted several versions and so far I haven't seen much of support
> > > > > > > or excitement from anyone (please, fix me if I'm wrong). It's not like I'm
> > > > > > > nacking a patchset with many acks, reviews and supporters.
> > > > > > >
> > > > > > > Still think you're solving an important problem in a reasonable way?
> > > > > > > It seems like not many are convinced yet. I'd recommend to focus on this instead
> > > > > > > of blaming me.
> > > > > > >
> > > > > >
> > > > > > The best way so far is to introduce specific memcg for specific resources.
> > > > > > Because not only the process owns its memcg, but also specific
> > > > > > resources own their memcgs, for example bpf-map, or socket.
> > > > > >
> > > > > > struct bpf_map {                                 <<<< memcg owner
> > > > > >     struct memcg_cgroup *memcg;
> > > > > > };
> > > > > >
> > > > > > struct sock {                                       <<<< memcg owner
> > > > > >     struct mem_cgroup *sk_memcg;
> > > > > > };
> > > > > >
> > > > > > These resources already have their own memcgs, so we should make this
> > > > > > behavior formal.
> > > > > >
> > > > > > The selectable memcg is just a variant of 'echo ${proc} > cgroup.procs'.
> > > > >
> > > > > This is a fundamental change: cgroups were always hierarchical groups
> > > > > of processes/threads. You're basically suggesting to extend it to
> > > > > hierarchical groups of processes and some other objects (what's a good
> > > > > definition?).
> > > >
> > > > Kind of, but not exactly.
> > > > We can do it without breaking the cgroup hierarchy. Under current
> > > > cgroup hierarchy, the user can only echo processes/threads into a
> > > > cgroup, that won't be changed in the future. The specific resources
> > > > are not exposed to the user, the user can only control these specific
> > > > resources by controlling their associated processes/threads.
> > > > For example,
> > > >
> > > >                 Memcg-A
> > > >                        |---- Memcg-A1
> > > >                        |---- Memcg-A2
> > > >
> > > > We can introduce a new file memory.owner into each memcg. Each bit of
> > > > memory.owner represents a specific resources,
> > > >
> > > >  memory.owner: | bit31 | bitN | ... | bit1 | bit0 |
> > > >                    bit0: bpf memory
> > > >                    bit1: socket memory
> > > >                    bitN: a specific resource
> > > >
> > > > There won't be too many specific resources which have to own their
> > > > memcgs, so I think 32bits is enough.
> > > >
> > > >                 Memcg-A : memory.owner == 0x1
> > > >                        |---- Memcg-A1 : memory.owner == 0
> > > >                        |---- Memcg-A2 : memory.owner == 0x1
> > > >
> > > > Then the bpf created by processes in Memcg-A1 will be charged into
> > > > Memcg-A directly without charging into Memcg-A1.
> > > > But the bpf created by processes in Memcg-A2 will be charged into
> > > > Memcg-A2 as its memory.owner is 0x1.
> > > > That said, these specific resources are not fully independent of
> > > > process, while they are still associated with the processes which
> > > > create them.
> > > > Luckily memory.move_charge_at_immigrate is disabled in cgroup2, so we
> > > > don't need to care about the possible migration issue.
> > > >
> > > > I think we may also apply it to shared page caches.  For example,
> > > >       struct inode {
> > > >           struct mem_cgroup *memcg;          <<<< add a new member
> > > >       };
> > > >
> > > > We define struct inode as a memcg owner, and use scope-based charge to
> > > > charge its pages into inode->memcg.
> > > > And then put all memcgs which shared these resources under the same
> > > > parent. The page caches of this inode will be charged into the parent
> > > > directly.
> > >
> > > Ok, so it's something like premature selective reparenting.
> > >
> >
> > Right. I think it  may be a good way to handle the resources which may
> > outlive the process.
> >
> > > > The shared page cache is more complicated than bpf memory, so I'm not
> > > > quite sure if it can apply to shared page cache, but it can work well
> > > > for bpf memory.
> > >
> > > Yeah, this is the problem. It feels like it's a problem very specific
> > > to bpf maps and an exact way you use them. I don't think you can successfully
> > > advocate for changes of these calibre without a more generic problem. I might
> > > be wrong.
> > >
> >
> > What is your concern about this method? Are there any potential issues?
>
> The issue is simple: nobody wants to support a new non-trivial cgroup interface
> to solve a specific bpf accounting issue in one particular setup. Any new
> interface will become an API and has to be supported for many many years,
> so it has to be generic and future-proof.
>
> If you want to go this direction, please, show that it solves a _generic_
> problem, not limited to a specific way how you use bpf maps in your specific
> setup. Accounting of a bpf map shared by many cgroups, which should outlive
> the original memory cgroups... Idk, maybe it's how many users are using bpf
> maps, but I don't hear it yet.
>
> There were some patches from Google folks about the tmpfs accounting, _maybe_
> it's something to look at in order to get an idea about a more generic problem
> and solution.
>

Hmm...
It seems that we are in a dilemma now.
We can't fix it in a memcg way, because the issue we are fixing is a
bpf-specific issue.
But we can't fix it in a bpf-specific way either...