Message ID | 20230903142800.3870-1-laoar.shao@gmail.com (mailing list archive)
---|---
Series | bpf, cgroup: Enable cgroup_array map on cgroup1
Hello Yafang.

On Sun, Sep 03, 2023 at 02:27:55PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> In our specific use case, we intend to use bpf_current_under_cgroup() to
> audit whether the current task resides within specific containers.

I wonder -- how does this work in practice?

If it's a systemd hybrid setup, you can get the information from the
unified hierarchy, which represents the container membership.

If it's a setup without the unified hierarchy, you have to pick one
hierarchy as a representation of the membership. Which one will it be?

> Subsequently, we can use this information to create distinct ACLs within
> our LSM BPF programs, enabling us to control specific operations performed
> by these tasks.

If one were serious about container-based ACLs, it'd be best to have a
dedicated and maintained hierarchy for this (I mean a named v1 hierarchy).
But your implementation omits this, which hints to me that this scenario
may already be better covered by querying the unified hierarchy.

> Considering the widespread use of cgroup1 in container environments,
> coupled with the considerable time it will take to transition to cgroup2,
> implementing this change will significantly enhance the utility of BPF
> in container scenarios.

If a change like this is not accepted, will it make the transition period
shorter? (As written above, the unified hierarchy seems a better fit for
your use case.)

Thanks,
Michal
On Thu, Sep 7, 2023 at 10:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Yafang.
>
> On Sun, Sep 03, 2023 at 02:27:55PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> > In our specific use case, we intend to use bpf_current_under_cgroup() to
> > audit whether the current task resides within specific containers.
>
> I wonder -- how does this work in practice?

In our practice, the cgroup_array map serves as a shared map used by both
our LSM programs and the target pods, as follows:

    ----------------
    |  target pod  |
    ----------------
           |
           |
           V                            -------------
    /sys/fs/bpf/cgroup_array   <---     | LSM progs |
                                        -------------

Within the target pods, we employ a script to update the pod's cgroup file
descriptor into the cgroup_array, for instance:

    cgrp_fd = open("/sys/fs/cgroup/cpu");
    cgrp_map_fd = bpf_obj_get("/sys/fs/bpf/cgroup_array");
    bpf_map_update_elem(cgrp_map_fd, &app_idx, &cgrp_fd, 0);

Next, we validate the contents of the cgroup_array within our LSM programs,
as follows:

    if (!bpf_current_task_under_cgroup(&cgroup_array, app_idx))
        return -1;

Within our Kubernetes deployment system, we inject this script into the
target pods only if specific annotations, such as "bpf_audit", are present.
Consequently, users do not need to modify their code manually; this process
is handled automatically.

Within our Kubernetes environment, there is only a single instance of these
target pods on each host, so we can conveniently use the array index as the
application ID. In scenarios where multiple instances run on a single host,
however, you need to manage the mapping of instances to array indexes
yourself. For cases with multiple instances, a cgroup_hash may be a more
suitable approach, although that is a separate discussion altogether.

> If it's a systemd hybrid setup, you can get the information from the
> unified hierarchy, which represents the container membership.
>
> If it's a setup without the unified hierarchy, you have to pick one
> hierarchy as a representation of the membership. Which one will it be?

We use the CPU subsystem, and all of our pods have this cgroup subsystem
enabled.

> > Subsequently, we can use this information to create distinct ACLs within
> > our LSM BPF programs, enabling us to control specific operations performed
> > by these tasks.
>
> If one were serious about container-based ACLs, it'd be best to have a
> dedicated and maintained hierarchy for this (I mean a named v1 hierarchy).
> But your implementation omits this, which hints to me that this scenario
> may already be better covered by querying the unified hierarchy.
>
> > Considering the widespread use of cgroup1 in container environments,
> > coupled with the considerable time it will take to transition to cgroup2,
> > implementing this change will significantly enhance the utility of BPF
> > in container scenarios.
>
> If a change like this is not accepted, will it make the transition
> period shorter? (As written above, the unified hierarchy seems a better
> fit for your use case.)

If that change is not accepted upstream, we will need to maintain it
independently in our local kernel :(
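To make the two fragments above concrete, the following is a minimal sketch of the BPF side: a pinned BPF_MAP_TYPE_CGROUP_ARRAY map plus an LSM hook that consults it, mirroring the check quoted above. The map size, the pin-by-name setup, the file_open hook, and the fixed app_idx of 0 are illustrative assumptions, not details from the posted series.

    // SPDX-License-Identifier: GPL-2.0
    /* Sketch only: a pinned cgroup array consulted from an LSM hook.
     * Hook choice, map size and the fixed index are assumptions.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
            __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
            __uint(max_entries, 16);
            __uint(key_size, sizeof(__u32));
            __uint(value_size, sizeof(__u32));
            __uint(pinning, LIBBPF_PIN_BY_NAME);  /* pinned as /sys/fs/bpf/cgroup_array */
    } cgroup_array SEC(".maps");

    SEC("lsm/file_open")
    int BPF_PROG(audit_file_open, struct file *file)
    {
            __u32 app_idx = 0;  /* slot filled by the in-pod updater script */

            /* bpf_current_task_under_cgroup() returns 1 when the current task
             * is inside the cgroup stored at app_idx (0 or a negative error
             * otherwise); mirror the check from the mail above and deny the
             * operation for everything else.
             */
            if (!bpf_current_task_under_cgroup(&cgroup_array, app_idx))
                    return -1;  /* -EPERM */

            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

The updater script quoted above would then populate slot 0 of the pinned map with the pod's cgroup fd, exactly as in the userspace snippet.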
On Thu, Sep 7, 2023 at 7:54 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> In our practice, the cgroup_array map serves as a shared map used by both
> our LSM programs and the target pods, as follows:
>
> [...]
>
> Within the target pods, we employ a script to update the pod's cgroup file
> descriptor into the cgroup_array, for instance:
>
>     cgrp_fd = open("/sys/fs/cgroup/cpu");
>     cgrp_map_fd = bpf_obj_get("/sys/fs/bpf/cgroup_array");
>     bpf_map_update_elem(cgrp_map_fd, &app_idx, &cgrp_fd, 0);
>
> Next, we validate the contents of the cgroup_array within our LSM programs,
> as follows:
>
>     if (!bpf_current_task_under_cgroup(&cgroup_array, app_idx))
>         return -1;
>
> [...]

Is there a reason you cannot use bpf_get_current_cgroup_id() to associate
the task with its cgroup in your LSM prog?
On Sat, Sep 9, 2023 at 2:09 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> Is there a reason you cannot use bpf_get_current_cgroup_id() to associate
> the task with its cgroup in your LSM prog?

Using the cgroup_id as the key serves as a temporary workaround;
nevertheless, employing bpf_get_current_cgroup_id() is impractical because
it only supports cgroup2.

To acquire the cgroup_id, we can resort to open coding, as exemplified below:

    task = bpf_get_current_task_btf();
    cgroups = task->cgroups;
    cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
    key = cgroup->kn->id;

Nonetheless, creating an open-coded version of
bpf_get_current_ancestor_cgroup_id() is unfeasible, since the BPF verifier
prohibits access to "cgrp->ancestors[ancestor_level]".

--
Regards
Yafang
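For reference, the open-coded walk above can be expressed with a CO-RE read in a tracing or LSM program along the following lines. This is only a sketch: the policy hash map, its value layout, and the hook are assumptions, and cpu_cgrp_id comes from the kernel's cgroup_subsys_id enum in vmlinux.h (present when CONFIG_CGROUP_SCHED is enabled).

    // SPDX-License-Identifier: GPL-2.0
    /* Sketch only: look up the current task's cgroup1 (cpu controller) id
     * and use it as a key into an assumed per-container policy map.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_tracing.h>

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u64);     /* cgroup kernfs node id */
            __type(value, __u32);   /* per-container policy flags (assumed layout) */
    } policy_by_cgrp_id SEC(".maps");

    SEC("lsm/file_open")
    int BPF_PROG(lookup_by_cgrp1_id, struct file *file)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            __u32 *policy;
            __u64 key;

            /* Open-coded stand-in for bpf_get_current_cgroup_id() on the
             * cgroup1 cpu hierarchy, as described in the mail above.
             */
            key = BPF_CORE_READ(task, cgroups, subsys[cpu_cgrp_id], cgroup, kn, id);

            policy = bpf_map_lookup_elem(&policy_by_cgrp_id, &key);
            if (!policy)
                    return 0;       /* container not in the policy table */

            /* ... apply *policy here ... */
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";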
On Sat, Sep 9, 2023 at 8:18 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>
> Using the cgroup_id as the key serves as a temporary workaround;
> nevertheless, employing bpf_get_current_cgroup_id() is impractical because
> it only supports cgroup2.
>
> To acquire the cgroup_id, we can resort to open coding, as exemplified below:
>
>     task = bpf_get_current_task_btf();
>     cgroups = task->cgroups;
>     cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
>     key = cgroup->kn->id;
>
> Nonetheless, creating an open-coded version of
> bpf_get_current_ancestor_cgroup_id() is unfeasible, since the BPF verifier
> prohibits access to "cgrp->ancestors[ancestor_level]".

Both helpers can be extended to support v1, or not? I mean, can a task be
part of a v1 and a v2 hierarchy at the same time? If not, then
bpf_get_current_cgroup_id() can fall back to what you are describing above
and return that cgroup_id. The same would apply to
bpf_get_current_ancestor_cgroup_id(). Otherwise, two new kfuncs for v1
could be another option.

prog_array for cgroups is an old design. We can and should provide a more
flexible interface nowadays.
On Sun, Sep 10, 2023 at 11:17:48AM +0800, Yafang Shao wrote:
> To acquire the cgroup_id, we can resort to open coding, as exemplified below:
>
>     task = bpf_get_current_task_btf();
>     cgroups = task->cgroups;
>     cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
>     key = cgroup->kn->id;

You can't hardcode it to a specific controller tree like that. You either
stick with the fd-based interface or also need to add something to identify
the specific cgroup1 tree.

Thanks.
On Tue, Sep 12, 2023 at 4:24 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Sun, Sep 10, 2023 at 11:17:48AM +0800, Yafang Shao wrote:
> > To acquire the cgroup_id, we can resort to open coding, as exemplified below:
> >
> >     task = bpf_get_current_task_btf();
> >     cgroups = task->cgroups;
> >     cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
> >     key = cgroup->kn->id;
>
> You can't hardcode it to a specific controller tree like that. You either
> stick with the fd-based interface or also need to add something to identify
> the specific cgroup1 tree.

As pointed out by Alexei, I think we can introduce some cgroup_id-based
kfuncs which can work for both cgroup1 and cgroup2. Something as follows
(untested):

    __bpf_kfunc u64 bpf_current_cgroup_id_from_subsys(int subsys)
    {
            struct cgroup *cgroup;

            cgroup = task_cgroup(current, subsys);
            if (!cgroup)
                    return 0;
            return cgroup_id(cgroup);
    }

    __bpf_kfunc struct cgroup *bpf_cgroup_from_subsys_id(u64 cgid, int subsys)
    {
            struct cgroup_subsys_state *css = init_css_set.subsys[subsys];
            struct cgroup *subsys_root = css->cgroup;

            /* We would need to introduce a new helper,
             * cgroup_get_from_subsys_id(), in the cgroup subsystem.
             */
            return cgroup_get_from_subsys_id(subsys_root, cgid);
    }

And change task_under_cgroup_hierarchy() as follows:

 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
                                                struct cgroup *ancestor)
 {
        struct css_set *cset = task_css_set(task);
-
-       return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
+       struct cgroup *cgrp;
+       bool ret = false;
+
+       if (ancestor->root == &cgrp_dfl_root)
+               return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
+
+       cgroup_lock();
+       spin_lock_irq(&css_set_lock);
+       cgrp = task_cgroup_from_root(task, ancestor->root);
+       if (cgrp && cgroup_is_descendant(cgrp, ancestor))
+               ret = true;
+       spin_unlock_irq(&css_set_lock);
+       cgroup_unlock();
+       return ret;
 }

With the above changes, I think it can meet most use cases for BPF on
cgroup1. What do you think?

--
Regards
Yafang
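If kfuncs along these lines were merged, a BPF program would declare and call the id-based one roughly as shown below. This is purely illustrative: the kfunc does not exist upstream, and the ksym declaration, the hook, and the loader-set target_cgid are assumptions.

    /* Sketch only: hypothetical consumer of the kfunc proposed above
     * (which does not exist upstream today).
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    extern u64 bpf_current_cgroup_id_from_subsys(int subsys) __ksym;

    const volatile __u64 target_cgid;   /* set by the loader before the program is loaded */

    SEC("lsm/file_open")
    int BPF_PROG(check_open, struct file *file)
    {
            /* Which cgroup1 (cpu controller) group does the current task sit in? */
            __u64 cgid = bpf_current_cgroup_id_from_subsys(cpu_cgrp_id);

            /* Restrict the operation for tasks inside the target cgroup1 group. */
            if (cgid && cgid == target_cgid)
                    return -1;  /* -EPERM */

            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";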
Hello.

On Tue, Sep 12, 2023 at 11:30:32AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> With the above changes, I think it can meet most use cases for BPF on
> cgroup1. What do you think?

I think the presented use case of LSM hooks is better served by the default
hierarchy (see also [1]).

Relying on a chosen subsys v1 hierarchy is not systematic. And extending
ancestry checking to named v1 hierarchies seems backwards given the
existence of the default hierarchy.

Michal

[1] https://docs.kernel.org/admin-guide/cgroup-v2.html#delegation-containment
On Fri, Sep 15, 2023 at 07:01:28PM +0200, Michal Koutný wrote:
> Hello.
>
> On Tue, Sep 12, 2023 at 11:30:32AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > With the above changes, I think it can meet most use cases for BPF on
> > cgroup1. What do you think?
>
> I think the presented use case of LSM hooks is better served by the default
> hierarchy (see also [1]).
>
> Relying on a chosen subsys v1 hierarchy is not systematic. And extending
> ancestry checking to named v1 hierarchies seems backwards given the
> existence of the default hierarchy.

Yeah, identifying cgroup1 hierarchies by subsys leaves out a pretty good
chunk of use cases - e.g. systemd used to use a named hierarchy for primary
process organization on cgroup1.

Also, you don't have to switch to cgroup2 wholesale. You can just build the
same hierarchy in cgroup2 for process organization and combine that with any
cgroup1 hierarchies.

Thanks.
On Mon, Sep 11, 2023 at 8:31 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>
> As pointed out by Alexei, I think we can introduce some cgroup_id-based
> kfuncs which can work for both cgroup1 and cgroup2. Something as follows
> (untested):
>
>     __bpf_kfunc u64 bpf_current_cgroup_id_from_subsys(int subsys)
>     {
>             struct cgroup *cgroup;
>
>             cgroup = task_cgroup(current, subsys);
>             if (!cgroup)
>                     return 0;
>             return cgroup_id(cgroup);
>     }

Can we also support checking arbitrary tasks, instead of just current? I
find myself often needing to find the cgroup given only a task_struct. For
example, when attaching to context switch, I want to know whether the next
task is under a cgroup. Having such a kfunc would be very useful. It could
also be used in task_iter programs.

Hao
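A task-parameterized variant along the lines Hao asks for could look roughly like the kernel-side sketch below; the function name, the RCU handling, and returning 0 for a missing cgroup are assumptions rather than anything posted in this thread.

    /* Sketch only: a task-based variant of the kfunc proposed earlier in
     * the thread; name and locking choices are assumptions.
     */
    __bpf_kfunc u64 bpf_task_cgroup_id_from_subsys(struct task_struct *task, int subsys)
    {
            struct cgroup *cgroup;
            u64 id = 0;

            rcu_read_lock();
            cgroup = task_cgroup(task, subsys);
            if (cgroup)
                    id = cgroup_id(cgroup);
            rcu_read_unlock();

            return id;
    }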
On Sat, Sep 16, 2023 at 1:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Tue, Sep 12, 2023 at 11:30:32AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > With the above changes, I think it can meet most use cases for BPF on
> > cgroup1. What do you think?
>
> I think the presented use case of LSM hooks is better served by the default
> hierarchy (see also [1]).

Hi Michal,

The crucial issue at hand is not whether the LSM hooks are better suited to
the cgroup default hierarchy. What truly matters is the effort and time
required to migrate all cgroup1-based applications to cgroup2-based ones.
While transitioning a single component from cgroup1 to cgroup2 is a
straightforward task, the complexity arises when multiple interdependent
components in a production environment must make this transition together.
In such cases, the work becomes significantly more challenging.

> Relying on a chosen subsys v1 hierarchy is not systematic. And extending
> ancestry checking to named v1 hierarchies seems backwards given the
> existence of the default hierarchy.

A cgroup becomes active only when it has one or more of its controllers
enabled. In a production environment, a task is invariably governed by at
least one cgroup controller; even in hybrid cgroup mode, a task is subject
to either a cgroup1 controller or a cgroup2 controller. Our objective is to
enhance BPF support for controller-based scenarios, eliminating the need to
concern ourselves with hierarchies, whether they involve cgroup1 or cgroup2.
This change seems quite reasonable, in my opinion.

> [1] https://docs.kernel.org/admin-guide/cgroup-v2.html#delegation-containment

--
Regards
Yafang
On Sat, Sep 16, 2023 at 1:31 AM Tejun Heo <tj@kernel.org> wrote:
>
> [...]
>
> Yeah, identifying cgroup1 hierarchies by subsys leaves out a pretty good
> chunk of use cases - e.g. systemd used to use a named hierarchy for primary
> process organization on cgroup1.

Systemd-managed tasks invariably have one or more cgroup controllers
enabled, as exemplified by entries like
"/sys/fs/cgroup/cpu/{system.slice, user.slice, XXX.service}". Consequently,
the presence of a cgroup controller can be used as an indicator to identify
a systemd-managed task.

> Also, you don't have to switch to cgroup2 wholesale. You can just build the
> same hierarchy in cgroup2 for process organization and combine that with any
> cgroup1 hierarchies.

The challenge lies in the need to adapt a multitude of applications to such
a layout, and furthermore, not all of these applications are under our
direct management. This makes it a formidable task.
On Sat, Sep 16, 2023 at 2:57 AM Hao Luo <haoluo@google.com> wrote:
>
> [...]
>
> Can we also support checking arbitrary tasks, instead of just current? I
> find myself often needing to find the cgroup given only a task_struct. For
> example, when attaching to context switch, I want to know whether the next
> task is under a cgroup. Having such a kfunc would be very useful. It could
> also be used in task_iter programs.

Agreed. Will do.
On Sun, Sep 17, 2023 at 03:19:06PM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> The crucial issue at hand is not whether the LSM hooks are better suited to
> the cgroup default hierarchy. What truly matters is the effort and time
> required to migrate all cgroup1-based applications to cgroup2-based ones.
> While transitioning a single component from cgroup1 to cgroup2 is a
> straightforward task, the complexity arises when multiple interdependent
> components in a production environment must make this transition together.
> In such cases, the work becomes significantly more challenging.

systemd's hybrid mode is the approach that helps such combined environments.
(I understand that it's not warranted with all container runtimes, but FYI.)
On v1-only deployments, BPF predicates couldn't be used at all currently.

The transition is transitional, but the accompanying complexity in the code
would have to be kept around much longer.

> Our objective is to enhance BPF support for controller-based scenarios,
> eliminating the need to concern ourselves with hierarchies, whether they
> involve cgroup1 or cgroup2.

I'm posting some notes on this to the 1st patch.

Regards,
Michal