Message ID | 20230903142800.3870-1-laoar.shao@gmail.com (mailing list archive)
---|---
Series | bpf, cgroup: Enable cgroup_array map on cgroup1
Hello Yafang.

On Sun, Sep 03, 2023 at 02:27:55PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> In our specific use case, we intend to use bpf_current_under_cgroup() to
> audit whether the current task resides within specific containers.

I wonder -- how does this work in practice?

If it's a systemd hybrid setup, you can get the information from the
unified hierarchy, which represents the container membership.

If it's a setup without the unified hierarchy, you have to pick one
hierarchy as a representation of the membership. Which one will it be?

> Subsequently, we can use this information to create distinct ACLs within
> our LSM BPF programs, enabling us to control specific operations performed
> by these tasks.

If one were serious about container-based ACLs, it'd be best to have a
dedicated and maintained hierarchy for this (I mean a named v1 hierarchy).
But your implementation omits this, which hints to me that this scenario
may already be better covered by querying the unified hierarchy.

> Considering the widespread use of cgroup1 in container environments,
> coupled with the considerable time it will take to transition to cgroup2,
> implementing this change will significantly enhance the utility of BPF
> in container scenarios.

If a change like this is not accepted, will it make the transition period
shorter? (As written above, the unified hierarchy seems a better fit for
your use case.)

Thanks,
Michal
On Thu, Sep 7, 2023 at 10:41 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello Yafang.
>
> On Sun, Sep 03, 2023 at 02:27:55PM +0000, Yafang Shao <laoar.shao@gmail.com> wrote:
> > In our specific use case, we intend to use bpf_current_under_cgroup() to
> > audit whether the current task resides within specific containers.
>
> I wonder -- how does this work in practice?

In our practice, the cgroup_array map serves as a shared map used by both
our LSM programs and the target pods, as follows:

    ----------------
    |  target pod  |
    ----------------
           |
           |
           V                            -------------
    /sys/fs/bpf/cgroup_array   <---     | LSM progs |
                                        -------------

Within the target pods, we employ a script to update the pod's cgroup file
descriptor into the cgroup_array, for instance:

    cgrp_fd = open("/sys/fs/cgroup/cpu");
    cgrp_map_fd = bpf_obj_get("/sys/fs/bpf/cgroup_array");
    bpf_map_update_elem(cgrp_map_fd, &app_idx, &cgrp_fd, 0);

Next, we validate the contents of the cgroup_array within our LSM programs,
as follows:

    if (!bpf_current_task_under_cgroup(&cgroup_array, app_idx))
        return -1;

Within our Kubernetes deployment system, we inject this script into the
target pods only if specific annotations, such as "bpf_audit", are present.
Consequently, users do not need to modify their code manually; this process
is handled automatically.

Within our Kubernetes environment, there is only a single instance of these
target pods on each host, so we can conveniently use the array index as the
application ID. In scenarios where multiple instances run on a single host,
however, you need to manage the mapping of instances to array indexes
yourself. For cases with multiple instances, a cgroup_hash may be a more
suitable approach, although that is a separate discussion altogether.

> If it's a systemd hybrid setup, you can get the information from the
> unified hierarchy, which represents the container membership.
>
> If it's a setup without the unified hierarchy, you have to pick one
> hierarchy as a representation of the membership. Which one will it be?

We use the CPU subsystem, and all of our pods have this cgroup subsystem
enabled.

> > Subsequently, we can use this information to create distinct ACLs within
> > our LSM BPF programs, enabling us to control specific operations performed
> > by these tasks.
>
> If one were serious about container-based ACLs, it'd be best to have a
> dedicated and maintained hierarchy for this (I mean a named v1 hierarchy).
> But your implementation omits this, which hints to me that this scenario
> may already be better covered by querying the unified hierarchy.
>
> > Considering the widespread use of cgroup1 in container environments,
> > coupled with the considerable time it will take to transition to cgroup2,
> > implementing this change will significantly enhance the utility of BPF
> > in container scenarios.
>
> If a change like this is not accepted, will it make the transition
> period shorter? (As written above, the unified hierarchy seems a better
> fit for your use case.)

If that change is not accepted upstream, we will need to maintain it
independently in our local kernel :(
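To make the two fragments above concrete, the following is a minimal sketch of the BPF side: a pinned BPF_MAP_TYPE_CGROUP_ARRAY map plus an LSM hook that consults it, mirroring the check quoted above. The map size, the pin-by-name setup, the file_open hook, and the fixed app_idx of 0 are illustrative assumptions, not details from the posted series.

    // SPDX-License-Identifier: GPL-2.0
    /* Sketch only: a pinned cgroup array consulted from an LSM hook.
     * Hook choice, map size and the fixed index are assumptions.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
            __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
            __uint(max_entries, 16);
            __uint(key_size, sizeof(__u32));
            __uint(value_size, sizeof(__u32));
            __uint(pinning, LIBBPF_PIN_BY_NAME);  /* pinned as /sys/fs/bpf/cgroup_array */
    } cgroup_array SEC(".maps");

    SEC("lsm/file_open")
    int BPF_PROG(audit_file_open, struct file *file)
    {
            __u32 app_idx = 0;  /* slot filled by the in-pod updater script */

            /* bpf_current_task_under_cgroup() returns 1 when the current task
             * is inside the cgroup stored at app_idx (0 or a negative error
             * otherwise); mirror the check from the mail above and deny the
             * operation for everything else.
             */
            if (!bpf_current_task_under_cgroup(&cgroup_array, app_idx))
                    return -1;  /* -EPERM */

            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

The updater script quoted above would then populate slot 0 of the pinned map with the pod's cgroup fd, exactly as in the userspace snippet.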
On Thu, Sep 7, 2023 at 7:54 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> In our practice, the cgroup_array map serves as a shared map used by both
> our LSM programs and the target pods, as follows:
>
> [...]
>
> Within the target pods, we employ a script to update the pod's cgroup file
> descriptor into the cgroup_array, for instance:
>
>     cgrp_fd = open("/sys/fs/cgroup/cpu");
>     cgrp_map_fd = bpf_obj_get("/sys/fs/bpf/cgroup_array");
>     bpf_map_update_elem(cgrp_map_fd, &app_idx, &cgrp_fd, 0);
>
> Next, we validate the contents of the cgroup_array within our LSM programs,
> as follows:
>
>     if (!bpf_current_task_under_cgroup(&cgroup_array, app_idx))
>         return -1;
>
> [...]

Is there a reason you cannot use bpf_get_current_cgroup_id() to associate
the task with its cgroup in your LSM prog?
On Sat, Sep 9, 2023 at 2:09 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> Is there a reason you cannot use bpf_get_current_cgroup_id() to associate
> the task with its cgroup in your LSM prog?

Using the cgroup_id as the key serves as a temporary workaround;
nevertheless, employing bpf_get_current_cgroup_id() is impractical because
it only supports cgroup2.

To acquire the cgroup_id, we can resort to open coding, as exemplified below:

    task = bpf_get_current_task_btf();
    cgroups = task->cgroups;
    cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
    key = cgroup->kn->id;

Nonetheless, creating an open-coded version of
bpf_get_current_ancestor_cgroup_id() is unfeasible, since the BPF verifier
prohibits access to "cgrp->ancestors[ancestor_level]".

--
Regards
Yafang
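For reference, the open-coded walk above can be expressed with a CO-RE read in a tracing or LSM program along the following lines. This is only a sketch: the policy hash map, its value layout, and the hook are assumptions, and cpu_cgrp_id comes from the kernel's cgroup_subsys_id enum in vmlinux.h (present when CONFIG_CGROUP_SCHED is enabled).

    // SPDX-License-Identifier: GPL-2.0
    /* Sketch only: look up the current task's cgroup1 (cpu controller) id
     * and use it as a key into an assumed per-container policy map.
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>
    #include <bpf/bpf_tracing.h>

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u64);     /* cgroup kernfs node id */
            __type(value, __u32);   /* per-container policy flags (assumed layout) */
    } policy_by_cgrp_id SEC(".maps");

    SEC("lsm/file_open")
    int BPF_PROG(lookup_by_cgrp1_id, struct file *file)
    {
            struct task_struct *task = bpf_get_current_task_btf();
            __u32 *policy;
            __u64 key;

            /* Open-coded stand-in for bpf_get_current_cgroup_id() on the
             * cgroup1 cpu hierarchy, as described in the mail above.
             */
            key = BPF_CORE_READ(task, cgroups, subsys[cpu_cgrp_id], cgroup, kn, id);

            policy = bpf_map_lookup_elem(&policy_by_cgrp_id, &key);
            if (!policy)
                    return 0;       /* container not in the policy table */

            /* ... apply *policy here ... */
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";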
On Sat, Sep 9, 2023 at 8:18 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>
> Using the cgroup_id as the key serves as a temporary workaround;
> nevertheless, employing bpf_get_current_cgroup_id() is impractical because
> it only supports cgroup2.
>
> To acquire the cgroup_id, we can resort to open coding, as exemplified below:
>
>     task = bpf_get_current_task_btf();
>     cgroups = task->cgroups;
>     cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
>     key = cgroup->kn->id;
>
> Nonetheless, creating an open-coded version of
> bpf_get_current_ancestor_cgroup_id() is unfeasible, since the BPF verifier
> prohibits access to "cgrp->ancestors[ancestor_level]".

Both helpers can be extended to support v1, or not? I mean, can a task be
part of a v1 and a v2 hierarchy at the same time? If not, then
bpf_get_current_cgroup_id() can fall back to what you are describing above
and return that cgroup_id. The same would apply to
bpf_get_current_ancestor_cgroup_id(). Otherwise, two new kfuncs for v1
could be another option.

prog_array for cgroups is an old design. We can and should provide a more
flexible interface nowadays.
On Sun, Sep 10, 2023 at 11:17:48AM +0800, Yafang Shao wrote:
> To acquire the cgroup_id, we can resort to open coding, as exemplified below:
>
>     task = bpf_get_current_task_btf();
>     cgroups = task->cgroups;
>     cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
>     key = cgroup->kn->id;

You can't hardcode it to a specific controller tree like that. You either
stick with the fd-based interface or also need to add something to identify
the specific cgroup1 tree.

Thanks.
On Tue, Sep 12, 2023 at 4:24 AM Tejun Heo <tj@kernel.org> wrote:
>
> On Sun, Sep 10, 2023 at 11:17:48AM +0800, Yafang Shao wrote:
> > To acquire the cgroup_id, we can resort to open coding, as exemplified below:
> >
> >     task = bpf_get_current_task_btf();
> >     cgroups = task->cgroups;
> >     cgroup = cgroups->subsys[cpu_cgrp_id]->cgroup;
> >     key = cgroup->kn->id;
>
> You can't hardcode it to a specific controller tree like that. You either
> stick with the fd-based interface or also need to add something to identify
> the specific cgroup1 tree.

As pointed out by Alexei, I think we can introduce some cgroup_id-based
kfuncs which can work for both cgroup1 and cgroup2. Something as follows
(untested):

    __bpf_kfunc u64 bpf_current_cgroup_id_from_subsys(int subsys)
    {
            struct cgroup *cgroup;

            cgroup = task_cgroup(current, subsys);
            if (!cgroup)
                    return 0;
            return cgroup_id(cgroup);
    }

    __bpf_kfunc struct cgroup *bpf_cgroup_from_subsys_id(u64 cgid, int subsys)
    {
            struct cgroup_subsys_state *css = init_css_set.subsys[subsys];
            struct cgroup *subsys_root = css->cgroup;

            /* We would need to introduce a new helper,
             * cgroup_get_from_subsys_id(), in the cgroup subsystem.
             */
            return cgroup_get_from_subsys_id(subsys_root, cgid);
    }

And change task_under_cgroup_hierarchy() as follows:

 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
                                                struct cgroup *ancestor)
 {
        struct css_set *cset = task_css_set(task);
-
-       return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
+       struct cgroup *cgrp;
+       bool ret = false;
+
+       if (ancestor->root == &cgrp_dfl_root)
+               return cgroup_is_descendant(cset->dfl_cgrp, ancestor);
+
+       cgroup_lock();
+       spin_lock_irq(&css_set_lock);
+       cgrp = task_cgroup_from_root(task, ancestor->root);
+       if (cgrp && cgroup_is_descendant(cgrp, ancestor))
+               ret = true;
+       spin_unlock_irq(&css_set_lock);
+       cgroup_unlock();
+       return ret;
 }

With the above changes, I think it can meet most use cases for BPF on
cgroup1. What do you think?

--
Regards
Yafang
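If kfuncs along these lines were merged, a BPF program would declare and call the id-based one roughly as shown below. This is purely illustrative: the kfunc does not exist upstream, and the ksym declaration, the hook, and the loader-set target_cgid are assumptions.

    /* Sketch only: hypothetical consumer of the kfunc proposed above
     * (which does not exist upstream today).
     */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    extern u64 bpf_current_cgroup_id_from_subsys(int subsys) __ksym;

    const volatile __u64 target_cgid;   /* set by the loader before the program is loaded */

    SEC("lsm/file_open")
    int BPF_PROG(check_open, struct file *file)
    {
            /* Which cgroup1 (cpu controller) group does the current task sit in? */
            __u64 cgid = bpf_current_cgroup_id_from_subsys(cpu_cgrp_id);

            /* Restrict the operation for tasks inside the target cgroup1 group. */
            if (cgid && cgid == target_cgid)
                    return -1;  /* -EPERM */

            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";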
Hello.

On Tue, Sep 12, 2023 at 11:30:32AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> With the above changes, I think it can meet most use cases for BPF on
> cgroup1. What do you think?

I think the presented use case of LSM hooks is better served by the default
hierarchy (see also [1]).

Relying on a chosen subsys v1 hierarchy is not systematic. And extending
ancestry checking to named v1 hierarchies seems backwards given the
existence of the default hierarchy.

Michal

[1] https://docs.kernel.org/admin-guide/cgroup-v2.html#delegation-containment
On Fri, Sep 15, 2023 at 07:01:28PM +0200, Michal Koutný wrote:
> Hello.
>
> On Tue, Sep 12, 2023 at 11:30:32AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > With the above changes, I think it can meet most use cases for BPF on
> > cgroup1. What do you think?
>
> I think the presented use case of LSM hooks is better served by the default
> hierarchy (see also [1]).
>
> Relying on a chosen subsys v1 hierarchy is not systematic. And extending
> ancestry checking to named v1 hierarchies seems backwards given the
> existence of the default hierarchy.

Yeah, identifying cgroup1 hierarchies by subsys leaves out a pretty good
chunk of use cases - e.g. systemd used to use a named hierarchy for primary
process organization on cgroup1.

Also, you don't have to switch to cgroup2 wholesale. You can just build the
same hierarchy in cgroup2 for process organization and combine that with any
cgroup1 hierarchies.

Thanks.
On Mon, Sep 11, 2023 at 8:31 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>
> As pointed out by Alexei, I think we can introduce some cgroup_id-based
> kfuncs which can work for both cgroup1 and cgroup2. Something as follows
> (untested):
>
>     __bpf_kfunc u64 bpf_current_cgroup_id_from_subsys(int subsys)
>     {
>             struct cgroup *cgroup;
>
>             cgroup = task_cgroup(current, subsys);
>             if (!cgroup)
>                     return 0;
>             return cgroup_id(cgroup);
>     }

Can we also support checking arbitrary tasks, instead of just current? I
find myself often needing to find the cgroup given only a task_struct. For
example, when attaching to context switch, I want to know whether the next
task is under a cgroup. Having such a kfunc would be very useful. It could
also be used in task_iter programs.

Hao
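A task-parameterized variant along the lines Hao asks for could look roughly like the kernel-side sketch below; the function name, the RCU handling, and returning 0 for a missing cgroup are assumptions rather than anything posted in this thread.

    /* Sketch only: a task-based variant of the kfunc proposed earlier in
     * the thread; name and locking choices are assumptions.
     */
    __bpf_kfunc u64 bpf_task_cgroup_id_from_subsys(struct task_struct *task, int subsys)
    {
            struct cgroup *cgroup;
            u64 id = 0;

            rcu_read_lock();
            cgroup = task_cgroup(task, subsys);
            if (cgroup)
                    id = cgroup_id(cgroup);
            rcu_read_unlock();

            return id;
    }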
On Sat, Sep 16, 2023 at 1:01 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Tue, Sep 12, 2023 at 11:30:32AM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> > With the above changes, I think it can meet most use cases for BPF on
> > cgroup1. What do you think?
>
> I think the presented use case of LSM hooks is better served by the default
> hierarchy (see also [1]).

Hi Michal,

The crucial issue at hand is not whether the LSM hooks are better suited to
the cgroup default hierarchy. What truly matters is the effort and time
required to migrate all cgroup1-based applications to cgroup2-based ones.
While transitioning a single component from cgroup1 to cgroup2 is a
straightforward task, the complexity arises when multiple interdependent
components in a production environment must make this transition together.
In such cases, the work becomes significantly more challenging.

> Relying on a chosen subsys v1 hierarchy is not systematic. And extending
> ancestry checking to named v1 hierarchies seems backwards given the
> existence of the default hierarchy.

A cgroup becomes active only when it has one or more of its controllers
enabled. In a production environment, a task is invariably governed by at
least one cgroup controller; even in hybrid cgroup mode, a task is subject
to either a cgroup1 controller or a cgroup2 controller. Our objective is to
enhance BPF support for controller-based scenarios, eliminating the need to
concern ourselves with hierarchies, whether they involve cgroup1 or cgroup2.
This change seems quite reasonable, in my opinion.

> [1] https://docs.kernel.org/admin-guide/cgroup-v2.html#delegation-containment

--
Regards
Yafang
On Sat, Sep 16, 2023 at 1:31 AM Tejun Heo <tj@kernel.org> wrote:
>
> [...]
>
> Yeah, identifying cgroup1 hierarchies by subsys leaves out a pretty good
> chunk of use cases - e.g. systemd used to use a named hierarchy for primary
> process organization on cgroup1.

Systemd-managed tasks invariably have one or more cgroup controllers
enabled, as exemplified by entries like
"/sys/fs/cgroup/cpu/{system.slice, user.slice, XXX.service}". Consequently,
the presence of a cgroup controller can be used as an indicator to identify
a systemd-managed task.

> Also, you don't have to switch to cgroup2 wholesale. You can just build the
> same hierarchy in cgroup2 for process organization and combine that with any
> cgroup1 hierarchies.

The challenge lies in the need to adapt a multitude of applications to such
a layout, and furthermore, not all of these applications are under our
direct management. This makes it a formidable task.
On Sat, Sep 16, 2023 at 2:57 AM Hao Luo <haoluo@google.com> wrote:
>
> [...]
>
> Can we also support checking arbitrary tasks, instead of just current? I
> find myself often needing to find the cgroup given only a task_struct. For
> example, when attaching to context switch, I want to know whether the next
> task is under a cgroup. Having such a kfunc would be very useful. It could
> also be used in task_iter programs.

Agreed. Will do.
On Sun, Sep 17, 2023 at 03:19:06PM +0800, Yafang Shao <laoar.shao@gmail.com> wrote:
> The crucial issue at hand is not whether the LSM hooks are better suited to
> the cgroup default hierarchy. What truly matters is the effort and time
> required to migrate all cgroup1-based applications to cgroup2-based ones.
> While transitioning a single component from cgroup1 to cgroup2 is a
> straightforward task, the complexity arises when multiple interdependent
> components in a production environment must make this transition together.
> In such cases, the work becomes significantly more challenging.

systemd's hybrid mode is the approach that helps such combined environments.
(I understand that it's not warranted with all container runtimes, but FYI.)
On v1-only deployments, BPF predicates couldn't be used at all currently.

The transition is transitional, but the accompanying complexity in the code
would have to be kept around much longer.

> Our objective is to enhance BPF support for controller-based scenarios,
> eliminating the need to concern ourselves with hierarchies, whether they
> involve cgroup1 or cgroup2.

I'm posting some notes on this to the 1st patch.

Regards,
Michal