Message ID: 20230607235352.1723243-1-andrii@kernel.org (mailing list archive)
Series: BPF token
On 06/07, Andrii Nakryiko wrote: > This patch set introduces a new BPF object, BPF token, which allows delegating > a subset of BPF functionality from a privileged system-wide daemon (e.g., > systemd or any other container manager) to a *trusted* unprivileged > application. Trust is the key here. This functionality is not about allowing > unconditional unprivileged BPF usage. Establishing trust, though, is > completely up to the discretion of the respective privileged application that > would create a BPF token. > > The main motivation for BPF token is a desire to enable containerized > BPF applications to be used together with user namespaces. This is currently > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > helpers like bpf_probe_read_kernel() and bpf_probe_read_user(), can safely read > arbitrary memory, and it's impossible to ensure that they only read memory of > processes belonging to any given namespace. This means that it's impossible to > have a namespace-aware CAP_BPF capability, and as such another mechanism to > allow safe usage of BPF functionality is necessary. BPF token and delegation > of it to trusted unprivileged applications is such a mechanism. The kernel makes > no assumptions about what "trusted" constitutes in any particular case, and > it's up to specific privileged applications and their surrounding > infrastructure to decide that. What the kernel provides is a set of APIs to create > and tune a BPF token, and to pass it around to privileged BPF commands that > create new BPF objects like BPF programs, BPF maps, etc. > > A previous attempt at addressing this very same problem ([0]) utilized the > authoritative LSM approach, but was conclusively rejected by upstream > LSM maintainers. The BPF token concept does not change anything about the LSM > approach, but can be combined with LSM hooks for very fine-grained security > policy. Some ideas about making BPF token more convenient to use with LSM (in > particular custom BPF LSM programs) were briefly described in a recent LSF/MM/BPF > 2023 presentation ([1]): e.g., an ability to specify user-provided data > (context), which in combination with BPF LSM would allow implementing very > dynamic and fine-grained custom security policies on top of BPF token. In the > interest of minimizing API surface area discussions, this is going to be > added in follow-up patches, as it's not essential to the fundamental concept > of a delegatable BPF token. > > It should be noted that BPF token is conceptually quite similar to the idea of > a /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > difference is the idea of using a virtual anon_inode file to hold the BPF token and > allowing multiple independent instances of them, each with its own set of > restrictions. BPF pinning solves the problem of exposing such a BPF token > through the file system (BPF FS, in this case) for cases where transferring FDs > over Unix domain sockets is not convenient. And also, crucially, the BPF token > approach does not use any special stateful task-scoped flags. Instead, the bpf() > syscall accepts a token_fd parameter explicitly for each relevant BPF command. > This addresses the main concerns brought up during the /dev/bpf discussion, and > fits better with the overall BPF subsystem design. > > This patch set adds the basic minimum of functionality to make BPF token useful > and to discuss the API and functionality. 
Currently only low-level libbpf APIs > support passing BPF token around, which allows kernel functionality to be tested, but > this is for the most part not sufficient for real-world applications, which > typically use high-level libbpf APIs based on the `struct bpf_object` type. This > was done with the intent to limit the size of the patch set and concentrate > mostly on kernel-side changes. All the necessary plumbing for libbpf will be sent > as a separate follow-up patch set once kernel support makes it upstream. > > Another part that should happen once kernel-side BPF token support is established is > a set of conventions between applications (e.g., systemd), tools (e.g., > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > at well-defined locations, to allow applications to take advantage of this in an > automatic fashion without explicit code changes on the BPF application's side. > But I'd like to postpone this discussion until after the BPF token concept lands. > > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ > > v1->v2: > - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset; > - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav). I went through v2, and everything makes sense; the only thing that is slightly confusing to me is the bpf_token_capable() call. The name somehow implies that the token is capable of something, whereas in reality the function does "return token || capable(x)". IMO, it would be less confusing if we do something like the following, explicitly, instead of calling a function: if (token || {bpf_,perfmon_,}capable(x)) ... (or rename to something like bpf_token_or_capable(x)) Up to you on whether to take any action on that. OTOH, once you grasp what bpf_token_capable really does, it's not really a problem.
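For reference, the check being discussed boils down to something like the following kernel-side sketch. This only illustrates the "return token || capable(x)" semantics described above; the actual signature and internals in the patch set may differ:

	/* Illustrative sketch of a token-aware capability check; not the
	 * exact implementation from the patch set.
	 */
	static bool bpf_token_capable(const struct bpf_token *token, int cap)
	{
		/* a supplied token is assumed to grant the relevant
		 * permission for the operation being performed
		 */
		if (token)
			return true;
		/* otherwise fall back to the regular capability check */
		return capable(cap);
	}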
On Thu, Jun 8, 2023 at 11:49 AM Stanislav Fomichev <sdf@google.com> wrote: > > On 06/07, Andrii Nakryiko wrote: > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token. > > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. What kernel provides is a set of APIs to create > > and tune BPF token, and pass it around to privileged BPF commands that are > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > LSM maintainers. BPF token concept is not changing anything about LSM > > approach, but can be combined with LSM hooks for very fine-grained security > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > (context), which in combination with BPF LSM would allow implementing a very > > dynamic and fine-granular custom security policies on top of BPF token. In the > > interest of minimizing API surface area discussions this is going to be > > added in follow up patches, as it's not essential to the fundamental concept > > of delegatable BPF token. > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > difference is the idea of using virtual anon_inode file to hold BPF token and > > allowing multiple independent instances of them, each with its own set of > > restrictions. BPF pinning solves the problem of exposing such BPF token > > through file system (BPF FS, in this case) for cases where transferring FDs > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > syscall accepts token_fd parameters explicitly for each relevant BPF command. > > This addresses main concerns brought up during the /dev/bpf discussion, and > > fits better with overall BPF subsystem design. 
> > > > This patch set adds a basic minimum of functionality to make BPF token useful > > and to discuss API and functionality. Currently only low-level libbpf APIs > > support passing BPF token around, allowing to test kernel functionality, but > > for the most part is not sufficient for real-world applications, which > > typically use high-level libbpf APIs based on `struct bpf_object` type. This > > was done with the intent to limit the size of patch set and concentrate on > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent > > as a separate follow up patch set kernel support makes it upstream. > > > > Another part that should happen once kernel-side BPF token is established, is > > a set of conventions between applications (e.g., systemd), tools (e.g., > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > > at well-defined locations to allow applications take advantage of this in > > automatic fashion without explicit code changes on BPF application's side. > > But I'd like to postpone this discussion to after BPF token concept lands. > > > > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ > > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf > > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ > > > > v1->v2: > > - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset; > > - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav). > > I went through v2, everything makes sense, the only thing that is > slightly confusing to me is the bpf_token_capable() call. > The name somehow implies that the token is capable of something > where in reality the function does "return token || capable(x)". heh, "bpf_token_" part is sort of like namespace/object prefix. The intent here was to have a token-aware capable check. And yes, if we get a token during prog/map/etc construction, the assumption is that it provides all relevant permissions. > > IMO, it would be less confusing if we do something like the following, > explicitly, instead of calling a function: > > if (token || {bpf_,perfmon_,}capable(x)) ... > > (or rename to something like bpf_token_or_capable(x)) I'd rather not open-code `if (token || ...)` checks everywhere, but I can rename to `bpf_token_or_capable()` if people prefer. I erred on the side of succinctness, but if it's confusing, then best to rename? > > Up to you on whether to take any action on that. OTOH, once you > grasp what bpf_token_capable really does, it's not really a problem. Cool, thanks for taking a look!
Andrii Nakryiko <andrii@kernel.org> writes: > This patch set introduces new BPF object, BPF token, which allows to delegate > a subset of BPF functionality from privileged system-wide daemon (e.g., > systemd or any other container manager) to a *trusted* unprivileged > application. Trust is the key here. This functionality is not about allowing > unconditional unprivileged BPF usage. Establishing trust, though, is > completely up to the discretion of respective privileged application that > would create a BPF token. I am not convinced that this token-based approach is a good way to solve this: having the delegation mechanism be one where you can basically only grant a perpetual delegation with no way to retract it, no way to check what exactly it's being used for, and that is transitive (can be passed on to others with no restrictions) seems like a recipe for disaster. I believe this was basically the point Casey was making as well in response to v1. If the goal is to enable a privileged application (such as a container manager) to grant another unprivileged application the permission to perform certain bpf() operations, why not just proxy the operations themselves over some RPC mechanism? That way the granting application can perform authentication checks on every operation and ensure its origins are sound at the time it is being made. Instead of just writing a blank check (in the form of a token) and hoping the receiver of it is not compromised... -Toke
On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > > Andrii Nakryiko <andrii@kernel.org> writes: > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token. > > I am not convinced that this token-based approach is a good way to solve > this: having the delegation mechanism be one where you can basically > only grant a perpetual delegation with no way to retract it, no way to > check what exactly it's being used for, and that is transitive (can be > passed on to others with no restrictions) seems like a recipe for > disaster. I believe this was basically the point Casey was making as > well in response to v1. Most of this can be added, if we really need to. The ability to revoke a BPF token is easy to implement (though of course it will apply only to subsequent operations). We can allocate an ID for a BPF token just like we do for BPF prog/map/link and let tools iterate and fetch information about it. As for controlling who's passing what and where, I don't think the situation is different from any other FD-based mechanism. You might as well create a BPF map/prog/link, pass it through SCM_RIGHTS or BPF FS, and that application can keep doing the same to other processes. Ultimately, applications that need BPF currently run with root permissions. That's already very dangerous. But just because something might be misused or abused doesn't prevent us from making good practical use of it, right? Also, there is LSM on top of all of this to override and control how the BPF subsystem is used, regardless of BPF token. It can override any of the privilege mechanisms: capabilities, BPF token, whatnot. > > If the goal is to enable a privileged application (such as a container > manager) to grant another unprivileged application the permission to > perform certain bpf() operations, why not just proxy the operations > themselves over some RPC mechanism? That way the granting application It's explicitly what we *do not* want to do, as it is a major problem and logistical complication. Every single application will have to be rewritten to use such a special daemon/service and its API, which is completely different from the bpf() syscall API. It invalidates the use of all the libbpf (and other bpf libraries') APIs; BPF skeleton is incompatible with this. It's a nightmare. I've got feedback from people in another company that do have a BPF service with just a tiny subset of BPF functionality delegated to such a service, and it's a pain and definitely not a preferred way to do things. Just think about having to mirror a big chunk of the bpf() syscall as an RPC. So no, BPF proxy is definitely not a good solution. > can perform authentication checks on every operation and ensure its > origins are sound at the time it is being made. Instead of just writing > a blank check (in the form of a token) and hoping the receiver of it is > not compromised... All this could and should be done through LSM in a much more decoupled and transparent (to the application) way. BPF token doesn't prevent this. 
It actually helps with this, because organizations can dictate that operations that do not provide a BPF token are automatically rejected, and those that do provide one can be further checked and granted or rejected based on the specific BPF token instance. > > -Toke
On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote: > This patch set introduces new BPF object, BPF token, which allows to delegate > a subset of BPF functionality from privileged system-wide daemon (e.g., > systemd or any other container manager) to a *trusted* unprivileged > application. Trust is the key here. This functionality is not about allowing > unconditional unprivileged BPF usage. Establishing trust, though, is > completely up to the discretion of respective privileged application that > would create a BPF token. > I skimmed the description and the LSFMM slides. Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such). It went nowhere. Where does BPF token fit in? Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container? Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote: > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token. > > > > I skimmed the description and the LSFMM slides. > > Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such). It went nowhere. > > Where does BPF token fit in? Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container? Yes?.. In the sense that it is possible to create BPF programs and BPF maps from inside the container (with a BPF token). Right now, under a user namespace, it's impossible no matter what you do. > Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. BPF is still a privileged thing. You can't just say that any unprivileged application should be able to use BPF. That's why BPF token is about trusting an unpriv application in a controlled environment (production) to not do something crazy. It can be enforced further through LSM usage, but in a lot of cases, when dealing with internal production applications, it's enough to have a proper application design and rely on the code review process to avoid any negative effects. So a privileged daemon (container manager) will be configured with the knowledge of which services/containers are allowed to use BPF, and will grant a BPF token only to those that were explicitly allowlisted.
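To make that flow concrete, here is a rough userspace sketch of both sides. This is only an illustration under stated assumptions: BPF_TOKEN_CREATE comes from the patch set's UAPI headers (not mainline headers at the time of this discussion), and the map_token_fd attr field name is hypothetical, not necessarily the exact v2 UAPI:

	/* Rough sketch of the envisioned split; assumes UAPI headers from
	 * the patch set (BPF_TOKEN_CREATE) and uses a hypothetical
	 * map_token_fd attr field purely for illustration.
	 */
	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
	{
		return syscall(__NR_bpf, cmd, attr, size);
	}

	/* privileged daemon side (CAP_SYS_ADMIN in init userns) */
	static int create_bpf_token(void)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		/* restrictions on the delegated BPF subset would be tuned here */
		return sys_bpf(BPF_TOKEN_CREATE, &attr, sizeof(attr));
	}

	/* allowlisted workload side: token_fd was received over
	 * SCM_RIGHTS or looked up in BPF FS
	 */
	static int create_map_with_token(int token_fd)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.map_type = BPF_MAP_TYPE_ARRAY;
		attr.key_size = 4;
		attr.value_size = 8;
		attr.max_entries = 16;
		attr.map_token_fd = token_fd; /* hypothetical field name */
		return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
	}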
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: >> >> Andrii Nakryiko <andrii@kernel.org> writes: >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate >> > a subset of BPF functionality from privileged system-wide daemon (e.g., >> > systemd or any other container manager) to a *trusted* unprivileged >> > application. Trust is the key here. This functionality is not about allowing >> > unconditional unprivileged BPF usage. Establishing trust, though, is >> > completely up to the discretion of respective privileged application that >> > would create a BPF token. >> >> I am not convinced that this token-based approach is a good way to solve >> this: having the delegation mechanism be one where you can basically >> only grant a perpetual delegation with no way to retract it, no way to >> check what exactly it's being used for, and that is transitive (can be >> passed on to others with no restrictions) seems like a recipe for >> disaster. I believe this was basically the point Casey was making as >> well in response to v1. > > Most of this can be added, if we really need to. Ability to revoke BPF > token is easy to implement (though of course it will apply only for > subsequent operations). We can allocate ID for BPF token just like we > do for BPF prog/map/link and let tools iterate and fetch information > about it. As for controlling who's passing what and where, I don't > think the situation is different for any other FD-based mechanism. You > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS > or BPF FS, and that application can keep doing the same to other > processes. No, but every other fd-based mechanism is limited in scope. E.g., if you pass a map fd, that's one specific map that can be passed around; with a token it's all operations (of a specific type), which is way broader. > Ultimately, currently we have root permissions for applications that > need BPF. That's already very dangerous. But just because something > might be misused or abused doesn't prevent us from making a good > practical use of it, right? That's not a given. It's always a trade-off, and if the mechanism is likely to open up the system to additional risk, that's not a good trade-off even if it helps in some cases. I basically worry that this is the case here. > Also, there is LSM on top of all of this to override and control how > the BPF subsystem is used, regardless of BPF token. It can override > any of the privileges mechanism, capabilities, BPF token, whatnot. If this mechanism needs an LSM to be used safely, that's not incredibly confidence-inspiring. Security mechanisms should fail safe, which this one does not. I'm also worried that an LSM policy is the only way to disable the ability to create a token; with this in the kernel, I suddenly have to trust not only that all applications with BPF privileges will not load malicious code, but also that they won't (accidentally or maliciously) convey extra privileges to someone else. Seems a bit broad to have this ability (to issue tokens) available to everyone with access to the bpf() syscall, when (IIUC) it's only a single daemon in the system that would legitimately do this in the deployment you're envisioning. 
>> If the goal is to enable a privileged application (such as a container >> manager) to grant another unprivileged application the permission to >> perform certain bpf() operations, why not just proxy the operations >> themselves over some RPC mechanism? That way the granting application > > It's explicitly what we *do not* want to do, as it is a major problem > and logistical complication. Every single application will have to be > rewritten to use such a special daemon/service and its API, which is > completely different from bpf() syscall API. It invalidates the use of > all the libbpf (and other bpf libraries') APIs, BPF skeleton is > incompatible with this. It's a nightmare. I've got feedback from > people in another company that do have BPF service with just a tiny > subset of BPF functionality delegated to such service, and it's a pain > and definitely not a preferred way to do things. But weren't you proposing that libbpf should be able to transparently look for tokens and load them without any application changes? Why can't libbpf be taught to use an RPC socket in a similar fashion? It basically boils down to something like: static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size) { if (!stat("/run/bpf.sock")) { sock = open_socket("/run/bpf.sock"); write_to(sock, cmd, attr, size); return read_response(sock); } else { return syscall(__NR_bpf, cmd, attr, size); } } > Just think about having to mirror a big chunk of bpf() syscall as an > RPC. So no, BPF proxy is definitely not a good solution. The daemon at the other side of the socket in the example above doesn't *have* to be taught all the semantics of the syscall, it can just look at the command name and make a decision based on that and the identity of the socket peer, then just pass the whole thing to the kernel if the permission check passes. >> can perform authentication checks on every operation and ensure its >> origins are sound at the time it is being made. Instead of just writing >> a blank check (in the form of a token) and hoping the receiver of it is >> not compromised... > > All this could and should be done through LSM in much more decoupled > and transparent (to application) way. BPF token doesn't prevent this. > It actually helps with this, because organizations can actually > dictate that operations that do not provide BPF token are > automatically rejected, and those that do provide BPF token can be > further checked and granted or rejected based on specific BPF token > instance. See above re: needing an LSM policy to make this safe... -Toke
On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > >> > >> Andrii Nakryiko <andrii@kernel.org> writes: > >> > >> > This patch set introduces new BPF object, BPF token, which allows to delegate > >> > a subset of BPF functionality from privileged system-wide daemon (e.g., > >> > systemd or any other container manager) to a *trusted* unprivileged > >> > application. Trust is the key here. This functionality is not about allowing > >> > unconditional unprivileged BPF usage. Establishing trust, though, is > >> > completely up to the discretion of respective privileged application that > >> > would create a BPF token. > >> > >> I am not convinced that this token-based approach is a good way to solve > >> this: having the delegation mechanism be one where you can basically > >> only grant a perpetual delegation with no way to retract it, no way to > >> check what exactly it's being used for, and that is transitive (can be > >> passed on to others with no restrictions) seems like a recipe for > >> disaster. I believe this was basically the point Casey was making as > >> well in response to v1. > > > > Most of this can be added, if we really need to. Ability to revoke BPF > > token is easy to implement (though of course it will apply only for > > subsequent operations). We can allocate ID for BPF token just like we > > do for BPF prog/map/link and let tools iterate and fetch information > > about it. As for controlling who's passing what and where, I don't > > think the situation is different for any other FD-based mechanism. You > > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS > > or BPF FS, and that application can keep doing the same to other > > processes. > > No, but every other fd-based mechanism is limited in scope. E.g., if you > pass a map fd that's one specific map that can be passed around, with a > token it's all operations (of a specific type) which is way broader. It's not black and white. Once you have a BPF program FD, you can attach it many times, for example, and cause regressions. Sure, here we are talking about creating multiple BPF maps or loading multiple BPF programs, so it's wider in scope, but still, it's not that fundamentally different. > > > Ultimately, currently we have root permissions for applications that > > need BPF. That's already very dangerous. But just because something > > might be misused or abused doesn't prevent us from making a good > > practical use of it, right? > > That's not a given. It's always a trade-off, and if the mechanism is > likely to open up the system to additional risk that's not a good > trade-off even if it helps in some case. I basically worry that this is > the case here. > > > Also, there is LSM on top of all of this to override and control how > > the BPF subsystem is used, regardless of BPF token. It can override > > any of the privileges mechanism, capabilities, BPF token, whatnot. > > If this mechanism needs an LSM to be used safely, that's not incredibly > confidence-inspiring. Security mechanisms should fail safe, which this > one does not. I proposed to add authoritative LSM hooks that would selectively allow some BPF operations on a case-by-case basis. This was rejected, claiming that the best approach is to give a process the privilege to do whatever it needs to do and then restrict it with LSM. 
Ok, if not for user namespaces, that would mean giving the application CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN, and then restricting it with LSM. Except that with user namespaces this doesn't work. So that's where BPF token comes in: it allows doing this more safely, by coarsely tuning what subset of BPF operations is granted. And then LSM should be used to further restrict it. > > I'm also worried that an LSM policy is the only way to disable the > ability to create a token; with this in the kernel, I suddenly have to > trust not only that all applications with BPF privileges will not load > malicious code, but also that they won't (accidentally or maliciously) > conveys extra privileges on someone else. Seems a bit broad to have this > ability (to issue tokens) available to everyone with access to the bpf() > syscall, when (IIUC) it's only a single daemon in the system that would > legitimately do this in the deployment you're envisioning. Note, any process with real CAP_SYS_ADMIN. Let's not forget that. But would you feel better if BPF_TOKEN_CREATE was guarded behind a sysctl or Kconfig option? Ultimately, worrying is fine, but there are real problems that need to be solved. And not doing anything isn't a great option. > > >> If the goal is to enable a privileged application (such as a container > >> manager) to grant another unprivileged application the permission to > >> perform certain bpf() operations, why not just proxy the operations > >> themselves over some RPC mechanism? That way the granting application > > > > It's explicitly what we *do not* want to do, as it is a major problem > > and logistical complication. Every single application will have to be > > rewritten to use such a special daemon/service and its API, which is > > completely different from bpf() syscall API. It invalidates the use of > > all the libbpf (and other bpf libraries') APIs, BPF skeleton is > > incompatible with this. It's a nightmare. I've got feedback from > > people in another company that do have BPF service with just a tiny > > subset of BPF functionality delegated to such service, and it's a pain > > and definitely not a preferred way to do things. > > But weren't you proposing that libbpf should be able to transparently > look for tokens and load them without any application changes? Why can't > libbpf be taught to use an RPC socket in a similar fashion? It basically > boils down to something like: > > static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, > unsigned int size) > { > if (!stat("/run/bpf.sock")) { > sock = open_socket("/run/bpf.sock"); > write_to(sock, cmd, attr, size); > return read_response(sock); > } else { > return syscall(__NR_bpf, cmd, attr, size); > } > } > Well, for one, Meta will use its own Thrift-based RPC protocol, Google might use something internal built on gRPC, someone else would want to utilize systemd, and yet others will use yet another implementation. RPC also introduces more failure modes. While with a syscall we know that an operation either succeeded or failed, with RPC we'll have to deal with "maybe" in the case of a communication error. Let's not trivialize adding, using, and supporting an RPC version of the bpf() syscall. > > Just think about having to mirror a big chunk of bpf() syscall as an > RPC. So no, BPF proxy is definitely not a good solution. 
> > The daemon at the other side of the socket in the example above doesn't > *have* to be taught all the semantics of the syscall, it can just look > at the command name and make a decision based on that and the identity > of the socket peer, then just pass the whole thing to the kernel if the > permission check passes. Let's not trivialize the consequences of adding an RPC protocol to all this, please. No matter in what form or shape. > > >> can perform authentication checks on every operation and ensure its > >> origins are sound at the time it is being made. Instead of just writing > >> a blank check (in the form of a token) and hoping the receiver of it is > >> not compromised... > > > > All this could and should be done through LSM in much more decoupled > > and transparent (to application) way. BPF token doesn't prevent this. > > It actually helps with this, because organizations can actually > > dictate that operations that do not provide BPF token are > > automatically rejected, and those that do provide BPF token can be > > further checked and granted or rejected based on specific BPF token > > instance. > > See above re: needing an LSM policy to make this safe... See above. We are talking about the CAP_SYS_ADMIN-enabled process. It's not safe by definition already. > > -Toke
Hi Andrii, On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > This patch set introduces new BPF object, BPF token, which allows to delegate > a subset of BPF functionality from privileged system-wide daemon (e.g., > systemd or any other container manager) to a *trusted* unprivileged > application. Trust is the key here. This functionality is not about allowing > unconditional unprivileged BPF usage. Establishing trust, though, is > completely up to the discretion of respective privileged application that > would create a BPF token. > > The main motivation for BPF token is a desire to enable containerized > BPF applications to be used together with user namespaces. This is currently > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > arbitrary memory, and it's impossible to ensure that they only read memory of > processes belonging to any given namespace. This means that it's impossible to > have namespace-aware CAP_BPF capability, and as such another mechanism to > allow safe usage of BPF functionality is necessary. BPF token and delegation > of it to a trusted unprivileged applications is such mechanism. Kernel makes > no assumption about what "trusted" constitutes in any particular case, and > it's up to specific privileged applications and their surrounding > infrastructure to decide that. What kernel provides is a set of APIs to create > and tune BPF token, and pass it around to privileged BPF commands that are > creating new BPF objects like BPF programs, BPF maps, etc. Is there a reason for coupling this only with the userns? The "trusted unprivileged" assumed by systemd can be in init userns? > Previous attempt at addressing this very same problem ([0]) attempted to > utilize authoritative LSM approach, but was conclusively rejected by upstream > LSM maintainers. BPF token concept is not changing anything about LSM > approach, but can be combined with LSM hooks for very fine-grained security > policy. Some ideas about making BPF token more convenient to use with LSM (in > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > 2023 presentation ([1]). E.g., an ability to specify user-provided data > (context), which in combination with BPF LSM would allow implementing a very > dynamic and fine-granular custom security policies on top of BPF token. In the > interest of minimizing API surface area discussions this is going to be > added in follow up patches, as it's not essential to the fundamental concept > of delegatable BPF token. > > It should be noted that BPF token is conceptually quite similar to the idea of > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > difference is the idea of using virtual anon_inode file to hold BPF token and > allowing multiple independent instances of them, each with its own set of > restrictions. BPF pinning solves the problem of exposing such BPF token > through file system (BPF FS, in this case) for cases where transferring FDs > over Unix domain sockets is not convenient. And also, crucially, BPF token > approach is not using any special stateful task-scoped flags. Instead, bpf() What's the use case for transfering over unix domain sockets? Will BPF token translation happen if you cross the different namespaces? 
If the token is pinned into different bpffs, will the token share the same context?
On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > Hi Andrii, > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token. > > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. What kernel provides is a set of APIs to create > > and tune BPF token, and pass it around to privileged BPF commands that are > > creating new BPF objects like BPF programs, BPF maps, etc. > > Is there a reason for coupling this only with the userns? There is no coupling. Without userns it is at least possible to grant CAP_BPF and other capabilities from init ns. With user namespace that becomes impossible. > The "trusted unprivileged" assumed by systemd can be in init userns? It doesn't have to be systemd, but yes, BPF token can be created only when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family of commands). > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > LSM maintainers. BPF token concept is not changing anything about LSM > > approach, but can be combined with LSM hooks for very fine-grained security > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > (context), which in combination with BPF LSM would allow implementing a very > > dynamic and fine-granular custom security policies on top of BPF token. In the > > interest of minimizing API surface area discussions this is going to be > > added in follow up patches, as it's not essential to the fundamental concept > > of delegatable BPF token. > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > /dev/bpf device file, proposed by Song a while ago ([2]). 
The biggest > > difference is the idea of using virtual anon_inode file to hold BPF token and > > allowing multiple independent instances of them, each with its own set of > > restrictions. BPF pinning solves the problem of exposing such BPF token > > through file system (BPF FS, in this case) for cases where transferring FDs > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > What's the use case for transfering over unix domain sockets? I'm not sure I understand the question. A Unix domain socket (specifically its SCM_RIGHTS ancillary message) allows transferring files between processes, which is one way to pass a BPF object (like prog/map/link, and now token). BPF FS is the other one. In practice it's usually BPF FS, but there is no presumption about how the file reference is transferred. > > Will BPF token translation happen if you cross the different namespaces? What does BPF token translation mean specifically? Currently it's a very simple kernel object with a refcnt and a few flags, so there is nothing to translate? > > If the token is pinned into different bpffs, will the token share the same context? So I was planning to allow a user process creating a BPF token to specify custom user-provided data (context). This is not in this patch set, but is it what you are asking about? Regardless, pinning a BPF object in BPF FS basically just bumps a refcnt and exposes that object in a way that can be looked up through a file system path (using the bpf() syscall's BPF_OBJ_GET command). The underlying object isn't cloned or copied; it's exactly the same object with the same shared internal state.
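In code, that pin/lookup round-trip uses the long-standing BPF_OBJ_PIN and BPF_OBJ_GET commands. A minimal sketch, assuming a token FD can be pinned like any other BPF object FD (error handling omitted):

	/* Minimal sketch of the pin/lookup round-trip described above;
	 * pinning only bumps a refcount, the object itself is not copied.
	 */
	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
	{
		return syscall(__NR_bpf, cmd, attr, size);
	}

	/* privileged side: expose the token at a well-known BPF FS path */
	static int pin_token(int token_fd, const char *path)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.pathname = (__u64)(unsigned long)path;
		attr.bpf_fd = token_fd;
		return sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
	}

	/* consumer side: look the same shared object back up by path */
	static int get_token(const char *path)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.pathname = (__u64)(unsigned long)path;
		return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
	}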
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: >> >> >> >> Andrii Nakryiko <andrii@kernel.org> writes: >> >> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g., >> >> > systemd or any other container manager) to a *trusted* unprivileged >> >> > application. Trust is the key here. This functionality is not about allowing >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is >> >> > completely up to the discretion of respective privileged application that >> >> > would create a BPF token. >> >> >> >> I am not convinced that this token-based approach is a good way to solve >> >> this: having the delegation mechanism be one where you can basically >> >> only grant a perpetual delegation with no way to retract it, no way to >> >> check what exactly it's being used for, and that is transitive (can be >> >> passed on to others with no restrictions) seems like a recipe for >> >> disaster. I believe this was basically the point Casey was making as >> >> well in response to v1. >> > >> > Most of this can be added, if we really need to. Ability to revoke BPF >> > token is easy to implement (though of course it will apply only for >> > subsequent operations). We can allocate ID for BPF token just like we >> > do for BPF prog/map/link and let tools iterate and fetch information >> > about it. As for controlling who's passing what and where, I don't >> > think the situation is different for any other FD-based mechanism. You >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS >> > or BPF FS, and that application can keep doing the same to other >> > processes. >> >> No, but every other fd-based mechanism is limited in scope. E.g., if you >> pass a map fd that's one specific map that can be passed around, with a >> token it's all operations (of a specific type) which is way broader. > > It's not black and white. Once you have a BPF program FD, you can > attach it many times, for example, and cause regressions. Sure, here > we are talking about creating multiple BPF maps or loading multiple > BPF programs, so it's wider in scope, but still, it's not that > fundamentally different. Right, but the difference is that a single BPF program is a known entity, so even if the application you pass the fd to can attach it multiple times, it can't make it do new things (e.g., bpf_probe_read() stuff it is not supposed to). Whereas with bpf_token you have no such guarantee. >> >> > Ultimately, currently we have root permissions for applications that >> > need BPF. That's already very dangerous. But just because something >> > might be misused or abused doesn't prevent us from making a good >> > practical use of it, right? >> >> That's not a given. It's always a trade-off, and if the mechanism is >> likely to open up the system to additional risk that's not a good >> trade-off even if it helps in some case. I basically worry that this is >> the case here. >> >> > Also, there is LSM on top of all of this to override and control how >> > the BPF subsystem is used, regardless of BPF token. It can override >> > any of the privileges mechanism, capabilities, BPF token, whatnot. 
>> >> If this mechanism needs an LSM to be used safely, that's not incredibly >> confidence-inspiring. Security mechanisms should fail safe, which this >> one does not. > > I proposed to add authoritative LSM hooks that would selectively allow > some of BPF operations on a case-by-case basis. This was rejected, > claiming that the best approach is to give process privilege to do > whatever it needs to do and then restrict it with LSM. > > Ok, if not for user namespaces, that would mean giving application > CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN, and then restrict it > with LSM. Except with user namespace that doesn't work. So that's > where BPF token comes in, but allows it to do it more safely by > allowing to coarsely tune what subset of BPF operations is granted. > And then LSM should be used to further restrict it. Right, I do understand the use case, my worry is that we're creating a privilege escalation model that is really broad if it is *not* coupled with an LSM to restrict it. Which will be the default outside of controlled environments that really know what they are doing. So I dunno, maybe some way to restrict the token so it only grants privilege if there is *also* an explicit LSM verdict on it? I guess that's still too close to an authoritative LSM hook that it'll pass? I do think the "explicit grant" model of an authoritative LSM is a better fit for this kind of thing... >> I'm also worried that an LSM policy is the only way to disable the >> ability to create a token; with this in the kernel, I suddenly have to >> trust not only that all applications with BPF privileges will not load >> malicious code, but also that they won't (accidentally or maliciously) >> conveys extra privileges on someone else. Seems a bit broad to have this >> ability (to issue tokens) available to everyone with access to the bpf() >> syscall, when (IIUC) it's only a single daemon in the system that would >> legitimately do this in the deployment you're envisioning. > > Note, any process with real CAP_SYS_ADMIN. Let's not forget that. > > But would you feel better if BPF_TOKEN_CREATE was guarded behind > sysctl or Kconfig? Hmm, yeah, some way to make sure it's off by default would be preferable, IMO. > Ultimately, worrying is fine, but there are real problems that need to > be solved. And not doing anything isn't a great option. Right, it would be good if some of the security folks could chime in with their view of how this is best achieved without running into any of the "bad ideas" they are opposed to. >> >> If the goal is to enable a privileged application (such as a container >> >> manager) to grant another unprivileged application the permission to >> >> perform certain bpf() operations, why not just proxy the operations >> >> themselves over some RPC mechanism? That way the granting application >> > >> > It's explicitly what we *do not* want to do, as it is a major problem >> > and logistical complication. Every single application will have to be >> > rewritten to use such a special daemon/service and its API, which is >> > completely different from bpf() syscall API. It invalidates the use of >> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is >> > incompatible with this. It's a nightmare. I've got feedback from >> > people in another company that do have BPF service with just a tiny >> > subset of BPF functionality delegated to such service, and it's a pain >> > and definitely not a preferred way to do things. 
>> But weren't you proposing that libbpf should be able to transparently >> look for tokens and load them without any application changes? Why can't >> libbpf be taught to use an RPC socket in a similar fashion? It basically >> boils down to something like: >> >> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, >> unsigned int size) >> { >> if (!stat("/run/bpf.sock")) { >> sock = open_socket("/run/bpf.sock"); >> write_to(sock, cmd, attr, size); >> return read_response(sock); >> } else { >> return syscall(__NR_bpf, cmd, attr, size); >> } >> } >> > > Well, for one, Meta we'll use its own Thrift-based RPC protocol. > Google might use something internal for them using GRPC, someone else > would want to utilize systemd, yet others will use yet another > implementation. RPC introduces more failure modes. While with syscall > we know that operation either succeeded or failed, with RPC we'll have > to deal with "maybe", if it was some communication error. > > Let's not trivialize adding, using, and supporting the RPC version of > bpf() syscall. I am not trying to trivialise it; I am well aware that it is more complicated in practice than just adding a wrapper like the above. I am just arguing against your point that "all applications need to change, so we can't do RPC". Any mechanism we add along these lines will require application changes, including the BPF token. And if the way we're going to avoid that is by baking the support into libbpf, then that can be done regardless of the mechanism we choose. Or to put it another way: as you say, it may be more *complicated* to add an RPC-based path to libbpf, but it's not fundamentally impossible; it's just another technical problem to be solved. And if that added complexity buys us better security properties, maybe that is a good trade-off. At least we shouldn't dismiss it out of hand. -Toke
On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > Hi Andrii, > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > ... > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > Is there a reason for coupling this only with the userns? > > There is no coupling. Without userns it is at least possible to grant > CAP_BPF and other capabilities from init ns. With user namespace that > becomes impossible. But these are not the same: delegating a full cap vs delegating an fd mask? One can argue that unprivileged in the init userns is the same as privileged in a nested userns. Getting to delegate an fd in the init userns, then in nested ones, seems logical... > > The "trusted unprivileged" assumed by systemd can be in init userns? > > It doesn't have to be systemd, but yes, BPF token can be created only > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family > of commands). I'm more into getting fd delegation to work also in the init userns... I can't understand why it's not possible or doable. > > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > approach, but can be combined with LSM hooks for very fine-grained security > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > > (context), which in combination with BPF LSM would allow implementing a very > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > interest of minimizing API surface area discussions this is going to be > > > added in follow up patches, as it's not essential to the fundamental concept > > > of delegatable BPF token. > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > allowing multiple independent instances of them, each with its own set of > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > > > What's the use case for transfering over unix domain sockets? > > I'm not sure I understand the question. Unix domain socket > (specifically its SCM_RIGHTS ancillary message) allows to transfer > files between processes, which is one way to pass BPF object (like > prog/map/link, and now token). BPF FS is the other one. In practice > it's usually BPF FS, but there is no presumption about how file > reference is transferred. Got it. IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving userns, no? I assume so, which allows setting things up in a hierarchical way... 
If I set up the environment to lock things down, I find it strange if a received fd would allow me to do more things than what was planned when I created the environment: namespaces, mounts, etc. I think you have to add the owning userns context to the fd or "token", and on the receiving side, if the current userns is the same as, or nested within, the owning userns hierarchy, then allow the bpf operation; otherwise fail with -EACCES or something similar... > > > > Will BPF token translation happen if you cross the different namespaces? > > What does BPF token translation mean specifically? Currently it's a > very simple kernel object with refcnt and a few flags, so there is > nothing to translate? Please see the above comment about the owning userns context. > > > > If the token is pinned into different bpffs, will the token share the > > same context? > > So I was planning to allow a user process creating a BPF token to > specify custom user-provided data (context). This is not in this patch > set, but is it what you are asking about? Exactly: define what you can access inside the container... this would align with Andy's suggestion that "making BPF behave sensibly in that container seems like it should also be necessary." I do agree on this. Again, I think LSM and bpf+lsm should have the final word on this too... > Regardless, pinning BPF object in BPF FS is just basically bumping a > refcnt and exposes that object in a way that can be looked up through > file system path (using bpf() syscall's BPF_OBJ_GET command). > Underlying object isn't cloned or copied, it's exactly the same object > with the same shared internal state. This is the part I also find strange: I can understand pinning a bpf program, map, etc., but an fd that gives some access rights should be part of the filesystem from the start; I don't get the extra pinning. Also, it seems bpffs is per superblock mount, so why not allow the privileged side to mount bpffs with the corresponding information? Then the privileged side can open the fd, set it up and pass it down the line when executing the main program, or even allow the unprivileged side to open it on bpffs under some restrictive conditions. Then it would be the business of the privileged side to bind mount bpffs in some other places, share it, etc. Having the fd or "token" that gives access rights pinned in two separate bpffs mounts seems too much: it crosses namespaces (mount, userns, etc.) and environments set up by the privileged side... I would just make it per bpffs mount and that's it, nothing more. If a program wants to bind mount it somewhere else, then it's not a bpf problem.
> On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote: > > This patch set introduces new BPF object, BPF token, which allows to delegate > a subset of BPF functionality from privileged system-wide daemon (e.g., > systemd or any other container manager) to a *trusted* unprivileged > application. Trust is the key here. This functionality is not about allowing > unconditional unprivileged BPF usage. Establishing trust, though, is > completely up to the discretion of respective privileged application that > would create a BPF token.

Hello! Author of bpfd[1] here.

> The main motivation for BPF token is a desire to enable containerized > BPF applications to be used together with user namespaces. This is currently > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > arbitrary memory, and it's impossible to ensure that they only read memory of > processes belonging to any given namespace. This means that it's impossible to > have namespace-aware CAP_BPF capability, and as such another mechanism to > allow safe usage of BPF functionality is necessary. BPF token and delegation > of it to a trusted unprivileged applications is such mechanism. Kernel makes > no assumption about what "trusted" constitutes in any particular case, and > it's up to specific privileged applications and their surrounding > infrastructure to decide that. What kernel provides is a set of APIs to create > and tune BPF token, and pass it around to privileged BPF commands that are > creating new BPF objects like BPF programs, BPF maps, etc.

You could do that… but the problem is created due to the pattern of having a single binary that is responsible for:

- Loading and attaching the BPF program in question
- Interacting with maps

Let’s set aside some of the other fun concerns of eBPF in containers:

- Requiring mounting of vmlinux, bpffs, traces etc…
- How fs permissions on host translate into permissions in containers

While your proposal lets you grant a subset of CAP_BPF to some other process, which I imagine could also be done with SELinux, it doesn’t stop you from needing other required permissions for attaching tracing programs in such an environment.

For example, say container A wants to attach a uprobe to a process in container B. Container A needs to be able to nsenter into container B’s pidns in order for attachment to succeed… but then what I can do with CAP_BPF is the least of my concerns since I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges much scarier than CAP_BPF in the first place.

If you move “Loading and attaching” away to somewhere else (i.e. a daemon like bpfd) then with recent kernels your container workload should be fine to run entirely unprivileged, or worst case with only CAP_BPF since all you need to do is read/write maps.

Policy control - which process can request to load programs that monitor which other processes - would happen within this system daemon and you wouldn’t need tokens.

Since it’s easy enough to do this in userspace, I’d be strongly against adding more complexity into BPF to support this use case.

> Previous attempt at addressing this very same problem ([0]) attempted to > utilize authoritative LSM approach, but was conclusively rejected by upstream > LSM maintainers.
BPF token concept is not changing anything about LSM > approach, but can be combined with LSM hooks for very fine-grained security > policy. Some ideas about making BPF token more convenient to use with LSM (in > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > 2023 presentation ([1]). E.g., an ability to specify user-provided data > (context), which in combination with BPF LSM would allow implementing a very > dynamic and fine-granular custom security policies on top of BPF token. In the > interest of minimizing API surface area discussions this is going to be > added in follow up patches, as it's not essential to the fundamental concept > of delegatable BPF token. > > It should be noted that BPF token is conceptually quite similar to the idea of > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > difference is the idea of using virtual anon_inode file to hold BPF token and > allowing multiple independent instances of them, each with its own set of > restrictions. BPF pinning solves the problem of exposing such BPF token > through file system (BPF FS, in this case) for cases where transferring FDs > over Unix domain sockets is not convenient. And also, crucially, BPF token > approach is not using any special stateful task-scoped flags. Instead, bpf() > syscall accepts token_fd parameters explicitly for each relevant BPF command. > This addresses main concerns brought up during the /dev/bpf discussion, and > fits better with overall BPF subsystem design. > > This patch set adds a basic minimum of functionality to make BPF token useful > and to discuss API and functionality. Currently only low-level libbpf APIs > support passing BPF token around, allowing to test kernel functionality, but > for the most part is not sufficient for real-world applications, which > typically use high-level libbpf APIs based on `struct bpf_object` type. This > was done with the intent to limit the size of patch set and concentrate on > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent > as a separate follow up patch set kernel support makes it upstream. > > Another part that should happen once kernel-side BPF token is established, is > a set of conventions between applications (e.g., systemd), tools (e.g., > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > at well-defined locations to allow applications take advantage of this in > automatic fashion without explicit code changes on BPF application's side. > But I'd like to postpone this discussion to after BPF token concept lands. > > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ > - Dave [1]: https://github.com/bpfd-dev/bpfd
On Mon, Jun 12, 2023 at 2:02 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > ... > > I'm not sure I understand the question. Unix domain socket > > (specifically its SCM_RIGHTS ancillary message) allows to transfer > > files between processes, which is one way to pass BPF object (like > > prog/map/link, and now token). BPF FS is the other one. In practice > > it's usually BPF FS, but there is no presumption about how file > > reference is transferred. > > Got it. > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving > userns, no ? > > I assume such which allows to set up things in a hierarchical way... > > If I set up the environment to lock things down the line, I find it > strange if a received fd would allow me to do more things than what > was planned when I created the environment: namespaces, mounts, etc > > I think you have to add the owning userns context to the fd or > "token", and on the receiving part if the current userns is the same > or a nested one of the current userns hierarchy then allow bpf > operation, otherwise fail with -EACCESS or something similar...

Andrii, to make it clear: the owning userns is the owner/creator of the bpffs mount (better this one, since you prevent the "inherit the fd and do bad things with it" cases...). Let's call it userns A, with the receiving process in userns B. When transferring the fd: if userns B == userns A, or if A is an ancestor of B, then allow things to be done with the fd token; otherwise just deny it... At least that's how I see things now, but maybe there are corner cases...
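A rough kernel-side sketch of the check being proposed here, assuming a hypothetical field that records the owning userns (A) on the token at creation time; current_in_userns() is the existing kernel helper that returns true when the current userns equals the given one or is nested anywhere below it. The struct layout is illustrative, not the patch set's actual one.

#include <linux/cred.h>
#include <linux/user_namespace.h>

/* Sketch only: 'userns' is a hypothetical field, not part of the
 * posted patches; it would be set to the creator's (or bpffs mount
 * owner's) user namespace when the token is created. */
struct bpf_token_sketch {
	struct user_namespace *userns;	/* owning userns "A" */
	/* ... refcount, allowed-operations flags, etc. ... */
};

static bool bpf_token_userns_ok(const struct bpf_token_sketch *token)
{
	/* allow if current userns "B" == A, or A is an ancestor of B;
	 * otherwise the caller would fail the bpf() command with -EACCES */
	return current_in_userns(token->userns);
}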
On Mon, Jun 12, 2023 at 2:45 PM Dave Tucker <datucker@redhat.com> wrote: > > > > > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote: > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token. > > > Hello! Author of a bpfd[1] here. > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. What kernel provides is a set of APIs to create > > and tune BPF token, and pass it around to privileged BPF commands that are > > creating new BPF objects like BPF programs, BPF maps, etc. > > You could do that… but the problem is created due to the pattern of having a > single binary that is responsible for: > > - Loading and attaching the BPF program in question > - Interacting with maps > > Let’s set aside some of the other fun concerns of eBPF in containers: > - Requiring mounting of vmlinux, bpffs, traces etc… > - How fs permissions on host translate into permissions in containers > > While your proposal lets you grant a subset of CAP_BPF to some other process, > which I imagine could also be done with SELinux, it doesn’t stop you from needing > > other required permissions for attaching tracing programs in such an > environment. > > For example, say container A wants to attach a uprobe to a process in container B. > Container A needs to be able to nsenter into container B’s pidns in order for attachment > to succeed… but then what I can do with CAP_BPF is the least of my concerns since > I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges > much scarier than CAP_BPF in the first place. > > If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd) > then with recent kernels your container workload should be fine to run entirely unprivileged, > or worst case with only CAP_BPF since all you need to do is read/write maps. > > Policy control - which process can request to load programs that monitor which other > processes - would happen within this system daemon and you wouldn’t need tokens. > > Since it’s easy enough to do this in userspace, I’d be strongly against adding more > complexity into BPF to support this usecase. 
For some cases the complexity could go the other way: bpf programs are by design small programs that can be loaded/unloaded dynamically and work on their own... easily adaptable to dynamic workloads... not all bpf is the same... Stuffing *everything* together and performing round trips between the main container and the container transferring, loading, and attaching the bpf programs raises the question: what's the advantage?
On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > >> > >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > >> > >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > >> >> > >> >> Andrii Nakryiko <andrii@kernel.org> writes: > >> >> > >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate > >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g., > >> >> > systemd or any other container manager) to a *trusted* unprivileged > >> >> > application. Trust is the key here. This functionality is not about allowing > >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is > >> >> > completely up to the discretion of respective privileged application that > >> >> > would create a BPF token. > >> >> > >> >> I am not convinced that this token-based approach is a good way to solve > >> >> this: having the delegation mechanism be one where you can basically > >> >> only grant a perpetual delegation with no way to retract it, no way to > >> >> check what exactly it's being used for, and that is transitive (can be > >> >> passed on to others with no restrictions) seems like a recipe for > >> >> disaster. I believe this was basically the point Casey was making as > >> >> well in response to v1. > >> > > >> > Most of this can be added, if we really need to. Ability to revoke BPF > >> > token is easy to implement (though of course it will apply only for > >> > subsequent operations). We can allocate ID for BPF token just like we > >> > do for BPF prog/map/link and let tools iterate and fetch information > >> > about it. As for controlling who's passing what and where, I don't > >> > think the situation is different for any other FD-based mechanism. You > >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS > >> > or BPF FS, and that application can keep doing the same to other > >> > processes. > >> > >> No, but every other fd-based mechanism is limited in scope. E.g., if you > >> pass a map fd that's one specific map that can be passed around, with a > >> token it's all operations (of a specific type) which is way broader. > > > > It's not black and white. Once you have a BPF program FD, you can > > attach it many times, for example, and cause regressions. Sure, here > > we are talking about creating multiple BPF maps or loading multiple > > BPF programs, so it's wider in scope, but still, it's not that > > fundamentally different. > > Right, but the difference is that a single BPF program is a known > entity, so even if the application you pass the fd to can attach it > multiple times, it can't make it do new things (e.g., bpf_probe_read() > stuff it is not supposed to). Whereas with bpf_token you have no such > guarantee. Sure, I'm not claiming BPF token is just like passing BPF program FD around. My point is that anything in the kernel that is representable by FD can be passed around to an unintended process through SCM_RIGHTS. And if you want to have tighter control over who's passing what, you'd probably need LSM. But it's not a requirement. With BPF token it is important to trust the application you are passing BPF token to. This is not a mechanism to just freely pass around the ability to do BPF. You do it only to applications you control. You can initiate BPF token from under CAP_SYS_ADMIN only. 
If you give CAP_SYS_ADMIN to some application that might pass BPF token to some random application, you should probably revisit the whole approach. You can do a lot of harm with that CAP_SYS_ADMIN beyond the BPF subsystem.

On the other hand, the more correct comparison would be whether to give some unprivileged application a BPF token versus giving it CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN (or the necessary subset of it). With BPF token you can narrow down exactly what types of programs and maps it can use, if at all. BPF token applies to the BPF subsystem only. With caps, you are giving that application way more power than you'd like, but that's ok in practice, because a) you need that application to do something useful with BPF, so you take that risk, and b) you normally would control that application, so you are mitigating this risk even without any LSM or something like that on top. We do the latter all the time because we have to. BPF token gives us a more well-scoped alternative.

With user namespaces, if we could grant CAP_BPF and co to use BPF, we'd do that. But we can't. BPF token at least gives us this opportunity. So while I understand your concerns in principle, I think they are a bit overblown in practice.

> >> > Ultimately, currently we have root permissions for applications that > >> > need BPF. That's already very dangerous. But just because something > >> > might be misused or abused doesn't prevent us from making a good > >> > practical use of it, right? > >> > >> That's not a given. It's always a trade-off, and if the mechanism is > >> likely to open up the system to additional risk that's not a good > >> trade-off even if it helps in some case. I basically worry that this is > >> the case here. > >> > >> > Also, there is LSM on top of all of this to override and control how > >> > the BPF subsystem is used, regardless of BPF token. It can override > >> > any of the privileges mechanism, capabilities, BPF token, whatnot. > >> > >> If this mechanism needs an LSM to be used safely, that's not incredibly > >> confidence-inspiring. Security mechanisms should fail safe, which this > >> one does not. > > > > I proposed to add authoritative LSM hooks that would selectively allow > > some of BPF operations on a case-by-case basis. This was rejected, > > claiming that the best approach is to give process privilege to do > > whatever it needs to do and then restrict it with LSM. > > > > Ok, if not for user namespaces, that would mean giving application > > CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN, and then restrict it > > with LSM. Except with user namespace that doesn't work. So that's > > where BPF token comes in, but allows it to do it more safely by > > allowing to coarsely tune what subset of BPF operations is granted. > > And then LSM should be used to further restrict it. > > Right, I do understand the use case, my worry is that we're creating a > privilege escalation model that is really broad if it is *not* coupled > with an LSM to restrict it. Which will be the default outside of > controlled environments that really know what they are doing.

Look, you are worried that you gave some process root permissions and that process delegated a small portion of that (BPF token) to an unprivileged process, which abuses it somehow. Beyond the question of "why did you grant root permissions to something you can't trust to do the right thing", isn't there more dangerous stuff (I don't know: setuid, chmod/chown, etc.) that a root process can perform to grant an unprivileged process unintended and uncontrolled privileges? Why is BPF token the one singled out that would have to require a mandatory LSM to be installed?

> > So I dunno, maybe some way to restrict the token so it only grants > privilege if there is *also* an explicit LSM verdict on it? I guess > that's still too close to an authoritative LSM hook that it'll pass? I > do think the "explicit grant" model of an authoritative LSM is a better > fit for this kind of thing...

I proposed an authoritative LSM; it was pretty plainly rejected, and the model of "grant a lot + restrict with LSM" was suggested.

> >> I'm also worried that an LSM policy is the only way to disable the > >> ability to create a token; with this in the kernel, I suddenly have to > >> trust not only that all applications with BPF privileges will not load > >> malicious code, but also that they won't (accidentally or maliciously) > >> convey extra privileges on someone else. Seems a bit broad to have this > >> ability (to issue tokens) available to everyone with access to the bpf() > >> syscall, when (IIUC) it's only a single daemon in the system that would > >> legitimately do this in the deployment you're envisioning. > > > > Note, any process with real CAP_SYS_ADMIN. Let's not forget that. > > > > But would you feel better if BPF_TOKEN_CREATE was guarded behind > > sysctl or Kconfig? > > Hmm, yeah, some way to make sure it's off by default would be > preferable, IMO. > > > Ultimately, worrying is fine, but there are real problems that need to > > be solved. And not doing anything isn't a great option. > > Right, it would be good if some of the security folks could chime in > with their view of how this is best achieved without running into any of > the "bad ideas" they are opposed to.

Agreed.

> > >> >> If the goal is to enable a privileged application (such as a container > >> >> manager) to grant another unprivileged application the permission to > >> >> perform certain bpf() operations, why not just proxy the operations > >> >> themselves over some RPC mechanism? That way the granting application > >> > > >> > It's explicitly what we *do not* want to do, as it is a major problem > >> > and logistical complication. Every single application will have to be > >> > rewritten to use such a special daemon/service and its API, which is > >> > completely different from bpf() syscall API. It invalidates the use of > >> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is > >> > incompatible with this. It's a nightmare. I've got feedback from > >> > people in another company that do have BPF service with just a tiny > >> > subset of BPF functionality delegated to such service, and it's a pain > >> > and definitely not a preferred way to do things. > >> > >> But weren't you proposing that libbpf should be able to transparently > >> look for tokens and load them without any application changes? Why can't > >> libbpf be taught to use an RPC socket in a similar fashion?
> >> It basically boils down to something like:
> >>
> >> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
> >>                           unsigned int size)
> >> {
> >>         if (!stat("/run/bpf.sock")) {
> >>                 sock = open_socket("/run/bpf.sock");
> >>                 write_to(sock, cmd, attr, size);
> >>                 return read_response(sock);
> >>         } else {
> >>                 return syscall(__NR_bpf, cmd, attr, size);
> >>         }
> >> }

> > Well, for one, Meta we'll use its own Thrift-based RPC protocol. > > Google might use something internal for them using GRPC, someone else > > would want to utilize systemd, yet others will use yet another > > implementation. RPC introduces more failure modes. While with syscall > > we know that operation either succeeded or failed, with RPC we'll have > > to deal with "maybe", if it was some communication error. > > > > Let's not trivialize adding, using, and supporting the RPC version of > > bpf() syscall. > > I am not trying to trivialise it, I am well aware that it is more > complicated in practice than just adding a wrapper like the above. I am > just arguing with your point that "all applications need to change, so > we can't do RPC". Any mechanism we add along these lines will require > application changes, including the BPF token. And if the way we're going

Well, it depends on what kinds of changes we are talking about. E.g., in the most explicit case, it would be something like:

  int token_fd = bpf_token_get("/sys/fs/bpf/my_granted_token");
  if (token_fd < 0)
          /* we can bail out or just assume no token */

  LIBBPF_OPTS(bpf_object_open_opts, .token_fd = token_fd);
  struct my_skel *skel = my_skel__open_opts(&opts);

That's literally it. And if we have some convention that libbpf will try to open, say, /sys/fs/bpf/.token automatically, there will be zero code changes. And I'm not simplifying this.

> to avoid that is by baking the support into libbpf, then that can be > done regardless of the mechanism we choose. > > Or to put it another way: as you say it may be more *complicated* to add > an RPC-based path to libbpf, but it's not fundamentally impossible, it's > just another technical problem to be solved. And if that added > complexity buys us better security properties, maybe that is a good > trade-off. At least we shouldn't dismiss it out of hand.

You are oversimplifying this. There is a huge difference between syscall and RPC interfaces.

The former (syscall approach) will error out only on invalid inputs (and, highly improbably, if the kernel runs out of memory, which means your app is dead anyway). You don't code against a syscall interface with the expectation that it can fail at any point and that you should be able to recover from that.

With RPC you have to bake into your application that any RPC can fail transiently, for many reasons: the service could be down, restarted, slow, etc. This changes *everything* in how you develop the application, how you write code, how you handle errors, how you monitor stuff. Everything.

It's impossible to just swap out a syscall for RPC transparently without introducing horrible consequences. This is not some technical difficulty, it's a fundamental impedance mismatch. One of the early distributed-systems mistakes was to pretend that remote procedure calls could be reliable, that errors are rare, and that RPCs could be assumed to behave like syscalls or local in-process APIs. It has been recognized many times over how bad such approaches were. It's outside the scope of this discussion to go into more details. Suffice it to say that libbpf is not going to pretend that the syscall and some RPC are equivalent and transparently interchangeable.

And then, even if we were crazy enough to do the above, there is no way everyone will settle on one single implementation and/or RPC protocol and API such that libbpf could implement it in its upstream version. Big companies most probably will go with their own internal ones that would give them better integration with internal infrastructure, better observability, etc. And even in open source there probably won't be one single implementation everyone will be happy with.

> > -Toke
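To put the inline snippet above in fuller context, here is a hedged sketch of what the explicit opt-in path could look like inside an application, using the bpf_token_get() helper and .token_fd open option proposed in this thread (neither is in upstream libbpf at this point); my_skel is a placeholder skeleton name.

#include <bpf/libbpf.h>
#include "my_skel.skel.h"	/* placeholder generated skeleton */

static struct my_skel *open_with_optional_token(void)
{
	/* proposed API from this thread, not (yet) upstream */
	int token_fd = bpf_token_get("/sys/fs/bpf/my_granted_token");

	LIBBPF_OPTS(bpf_object_open_opts, opts);
	if (token_fd >= 0)
		opts.token_fd = token_fd;	/* proposed open option */
	/* else: no token granted; proceed and rely on caps, if any */

	return my_skel__open_opts(&opts);
}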
On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote: > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > Hi Andrii, > > > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > ... > > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > Is there a reason for coupling this only with the userns? > > > > There is no coupling. Without userns it is at least possible to grant > > CAP_BPF and other capabilities from init ns. With user namespace that > > becomes impossible. > > But these are not the same: delegate full cap vs delegate an fd mask? What FD mask are we talking about here? I don't recall us talking about any FD masks, so this one is a bit confusing without more context. > > One can argue unprivileged in init userns is the same privileged in > nested userns > Getting to delegate fd in init userns, then in nested ones seems logical... Again, sorry, I'm not following. Can you please elaborate what you mean? > > > > The "trusted unprivileged" assumed by systemd can be in init userns? > > > > It doesn't have to be systemd, but yes, BPF token can be created only > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family > > of commands). > > I'm more into getting fd delegation work also in the first init userns... > > I can't understand why it's not possible or doable? > I don't know what you are proposing, as I mentioned above, so it's hard to answer this question. > > > > > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > > approach, but can be combined with LSM hooks for very fine-grained security > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > > > (context), which in combination with BPF LSM would allow implementing a very > > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > > interest of minimizing API surface area discussions this is going to be > > > > added in follow up patches, as it's not essential to the fundamental concept > > > > of delegatable BPF token. > > > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > > allowing multiple independent instances of them, each with its own set of > > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > > > > > What's the use case for transfering over unix domain sockets? > > > > I'm not sure I understand the question. 
Unix domain socket > > (specifically its SCM_RIGHTS ancillary message) allows to transfer > > files between processes, which is one way to pass BPF object (like > > prog/map/link, and now token). BPF FS is the other one. In practice > > it's usually BPF FS, but there is no presumption about how file > > reference is transferred. > > Got it. > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving > userns, no ? > > I assume such which allows to set up things in a hierarchical way... > > If I set up the environment to lock things down the line, I find it > strange if a received fd would allow me to do more things than what > was planned when I created the environment: namespaces, mounts, etc > > I think you have to add the owning userns context to the fd or > "token", and on the receiving part if the current userns is the same > or a nested one of the current userns hierarchy then allow bpf > operation, otherwise fail with -EACCESS or something similar... > I think I mentioned problems with namespacing BPF itself. It's just fundamentally impossible due to a system-wide nature of BPF. So we can pretend to somehow attach/restrict BPF token to some namespace, but it still allows BPF programs to peek at any kernel state or user-space process. So I'd rather us not pretend we can do something that we actually cannot enforce. > > > > > > > Will BPF token translation happen if you cross the different namespaces? > > > > What does BPF token translation mean specifically? Currently it's a > > very simple kernel object with refcnt and a few flags, so there is > > nothing to translate? > > Please see above comment about the owning userns context > > > > > > > If the token is pinned into different bpffs, will the token share the > > > same context? > > > > So I was planning to allow a user process creating a BPF token to > > specify custom user-provided data (context). This is not in this patch > > set, but is it what you are asking about? > > Exactly, define what you can access inside the container... this would > align with Andy's suggestion "making BPF behave sensibly in that > container seems like it should also be necessary." I do agree on this. > I don't know what Andy's suggestion actually is (as I honestly can't make out what your proposal is, sorry; you guys are not making it easy on me by being pretty vague and nonspecific). But see above about pretending to contain BPF within a container. There is no such thing. BPF is system-wide. > Again I think LSM and bpf+lsm should have the final word on this too... > Yes, I also think that having LSM on top is beneficial. But not a strict requirement and more or less orthogonal. > > > Regardless, pinning BPF object in BPF FS is just basically bumping a > > refcnt and exposes that object in a way that can be looked up through > > file system path (using bpf() syscall's BPF_OBJ_GET command). > > Underlying object isn't cloned or copied, it's exactly the same object > > with the same shared internal state. > > This is the part I also find strange, I can understand pinning a bpf > program, map, etc, but an fd that gives some access rights should be > part of the filesystem from the start, I don't get the extra pinning. BPF pinning of BPF token is optional. Everything still works without any BPF FS mount at all. It's an FD, BPF FS is just one of the means to pass FD to another process. I actually don't see why coupling BPF FS and BPF token is simpler. Now, BPF token is a kernel object, with its own state. It has an FD associated with it. 
It can be passed around and provided as an argument to bpf() syscall. In that sense it's just like BPF prog/map/link, just another BPF object. > Also it seems bpffs is per superblock mount so why not allow > privileged to mount bpffs with the corresponding information, then > privileged can open the fd, set it up and pass it down the line when > executing the main program? or even allow unprivileged to open it on > bpffs with some restrictive conditions? > > Then it would be the business of the privileged to bind mount bpffs in > some other places, share it, etc How is this fundamentally different from BPF token pinning by *privileged* process? Except we are not conflating BPF FS as a way to pin/get many different BPF objects with BPF token itself. In both cases it's up to privileged process to set up sharing of BPF token appropriately. > > Having the fd or "token" that gives access rights pinned in two > separate bpffs mounts seems too much, it crosses namespaces (mount, > userns etc), environments setup by privileged... See above, there is nothing namespaceable about BPF itself, and BPF token as well. If some production setup benefits from pinning one BPF token in multiple places, I don't see the problem with that. > > I would just make it per bpffs mount and that's it, nothing more. If a > program wants to bind mount it somewhere else then it's not a bpf > problem. And if some application wants to pin BPF token, why would that be BPF subsystem's problem as well?
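To make the pinning mechanics concrete, here is a minimal sketch using libbpf's existing wrappers around the BPF_OBJ_PIN and BPF_OBJ_GET commands; the path is illustrative. Per the explanation above, the second call resolves to the very same kernel object (bumping its refcount), not a copy.

#include <bpf/bpf.h>

/* Privileged side: expose an already-created object fd (a map, link,
 * or, in this proposal, a token) at a path inside a BPF FS mount. */
static int expose_obj(int obj_fd)
{
	return bpf_obj_pin(obj_fd, "/sys/fs/bpf/delegated-token");
}

/* Recipient side: look the object up by path; on success this returns
 * a new fd referring to the same underlying object and shared state. */
static int acquire_obj(void)
{
	return bpf_obj_get("/sys/fs/bpf/delegated-token");
}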
On Mon, Jun 12, 2023 at 5:45 AM Dave Tucker <datucker@redhat.com> wrote: > > > > > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote: > > > > This patch set introduces new BPF object, BPF token, which allows to delegate > > a subset of BPF functionality from privileged system-wide daemon (e.g., > > systemd or any other container manager) to a *trusted* unprivileged > > application. Trust is the key here. This functionality is not about allowing > > unconditional unprivileged BPF usage. Establishing trust, though, is > > completely up to the discretion of respective privileged application that > > would create a BPF token. > > > Hello! Author of a bpfd[1] here. > > > The main motivation for BPF token is a desire to enable containerized > > BPF applications to be used together with user namespaces. This is currently > > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced > > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF > > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read > > arbitrary memory, and it's impossible to ensure that they only read memory of > > processes belonging to any given namespace. This means that it's impossible to > > have namespace-aware CAP_BPF capability, and as such another mechanism to > > allow safe usage of BPF functionality is necessary. BPF token and delegation > > of it to a trusted unprivileged applications is such mechanism. Kernel makes > > no assumption about what "trusted" constitutes in any particular case, and > > it's up to specific privileged applications and their surrounding > > infrastructure to decide that. What kernel provides is a set of APIs to create > > and tune BPF token, and pass it around to privileged BPF commands that are > > creating new BPF objects like BPF programs, BPF maps, etc. > > You could do that… but the problem is created due to the pattern of having a > single binary that is responsible for: > > - Loading and attaching the BPF program in question > - Interacting with maps It is a very desirable property to couple and deploy user process and its BPF programs/maps together and manage their lifecycle directly. All of Meta's production applications are using this model. This allows for a simple and reliable versioning story. This allows using BPF skeleton and BPF global variables naturally. It makes it simple and easy to develop, debug, version, deploy, monitor BPF applications. It also couples BPF program attachment (link) with lifetime of the user space process. So if it crashes or restarts without clean detachment, we don't end up with orphaned BPF programs and maps. We've had pretty bad issues due to such orphaned programs, and that's why the whole BPF link concept was formalized. So it's actually a desirable approach in a real-world production setup. > > Let’s set aside some of the other fun concerns of eBPF in containers: > - Requiring mounting of vmlinux, bpffs, traces etc… > - How fs permissions on host translate into permissions in containers > > While your proposal lets you grant a subset of CAP_BPF to some other process, > which I imagine could also be done with SELinux, it doesn’t stop you from needing > other required permissions for attaching tracing programs in such an > environment. In some cases yes, there are other parts of the kernel that would require some more work to be able to be used. But a lot of things are possible within bpf() syscall already, including tracing stuff. 
> > For example, say container A wants to attach a uprobe to a process in container B. > Container A needs to be able to nsenter into container B’s pidns in order for attachment > to succeed… but then what I can do with CAP_BPF is the least of my concerns since > I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges > much scarier than CAP_BPF in the first place.

You'd wager, or you know for sure? I haven't tried, so I won't make any claims. I do know, though, that our system-wide profiling agent (not running under a user namespace) can attach to and profile namespaced applications running inside containers without any nsenter. But again, uprobe'ing some other container is just one of the possible use cases. Even if some scenarios would require more stuff beyond the BPF token, that doesn't invalidate the need for and usefulness of the BPF token.

> > If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd) > then with recent kernels your container workload should be fine to run entirely unprivileged, > or worst case with only CAP_BPF since all you need to do is read/write maps.

Except we explicitly want to avoid the need for some external entity loading BPF programs on our behalf, as I explained in my replies to Toke.

> > Policy control - which process can request to load programs that monitor which other > processes - would happen within this system daemon and you wouldn’t need tokens.

And we can do the same through controlling which containers/services are issued BPF tokens, and in addition to that we could employ LSM for more dynamic and fine-granular control. Doing this through a centralized daemon is one way of doing it, but it's not universally the better way.

> > Since it’s easy enough to do this in userspace, I’d be strongly against adding more > complexity into BPF to support this usecase.

I appreciate you trying to get more customers for bpfd, there is nothing wrong with that. But this approach has major (good and bad) implications and is not the most appropriate solution in a lot of cases and setups.

As for complexity: if you look at the code, you'll see that it's a completely optional feature as far as BPF UAPI goes, so your customers won't need to care about the BPF token's existence if they are happy using the bpfd solution.

> > > Previous attempt at addressing this very same problem ([0]) attempted to > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > LSM maintainers. BPF token concept is not changing anything about LSM > > approach, but can be combined with LSM hooks for very fine-grained security > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > (context), which in combination with BPF LSM would allow implementing a very > > dynamic and fine-granular custom security policies on top of BPF token. In the > > interest of minimizing API surface area discussions this is going to be > > added in follow up patches, as it's not essential to the fundamental concept > > of delegatable BPF token. > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > /dev/bpf device file, proposed by Song a while ago ([2]).
The biggest > > difference is the idea of using virtual anon_inode file to hold BPF token and > > allowing multiple independent instances of them, each with its own set of > > restrictions. BPF pinning solves the problem of exposing such BPF token > > through file system (BPF FS, in this case) for cases where transferring FDs > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > syscall accepts token_fd parameters explicitly for each relevant BPF command. > > This addresses main concerns brought up during the /dev/bpf discussion, and > > fits better with overall BPF subsystem design. > > > > This patch set adds a basic minimum of functionality to make BPF token useful > > and to discuss API and functionality. Currently only low-level libbpf APIs > > support passing BPF token around, allowing to test kernel functionality, but > > for the most part is not sufficient for real-world applications, which > > typically use high-level libbpf APIs based on `struct bpf_object` type. This > > was done with the intent to limit the size of patch set and concentrate on > > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent > > as a separate follow up patch set kernel support makes it upstream. > > > > Another part that should happen once kernel-side BPF token is established, is > > a set of conventions between applications (e.g., systemd), tools (e.g., > > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS > > at well-defined locations to allow applications take advantage of this in > > automatic fashion without explicit code changes on BPF application's side. > > But I'd like to postpone this discussion to after BPF token concept lands. > > > > [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/ > > [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf > > [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/ > > > > - Dave > > [1]: https://github.com/bpfd-dev/bpfd >
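For a sense of what "narrowing down" a token to specific commands and object types could look like at the API level, here is an entirely hypothetical sketch of a BPF_TOKEN_CREATE call with per-command and per-type allowlists; the command number and every field name below are invented for illustration and are not the patch set's actual UAPI.

#include <linux/bpf.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef BPF_TOKEN_CREATE
#define BPF_TOKEN_CREATE 36	/* invented value, for illustration only */
#endif

/* Hypothetical attribute layout; in a real UAPI this would be a new
 * member of union bpf_attr rather than a standalone struct. */
struct token_create_attr {
	uint64_t allowed_cmds;		/* bitmask of permitted bpf() commands */
	uint64_t allowed_map_types;	/* bitmask of permitted map types */
	uint64_t allowed_prog_types;	/* bitmask of permitted program types */
};

static int token_create_sketch(void)
{
	struct token_create_attr attr = {
		.allowed_cmds = (1ULL << BPF_MAP_CREATE) | (1ULL << BPF_PROG_LOAD),
		.allowed_map_types = 1ULL << BPF_MAP_TYPE_ARRAY,
		.allowed_prog_types = 1ULL << BPF_PROG_TYPE_XDP,
	};

	/* would return a token fd on success in this hypothetical API */
	return syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
}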
On Mon, Jun 12, 2023 at 3:08 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > > <...> > > to avoid that is by baking the support into libbpf, then that can be > > done regardless of the mechanism we choose. > > > > Or to put it another way: as you say it may be more *complicated* to add > > an RPC-based path to libbpf, but it's not fundamentally impossible, it's > > just another technical problem to be solved. And if that added > > complexity buys us better security properties, maybe that is a good > > trade-off. At least we shouldn't dismiss it out of hand. > > You are oversimplifying this. There is a huge difference between > syscall and RPC and interfaces. > > The former (syscall approach) will error out only on invalid inputs > (and highly improbable if kernel runs out of memory, which means your > app is dead anyways). You don't code against syscall interface with > expectation that it can fail at any point and you should be able to > recover it. > > With RPC you have to bake in into your application that any RPC can > fail transiently, for many reasons. Service could be down, restarted, > slow, etc, etc. This changes *everything* in how you develop > application, how you write code, how you handle errors, how you > monitor stuff. Everything. > > It's impossible to just swap out syscall with RPC transparently > without introducing horrible consequences. This is not some technical > difficulty, it's a fundamental impedance mismatch. One of the early > distributed systems mistakes was to pretend that remote procedure > calls could be reliable and assume errors are rare and could be > pretended to behave like syscalls or local in-process APIs. It has > been recognized many times over how bad such approaches were. It's > outside of the scope of this discussion to go into more details. > Suffice it to say that libbpf is not going to pretend that syscall and > some RPC are equivalent and can be interchangeable in a transparent > way. > > And then, even if we were crazy enough to do the above, there is no > way everyone will settle on one single implementation and/or RPC > protocol and API such that libbpf could implement it in its upstream > version. Big companies most probably will go with their own internal > ones that would give them better integration with internal > infrastructure, better overvability, etc. And even in open-source > there probably won't be one single implementation everyone will be > happy with. >

Hello Toke and Andrii,

I agree with Andrii here. In Google, we have several years of experience building and using a BPF RPC service, to which we delegate BPF operations. From our experience, the RPC approach is quite limiting and becomes impractical for many BPF use cases. For programs that do not require much user interaction, it works just fine: the service just loads and attaches the programs, that's all. The problem is programs that require a lot of user interaction, for example the ones doing observability, which may often read maps or poll on the bpf ringbuf. Overhead and reliability of RPC is one concern. Another problem is BPF operations based on mmap, for example directly updating/reading BPF global variables as used in skeletons. We still haven't figured out how to fully support bpf skeletons, and we also haven't figured out how to support the BPF ringbuf over RPC. There are also problems maintaining this service to catch up with new features in libbpf.

Anyway, I think the syscall interface has been heavily baked into libbpf and the bpf kernel interfaces today. There are many BPF use cases where delegating all BPF operations to a service can't work well. IMHO, to achieve a good balance between flexibility and security, some abstraction that conveys controlled trust from privileged to unprivileged is necessary. The idea of BPF token makes sense to me. With a token, the libbpf interface requires only minimal change; unprivileged users can call libbpf and the bpf syscall natively, which wins on efficiency and means less maintenance burden for libbpf developers.

Thanks, Hao
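To illustrate the point about mmap-based operations: skeleton global variables are backed by a map that libbpf mmap()s into the process, so reads and writes are plain memory accesses with no bpf() syscall for an RPC proxy to intercept. A minimal sketch (my_skel and its event_count global are placeholders):

#include "my_skel.skel.h"	/* placeholder generated skeleton */

/* Direct store into a BPF global variable through the mmap'ed .bss
 * region; nothing here crosses the syscall boundary, which is why
 * this pattern is hard to forward through an RPC service. */
static void bump_counter(struct my_skel *skel)
{
	skel->bss->event_count += 1;	/* placeholder global variable */
}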
On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > > > Hi Andrii, > > > > > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > ... > > > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > > > Is there a reason for coupling this only with the userns? > > > > > > There is no coupling. Without userns it is at least possible to grant > > > CAP_BPF and other capabilities from init ns. With user namespace that > > > becomes impossible. > > > > But these are not the same: delegate full cap vs delegate an fd mask? > > What FD mask are we talking about here? I don't recall us talking > about any FD masks, so this one is a bit confusing without more > context. Ah err, sorry yes referring to fd token (which I assumed is a mask of allowed operations or something like that). So I want the possibility to delegate the fd token in the init userns. > > > > One can argue unprivileged in init userns is the same privileged in > > nested userns > > Getting to delegate fd in init userns, then in nested ones seems logical... > > Again, sorry, I'm not following. Can you please elaborate what you mean? I mean can we use the fd token in the init user namespace too? not only in the nested user namespaces but in the first one? Sorry I didn't check the code. > > > > > > The "trusted unprivileged" assumed by systemd can be in init userns? > > > > > > It doesn't have to be systemd, but yes, BPF token can be created only > > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions > > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family > > > of commands). > > > > I'm more into getting fd delegation work also in the first init userns... > > > > I can't understand why it's not possible or doable? > > > > I don't know what you are proposing, as I mentioned above, so it's > hard to answer this question. > > > > > > > > > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > > > approach, but can be combined with LSM hooks for very fine-grained security > > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > > > > (context), which in combination with BPF LSM would allow implementing a very > > > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > > > interest of minimizing API surface area discussions this is going to be > > > > > added in follow up patches, as it's not essential to the fundamental concept > > > > > of delegatable BPF token. > > > > > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > > > /dev/bpf device file, proposed by Song a while ago ([2]). 
The biggest > > > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > > > allowing multiple independent instances of them, each with its own set of > > > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > > > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > > > > > > > What's the use case for transfering over unix domain sockets? > > > > > > I'm not sure I understand the question. Unix domain socket > > > (specifically its SCM_RIGHTS ancillary message) allows to transfer > > > files between processes, which is one way to pass BPF object (like > > > prog/map/link, and now token). BPF FS is the other one. In practice > > > it's usually BPF FS, but there is no presumption about how file > > > reference is transferred. > > > > Got it. > > > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving > > userns, no ? > > > > I assume such which allows to set up things in a hierarchical way... > > > > If I set up the environment to lock things down the line, I find it > > strange if a received fd would allow me to do more things than what > > was planned when I created the environment: namespaces, mounts, etc > > > > I think you have to add the owning userns context to the fd or > > "token", and on the receiving part if the current userns is the same > > or a nested one of the current userns hierarchy then allow bpf > > operation, otherwise fail with -EACCESS or something similar... > > > > I think I mentioned problems with namespacing BPF itself. It's just > fundamentally impossible due to a system-wide nature of BPF. So we can > pretend to somehow attach/restrict BPF token to some namespace, but it > still allows BPF programs to peek at any kernel state or user-space > process.

I'm not referring to namespacing BPF, but to the same token being able to fly between containers... more or less the problems mentioned by Casey: https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823

I think that a token or the fd should be part of the bpffs and should not be shared between containers or cross namespaces by default without control... hence the suggested protection: https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd

> So I'd rather us not pretend we can do something that we actually > cannot enforce.

Actually, it is to protect against accidental token sharing or abuse... so completely different things.

> > > > > Will BPF token translation happen if you cross the different namespaces? > > > > > > What does BPF token translation mean specifically? Currently it's a > > > very simple kernel object with refcnt and a few flags, so there is > > > nothing to translate? > > > > Please see above comment about the owning userns context > > > > > > > > > > If the token is pinned into different bpffs, will the token share the > > > same context? > > > > > > So I was planning to allow a user process creating a BPF token to > > > specify custom user-provided data (context). This is not in this patch > > > set, but is it what you are asking about? > > > > Exactly, define what you can access inside the container...
this would > > align with Andy's suggestion "making BPF behave sensibly in that > > container seems like it should also be necessary." I do agree on this. > > > > I don't know what Andy's suggestion actually is (as I honestly can't > make out what your proposal is, sorry; you guys are not making it easy > on me by being pretty vague and nonspecific). But see above about > pretending to contain BPF within a container. There is no such thing. > BPF is system-wide.

Sorry about that. To put it quickly: you may restrict the types of bpf programs; you may disable or nop probes if they are running without a process context, if the triggered probe is owned by root or by a specific uid, or if the process is under a specific cgroup hierarchy, etc... Are the above possible?

> > Again I think LSM and bpf+lsm should have the final word on this too... > > Yes, I also think that having LSM on top is beneficial. But not a > strict requirement and more or less orthogonal.

I do think there should be LSM hooks to tighten this, as LSMs have more context outside of BPF...

> > > Regardless, pinning BPF object in BPF FS is just basically bumping a > > > refcnt and exposes that object in a way that can be looked up through > > > file system path (using bpf() syscall's BPF_OBJ_GET command). > > > Underlying object isn't cloned or copied, it's exactly the same object > > > with the same shared internal state. > > > > This is the part I also find strange, I can understand pinning a bpf > > program, map, etc, but an fd that gives some access rights should be > > part of the filesystem from the start, I don't get the extra pinning. > > BPF pinning of BPF token is optional. Everything still works without > any BPF FS mount at all. It's an FD, BPF FS is just one of the means > to pass FD to another process. I actually don't see why coupling BPF > FS and BPF token is simpler.

I think it's better the other way around: since bpffs is per superblock and a separate mount, it is already solved; you just get that special fd from the fs and pass it...

> Now, BPF token is a kernel object, with its own state. It has an FD > associated with it. It can be passed around and provided as an > argument to bpf() syscall. In that sense it's just like BPF > prog/map/link, just another BPF object. > > > Also it seems bpffs is per superblock mount so why not allow > > privileged to mount bpffs with the corresponding information, then > > privileged can open the fd, set it up and pass it down the line when > > executing the main program? or even allow unprivileged to open it on > > bpffs with some restrictive conditions? > > > > Then it would be the business of the privileged to bind mount bpffs in > > some other places, share it, etc > > How is this fundamentally different from BPF token pinning by > *privileged* process? Except we are not conflating BPF FS as a way to > pin/get many different BPF objects with BPF token itself. In both > cases it's up to privileged process to set up sharing of BPF token > appropriately.

I'm not convinced about the use case of sharing BPF tokens between containers or services... Every container or service has its own separate bpffs; what's the point of pinning a shared token created by a different container, compared to mounting a separate bpffs with an fd token prepared to be used for that specific container? Then the container/service can delegate it to child processes, etc...
but sharing between containers and crossing user namespaces, mount namespaces of such containers where bpffs is already separate in that context? I don't see the point, and it just opens the door to token misuse... > > > > > > > > Having the fd or "token" that gives access rights pinned in two > > > separate bpffs mounts seems too much, it crosses namespaces (mount, > > > userns etc), environments setup by privileged... > > > > See above, there is nothing namespaceable about BPF itself, and BPF > > token as well. If some production setup benefits from pinning one BPF > > token in multiple places, I don't see the problem with that. > > > > > > > > I would just make it per bpffs mount and that's it, nothing more. If a > > > program wants to bind mount it somewhere else then it's not a bpf > > > problem. > > > > And if some application wants to pin BPF token, why would that be BPF > > subsystem's problem as well? The credentials, capabilities, keyring, different namespaces, etc. are all attached to the owning user namespace; if the BPF subsystem goes its own way and creates a token to split up CAP_BPF without following that model, then it's definitely a BPF subsystem problem... I don't recommend that. It feels like this is going more toward a system-wide approach to opening up BPF functionality, which ultimately clashes with the argument: delegate a subset of BPF functionality to a *trusted* unprivileged application. My reading of delegation is within a container/service hierarchy, nothing more.
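For concreteness, the ownership check being suggested here might look like the following kernel-side sketch. It is purely illustrative and not from the patch set: the bpf_token struct and both helpers are hypothetical, while get_user_ns(), current_user_ns(), and current_in_userns() are existing kernel primitives (and the error the kernel would return is spelled -EACCES):

/* Illustrative sketch only: bind a token to the userns it was created
 * in and reject use from outside that userns hierarchy. The struct and
 * function names are hypothetical, not taken from the patch set. */
#include <linux/cred.h>
#include <linux/errno.h>
#include <linux/user_namespace.h>

struct bpf_token {
        /* ... refcount, allowed-operation flags, etc. ... */
        struct user_namespace *userns;  /* owning userns, set at create time */
};

static void bpf_token_set_owner(struct bpf_token *token)
{
        token->userns = get_user_ns(current_user_ns());
}

/* Would be called by any bpf() command that accepts a token fd. */
static int bpf_token_check_userns(const struct bpf_token *token)
{
        /* Allow only from the owning userns or one nested inside it. */
        if (!current_in_userns(token->userns))
                return -EACCES;
        return 0;
}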
On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote: > On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > > > > > Hi Andrii, > > > > > > > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > > > ... > > > > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > > > > > Is there a reason for coupling this only with the userns? > > > > > > > > There is no coupling. Without userns it is at least possible to grant > > > > CAP_BPF and other capabilities from init ns. With user namespace that > > > > becomes impossible. > > > > > > But these are not the same: delegate full cap vs delegate an fd mask? > > > > What FD mask are we talking about here? I don't recall us talking > > about any FD masks, so this one is a bit confusing without more > > context. > > Ah err, sorry yes referring to fd token (which I assumed is a mask of > allowed operations or something like that). > > So I want the possibility to delegate the fd token in the init userns. > > > > > > > One can argue unprivileged in init userns is the same privileged in > > > nested userns > > > Getting to delegate fd in init userns, then in nested ones seems logical... > > > > Again, sorry, I'm not following. Can you please elaborate what you mean? > > I mean can we use the fd token in the init user namespace too? not > only in the nested user namespaces but in the first one? Sorry I > didn't check the code. > > > > > > > > > > The "trusted unprivileged" assumed by systemd can be in init userns? > > > > > > > > It doesn't have to be systemd, but yes, BPF token can be created only > > > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions > > > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family > > > > of commands). > > > > > > I'm more into getting fd delegation work also in the first init userns... > > > > > > I can't understand why it's not possible or doable? > > > > > > > I don't know what you are proposing, as I mentioned above, so it's > > hard to answer this question. > > > > > > > > > > > > > > > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > > > > approach, but can be combined with LSM hooks for very fine-grained security > > > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data > > > > > > (context), which in combination with BPF LSM would allow implementing a very > > > > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > > > > interest of minimizing API surface area discussions this is going to be > > > > > > added in follow up patches, as it's not essential to the fundamental concept > > > > > > of delegatable BPF token. 
> > > > > > > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > > > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > > > > allowing multiple independent instances of them, each with its own set of > > > > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > > > > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > > > > > > > > > What's the use case for transfering over unix domain sockets? > > > > > > > > I'm not sure I understand the question. Unix domain socket > > > > (specifically its SCM_RIGHTS ancillary message) allows to transfer > > > > files between processes, which is one way to pass BPF object (like > > > > prog/map/link, and now token). BPF FS is the other one. In practice > > > > it's usually BPF FS, but there is no presumption about how file > > > > reference is transferred. > > > > > > Got it. > > > > > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving > > > userns, no ? > > > > > > I assume such which allows to set up things in a hierarchical way... > > > > > > If I set up the environment to lock things down the line, I find it > > > strange if a received fd would allow me to do more things than what > > > was planned when I created the environment: namespaces, mounts, etc > > > > > > I think you have to add the owning userns context to the fd or > > > "token", and on the receiving part if the current userns is the same > > > or a nested one of the current userns hierarchy then allow bpf > > > operation, otherwise fail with -EACCESS or something similar... > > > > > > > I think I mentioned problems with namespacing BPF itself. It's just > > fundamentally impossible due to a system-wide nature of BPF. So we can > > pretend to somehow attach/restrict BPF token to some namespace, but it > > still allows BPF programs to peek at any kernel state or user-space > > process. > > I'm not referring to namespacing BPF, but about the same token that > can fly between containers... > More or less problems mentioned by Casey > https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823 > > I think that a token or the fd should be part of the bpffs and should > not be shared between containers or crosse namespaces by default > without control... hence the suggested protection: > https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd > > > > So I'd rather us not pretend we can do something that we actually > > cannot enforce. > > Actually it is to protect against accidental token sharing or abuse... > so completely different things. > > > > > > > > > > > > > > > Will BPF token translation happen if you cross the different namespaces? > > > > > > > > What does BPF token translation mean specifically? Currently it's a > > > > very simple kernel object with refcnt and a few flags, so there is > > > > nothing to translate? > > > > > > Please see above comment about the owning userns context > > > > > > > > > > > > > If the token is pinned into different bpffs, will the token share the > > > > > same context? 
> > > > > > > > So I was planning to allow a user process creating a BPF token to > > > > specify custom user-provided data (context). This is not in this patch > > > > set, but is it what you are asking about? > > > > > > Exactly, define what you can access inside the container... this would > > > align with Andy's suggestion "making BPF behave sensibly in that > > > container seems like it should also be necessary." I do agree on this. > > > > > > > I don't know what Andy's suggestion actually is (as I honestly can't > > make out what your proposal is, sorry; you guys are not making it easy > > on me by being pretty vague and nonspecific). But see above about > > pretending to contain BPF within a container. There is no such thing. > > BPF is system-wide. > > Sorry about that, I can quickly put: you may restrict types of bpf > programs, you may disable or nop probes if they are running without a > process context, if the triggered probe is owned by root by specific > uid? if the process is under a specific cgroup hierarchy etc... Are > the above possible? > > > > > Again I think LSM and bpf+lsm should have the final word on this too... > > > > > > > Yes, I also think that having LSM on top is beneficial. But not a > > strict requirement and more or less orthogonal. > > I do think there should be LSM hooks to tighten this, as LSMs have > more context outside of BPF... > > > > > > > > > Regardless, pinning BPF object in BPF FS is just basically bumping a > > > > refcnt and exposes that object in a way that can be looked up through > > > > file system path (using bpf() syscall's BPF_OBJ_GET command). > > > > Underlying object isn't cloned or copied, it's exactly the same object > > > > with the same shared internal state. > > > > > > This is the part I also find strange, I can understand pinning a bpf > > > program, map, etc, but an fd that gives some access rights should be > > > part of the filesystem from the start, I don't get the extra pinning. > > > > BPF pinning of BPF token is optional. Everything still works without > > any BPF FS mount at all. It's an FD, BPF FS is just one of the means > > to pass FD to another process. I actually don't see why coupling BPF > > FS and BPF token is simpler. > > I think it's better the other way around since bpffs is per super > block and separate mount then it is already solved, you just get that > special fd from the fs and pass it... > > > > Now, BPF token is a kernel object, with its own state. It has an FD > > associated with it. It can be passed around and provided as an > > argument to bpf() syscall. In that sense it's just like BPF > > prog/map/link, just another BPF object. > > > > > Also it seems bpffs is per superblock mount so why not allow > > > privileged to mount bpffs with the corresponding information, then > > > privileged can open the fd, set it up and pass it down the line when > > > executing the main program? or even allow unprivileged to open it on > > > bpffs with some restrictive conditions? > > > > > > Then it would be the business of the privileged to bind mount bpffs in > > > some other places, share it, etc > > > > How is this fundamentally different from BPF token pinning by > > *privileged* process? Except we are not conflating BPF FS as a way to > > pin/get many different BPF objects with BPF token itself. In both > > cases it's up to privileged process to set up sharing of BPF token > > appropriately. > > I'm not convinced about the use case of sharing BPF tokens between > containers or services... 
> > Every container or service has its own separate bpffs, what's the > point of pinning a shared token created by a different container > compared to mounting separate bpffs with an fd token prepared to be > used for that specific container? > > Then the container/service can delegate it to child processes, etc... > but sharing between containers and crossing user namespaces, mount > namespaces of such containers where bpffs is already separate in that > context? I don't see the point, and it just opens the room to token > misuse... > > > > > > > > Having the fd or "token" that gives access rights pinned in two > > > separate bpffs mounts seems too much, it crosses namespaces (mount, > > > userns etc), environments setup by privileged... > > > > See above, there is nothing namespaceable about BPF itself, and BPF > > token as well. If some production setup benefits from pinning one BPF > > token in multiple places, I don't see the problem with that. > > > > > > > > I would just make it per bpffs mount and that's it, nothing more. If a > > > program wants to bind mount it somewhere else then it's not a bpf > > > problem. > > > > And if some application wants to pin BPF token, why would that be BPF > > subsystem's problem as well? > > The credentials, capabilities, keyring, different namespaces, etc are > all attached to the owning user namespace, if the BPF subsystem goes > its own way and creates a token to split up CAP_BPF without following > that model, then it's definitely a BPF subsystem problem... I don't > recommend that. > > Feels it's going more of a system-wide approach opening BPF > functionality where ultimately it clashes with the argument: delegate > a subset of BPF functionality to a *trusted* unprivileged application. > My reading of delegation is within a container/service hierarchy > nothing more. You're making the exact arguments that Lennart, Aleksa, and I have been making in the LSFMM presentation about this topic. It's even recorded: https://youtu.be/4CCRTWEZLpw?t=1546 So we fully agree with you here.
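For reference, the SCM_RIGHTS transfer that this sharing debate keeps returning to is plain userspace plumbing; a minimal sketch using only standard socket APIs, nothing BPF-specific:

/* Minimal sketch of SCM_RIGHTS fd passing, the mechanism discussed in
 * this thread. It works for any fd: a BPF map, prog, link, or token. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_fd(int sock, int fd)
{
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
        struct msghdr msg = {
                .msg_iov = &iov,
                .msg_iovlen = 1,
                .msg_control = u.buf,
                .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The receiver gets a fresh fd number referring to the same underlying file, with no record of where it came from, which is precisely why the thread worries about a token quietly crossing container boundaries.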
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: >> >> >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> >> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: >> >> >> >> >> >> Andrii Nakryiko <andrii@kernel.org> writes: >> >> >> >> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate >> >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g., >> >> >> > systemd or any other container manager) to a *trusted* unprivileged >> >> >> > application. Trust is the key here. This functionality is not about allowing >> >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is >> >> >> > completely up to the discretion of respective privileged application that >> >> >> > would create a BPF token. >> >> >> >> >> >> I am not convinced that this token-based approach is a good way to solve >> >> >> this: having the delegation mechanism be one where you can basically >> >> >> only grant a perpetual delegation with no way to retract it, no way to >> >> >> check what exactly it's being used for, and that is transitive (can be >> >> >> passed on to others with no restrictions) seems like a recipe for >> >> >> disaster. I believe this was basically the point Casey was making as >> >> >> well in response to v1. >> >> > >> >> > Most of this can be added, if we really need to. Ability to revoke BPF >> >> > token is easy to implement (though of course it will apply only for >> >> > subsequent operations). We can allocate ID for BPF token just like we >> >> > do for BPF prog/map/link and let tools iterate and fetch information >> >> > about it. As for controlling who's passing what and where, I don't >> >> > think the situation is different for any other FD-based mechanism. You >> >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS >> >> > or BPF FS, and that application can keep doing the same to other >> >> > processes. >> >> >> >> No, but every other fd-based mechanism is limited in scope. E.g., if you >> >> pass a map fd that's one specific map that can be passed around, with a >> >> token it's all operations (of a specific type) which is way broader. >> > >> > It's not black and white. Once you have a BPF program FD, you can >> > attach it many times, for example, and cause regressions. Sure, here >> > we are talking about creating multiple BPF maps or loading multiple >> > BPF programs, so it's wider in scope, but still, it's not that >> > fundamentally different. >> >> Right, but the difference is that a single BPF program is a known >> entity, so even if the application you pass the fd to can attach it >> multiple times, it can't make it do new things (e.g., bpf_probe_read() >> stuff it is not supposed to). Whereas with bpf_token you have no such >> guarantee. > > Sure, I'm not claiming BPF token is just like passing BPF program FD > around. My point is that anything in the kernel that is representable > by FD can be passed around to an unintended process through > SCM_RIGHTS. And if you want to have tighter control over who's passing > what, you'd probably need LSM. But it's not a requirement. > > With BPF token it is important to trust the application you are > passing BPF token to. 
This is not a mechanism to just freely pass > around the ability to do BPF. You do it only to applications you > control. Trust is not binary, though. "Do I trust this application to perform this specific action" is different from "do I trust this application to perform any action in the future". A security mechanism should grant the minimum privileges required to perform the operation; this token thing encourages (defaults to) broader grants, which is worrisome. > With user namespaces, if we could grant CAP_BPF and co to use BPF, > we'd do that. But we can't. BPF token at least gives us this > opportunity. If the use case is to punch holes in the user namespace isolation, I feel like that is better solved at the user namespace level than the BPF subsystem level... -Toke (Ran out of time and I'm about to leave for PTO, so dropping the RPC discussion for now)
On Tue, Jun 13, 2023 at 5:23 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > > > > > Hi Andrii, > > > > > > > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > > > ... > > > > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > > > > > Is there a reason for coupling this only with the userns? > > > > > > > > There is no coupling. Without userns it is at least possible to grant > > > > CAP_BPF and other capabilities from init ns. With user namespace that > > > > becomes impossible. > > > > > > But these are not the same: delegate full cap vs delegate an fd mask? > > > > What FD mask are we talking about here? I don't recall us talking > > about any FD masks, so this one is a bit confusing without more > > context. > > Ah err, sorry yes referring to fd token (which I assumed is a mask of > allowed operations or something like that). Ok, so your "FD masks" aka "fd token" is actually a BPF token as referenced to in this patch set, right? Thanks for clarifying! > > So I want the possibility to delegate the fd token in the init userns. > So as it is right now, BPF token has no association with userns, so yes, you can delegate it in init userns. It's just a kernel object with its own FD, which you pass to bpf() syscall operations. > > > > > > One can argue unprivileged in init userns is the same privileged in > > > nested userns > > > Getting to delegate fd in init userns, then in nested ones seems logical... > > > > Again, sorry, I'm not following. Can you please elaborate what you mean? > > I mean can we use the fd token in the init user namespace too? not > only in the nested user namespaces but in the first one? Sorry I > didn't check the code. Yes, absolutely. > > > > > > > > > > The "trusted unprivileged" assumed by systemd can be in init userns? > > > > > > > > It doesn't have to be systemd, but yes, BPF token can be created only > > > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions > > > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family > > > > of commands). > > > > > > I'm more into getting fd delegation work also in the first init userns... > > > > > > I can't understand why it's not possible or doable? > > > > > > > I don't know what you are proposing, as I mentioned above, so it's > > hard to answer this question. > > > > > > > > > > > > > > > > > > > > Previous attempt at addressing this very same problem ([0]) attempted to > > > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream > > > > > > LSM maintainers. BPF token concept is not changing anything about LSM > > > > > > approach, but can be combined with LSM hooks for very fine-grained security > > > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in > > > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF > > > > > > 2023 presentation ([1]). 
E.g., an ability to specify user-provided data > > > > > > (context), which in combination with BPF LSM would allow implementing a very > > > > > > dynamic and fine-granular custom security policies on top of BPF token. In the > > > > > > interest of minimizing API surface area discussions this is going to be > > > > > > added in follow up patches, as it's not essential to the fundamental concept > > > > > > of delegatable BPF token. > > > > > > > > > > > > It should be noted that BPF token is conceptually quite similar to the idea of > > > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest > > > > > > difference is the idea of using virtual anon_inode file to hold BPF token and > > > > > > allowing multiple independent instances of them, each with its own set of > > > > > > restrictions. BPF pinning solves the problem of exposing such BPF token > > > > > > through file system (BPF FS, in this case) for cases where transferring FDs > > > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token > > > > > > approach is not using any special stateful task-scoped flags. Instead, bpf() > > > > > > > > > > What's the use case for transfering over unix domain sockets? > > > > > > > > I'm not sure I understand the question. Unix domain socket > > > > (specifically its SCM_RIGHTS ancillary message) allows to transfer > > > > files between processes, which is one way to pass BPF object (like > > > > prog/map/link, and now token). BPF FS is the other one. In practice > > > > it's usually BPF FS, but there is no presumption about how file > > > > reference is transferred. > > > > > > Got it. > > > > > > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving > > > userns, no ? > > > > > > I assume such which allows to set up things in a hierarchical way... > > > > > > If I set up the environment to lock things down the line, I find it > > > strange if a received fd would allow me to do more things than what > > > was planned when I created the environment: namespaces, mounts, etc > > > > > > I think you have to add the owning userns context to the fd or > > > "token", and on the receiving part if the current userns is the same > > > or a nested one of the current userns hierarchy then allow bpf > > > operation, otherwise fail with -EACCESS or something similar... > > > > > > > I think I mentioned problems with namespacing BPF itself. It's just > > fundamentally impossible due to a system-wide nature of BPF. So we can > > pretend to somehow attach/restrict BPF token to some namespace, but it > > still allows BPF programs to peek at any kernel state or user-space > > process. > > I'm not referring to namespacing BPF, but about the same token that > can fly between containers... > More or less problems mentioned by Casey > https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823 > > I think that a token or the fd should be part of the bpffs and should > not be shared between containers or crosse namespaces by default > without control... hence the suggested protection: > https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd > Ok, cool, thanks for clarifying! I think we are getting somewhere in this discussion. It seems like you are not worried about the BPF token concept per se, rather that it's not bound to namespace and thus can be "leaked" outside of the intended container. Got it. 
This makes it more concrete to talk about, but I'll reply in the email to Christian, to keep my reply in one place. > > > So I'd rather us not pretend we can do something that we actually > > cannot enforce. > > Actually it is to protect against accidental token sharing or abuse... > so completely different things. > Ok, got it. I was worried that there is a perception that BPF token somehow allows sandboxing a BPF application (which is not the case), so I wanted to make sure we are not conflating things. With your latest reply it's clear that the problem that most of the discussion is revolving around is containing BPF token *sharing* within the container. > > > > > > > > > > > > > > Will BPF token translation happen if you cross the different namespaces? > > > > > > > > What does BPF token translation mean specifically? Currently it's a > > > > very simple kernel object with refcnt and a few flags, so there is > > > > nothing to translate? > > > > > > Please see above comment about the owning userns context > > > > > > > > > > > > > If the token is pinned into different bpffs, will the token share the > > > > > same context? > > > > > > > > So I was planning to allow a user process creating a BPF token to > > > > specify custom user-provided data (context). This is not in this patch > > > > set, but is it what you are asking about? > > > > > > Exactly, define what you can access inside the container... this would > > > align with Andy's suggestion "making BPF behave sensibly in that > > > container seems like it should also be necessary." I do agree on this. > > > > > > > I don't know what Andy's suggestion actually is (as I honestly can't > > make out what your proposal is, sorry; you guys are not making it easy > > on me by being pretty vague and nonspecific). But see above about > > pretending to contain BPF within a container. There is no such thing. > > BPF is system-wide. > > Sorry about that, I can quickly put: you may restrict types of bpf > programs, you may disable or nop probes if they are running without a > process context, if the triggered probe is owned by root by specific > uid? if the process is under a specific cgroup hierarchy etc... Are > the above possible? Yes, about restricting BPF program types. Definitely "No" for "probes if they are running without a process context, if the triggered probe is owned by root by specific uid". "Maybe" for "under a specific cgroup hierarchy", which we could add in some form, but we can only control where a BPF program is attached. Even then, nothing will prevent a BPF program from reading random kernel memory. But at least such BPF programs won't be able to control, say, network traffic of unintended cgroups. But the last part is not implemented in this patch set and should be discussed separately. > > > > > Again I think LSM and bpf+lsm should have the final word on this too... > > > > > > > Yes, I also think that having LSM on top is beneficial. But not a > > strict requirement and more or less orthogonal. > > I do think there should be LSM hooks to tighten this, as LSMs have > more context outside of BPF... Agreed, but it should be added on top as a separate follow-up patch set. > > > > > > > > > Regardless, pinning BPF object in BPF FS is just basically bumping a > > > > refcnt and exposes that object in a way that can be looked up through > > > > file system path (using bpf() syscall's BPF_OBJ_GET command). > > > > Underlying object isn't cloned or copied, it's exactly the same object > > > > with the same shared internal state.
> > > > > > This is the part I also find strange, I can understand pinning a bpf > > > program, map, etc, but an fd that gives some access rights should be > > > part of the filesystem from the start, I don't get the extra pinning. > > > > BPF pinning of BPF token is optional. Everything still works without > > any BPF FS mount at all. It's an FD, BPF FS is just one of the means > > to pass FD to another process. I actually don't see why coupling BPF > > FS and BPF token is simpler. > > I think it's better the other way around since bpffs is per super > block and separate mount then it is already solved, you just get that > special fd from the fs and pass it... > Ok, I see your point, I have a slightly different proposal for some parts of it, but I'll explain in reply to Christian. > > > Now, BPF token is a kernel object, with its own state. It has an FD > > associated with it. It can be passed around and provided as an > > argument to bpf() syscall. In that sense it's just like BPF > > prog/map/link, just another BPF object. > > > > > Also it seems bpffs is per superblock mount so why not allow > > > privileged to mount bpffs with the corresponding information, then > > > privileged can open the fd, set it up and pass it down the line when > > > executing the main program? or even allow unprivileged to open it on > > > bpffs with some restrictive conditions? > > > > > > Then it would be the business of the privileged to bind mount bpffs in > > > some other places, share it, etc > > > > How is this fundamentally different from BPF token pinning by > > *privileged* process? Except we are not conflating BPF FS as a way to > > pin/get many different BPF objects with BPF token itself. In both > > cases it's up to privileged process to set up sharing of BPF token > > appropriately. > > I'm not convinced about the use case of sharing BPF tokens between > containers or services... > > Every container or service has its own separate bpffs, what's the > point of pinning a shared token created by a different container > compared to mounting separate bpffs with an fd token prepared to be > used for that specific container? > > Then the container/service can delegate it to child processes, etc... > but sharing between containers and crossing user namespaces, mount > namespaces of such containers where bpffs is already separate in that > context? I don't see the point, and it just opens the room to token > misuse... > I don't have a specific use case or need for this. It's more of a principle that an API should not assume or dictate how exactly user-space is going to use it, so I'd say we shouldn't prevent whatever crazy scenario doesn't violate common sense. But I get that lots of people are concerned about BPF token leaking into unintended neighboring containers, so maybe we should bake in a mechanism to make this impossible. Again, let's talk in the next reply. > > > > > > > Having the fd or "token" that gives access rights pinned in two > > > separate bpffs mounts seems too much, it crosses namespaces (mount, > > > userns etc), environments setup by privileged... > > > > See above, there is nothing namespaceable about BPF itself, and BPF > > token as well. If some production setup benefits from pinning one BPF > > token in multiple places, I don't see the problem with that. > > > > > > > > I would just make it per bpffs mount and that's it, nothing more. If a > > > program wants to bind mount it somewhere else then it's not a bpf > > > problem.
> > > > And if some application wants to pin BPF token, why would that be BPF > > subsystem's problem as well? > > The credentials, capabilities, keyring, different namespaces, etc are > all attached to the owning user namespace, if the BPF subsystem goes > its own way and creates a token to split up CAP_BPF without following > that model, then it's definitely a BPF subsystem problem... I don't > recommend that. > > Feels it's going more of a system-wide approach opening BPF > functionality where ultimately it clashes with the argument: delegate > a subset of BPF functionality to a *trusted* unprivileged application. > My reading of delegation is within a container/service hierarchy > nothing more.
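For reference, the pin/get flow under discussion uses two existing bpf() commands; the sketch below shows them with their real uapi fields, though pinning a *token* this way is the patch set's proposal rather than current mainline behavior:

/* Sketch of pin/get through BPF FS using the existing BPF_OBJ_PIN and
 * BPF_OBJ_GET commands. Error handling omitted for brevity. */
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int obj_pin(int fd, const char *path)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.bpf_fd = fd;
        attr.pathname = (__u64)(unsigned long)path;
        return syscall(SYS_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
}

static int obj_get(const char *path)
{
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (__u64)(unsigned long)path;
        return syscall(SYS_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
}

A privileged launcher could create a token, obj_pin() it at a path visible only in one container's bpffs mount, and let the workload obj_get() it; whether that pin may cross mount namespaces is exactly the open question in this subthread.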
On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote: > > On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote: > > On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko > > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > > > > > > > > > > > > Hi Andrii, > > > > > > > > > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > > > > > ... > > > > > > > creating new BPF objects like BPF programs, BPF maps, etc. > > > > > > > > > > > > Is there a reason for coupling this only with the userns? > > > > > > > > > > There is no coupling. Without userns it is at least possible to grant > > > > > CAP_BPF and other capabilities from init ns. With user namespace that > > > > > becomes impossible. > > > > > > > > But these are not the same: delegate full cap vs delegate an fd mask? > > > > > > What FD mask are we talking about here? I don't recall us talking > > > about any FD masks, so this one is a bit confusing without more > > > context. > > > > Ah err, sorry yes referring to fd token (which I assumed is a mask of > > allowed operations or something like that). > > > > So I want the possibility to delegate the fd token in the init userns. > > > > > > > > > > One can argue unprivileged in init userns is the same privileged in > > > > nested userns > > > > Getting to delegate fd in init userns, then in nested ones seems logical... > > > > > > Again, sorry, I'm not following. Can you please elaborate what you mean? > > > > I mean can we use the fd token in the init user namespace too? not > > only in the nested user namespaces but in the first one? Sorry I > > didn't check the code. > > [...] > > > > > > > > > > Having the fd or "token" that gives access rights pinned in two > > > > separate bpffs mounts seems too much, it crosses namespaces (mount, > > > > userns etc), environments setup by privileged... > > > > > > See above, there is nothing namespaceable about BPF itself, and BPF > > > token as well. If some production setup benefits from pinning one BPF > > > token in multiple places, I don't see the problem with that. > > > > > > > > > > > I would just make it per bpffs mount and that's it, nothing more. If a > > > > program wants to bind mount it somewhere else then it's not a bpf > > > > problem. > > > > > > And if some application wants to pin BPF token, why would that be BPF > > > subsystem's problem as well? > > > > The credentials, capabilities, keyring, different namespaces, etc are > > all attached to the owning user namespace, if the BPF subsystem goes > > its own way and creates a token to split up CAP_BPF without following > > that model, then it's definitely a BPF subsystem problem... I don't > > recommend that. > > > > Feels it's going more of a system-wide approach opening BPF > > functionality where ultimately it clashes with the argument: delegate > > a subset of BPF functionality to a *trusted* unprivileged application. > > My reading of delegation is within a container/service hierarchy > > nothing more. > > You're making the exact arguments that Lennart, Aleksa, and I have been > making in the LSFMM presentation about this topic. 
It's even recorded: Alright, so (I think) I get a pretty good feel now for what the main concerns are, and why people are trying to push this to be an FS. And it's not so much that BPF token grants bpf() syscall usage to unpriv (but trusted) workloads or that BPF itself is not namespaceable. The main worry is that BPF token, once issued, could be illegally/uncontrollably passed outside of a container, intentionally or not. And by having this association with mount namespace (through BPF FS) we automatically limit the sharing to only containers that have access to that BPF FS. So I agree that it makes sense to have this mount namespace association, but I also would like to keep BPF token a separate entity from BPF FS itself, and have the ability to have multiple different BPF tokens exposed in a single BPF FS instance. I think the latter is important. So how about this slight modification: when a BPF token is created using the BPF_TOKEN_CREATE command, the user has to provide an FD for the "associated" BPF FS instance (superblock). What that does is allow BPF token to be created with BPF FS and/or mount namespace association set in stone. After that BPF token can only be pinned in that BPF FS instance and cannot leave the boundaries of that mount namespace (specific details to be worked out, this is a new area for me, so I'm sorry if I'm missing nuances). What this slight tweak gives us is that we can still have multiple BPF token instances within a single BPF FS. It is still pinnable/gettable through common bpf() syscall's BPF_OBJ_PIN/BPF_OBJ_GET commands. You can still have more nuanced file permissions, and getting the BPF token can be controlled further through LSM. Also we still get to use an extensible and familiar (to BPF users) bpf_attr binary approach. Basically, it is very much native to BPF subsystem, but it is mount namespace-bound as was requested by proponents of merging BPF token and BPF FS together. I assume that this BPF FS fd can be fetched using fsopen() or fspick() syscalls, is that right? WDYT? Does that sound like it would address all the above concerns? Please point to any important details I might be missing (as I mentioned, very unfamiliar territory). > > https://youtu.be/4CCRTWEZLpw?t=1546 > > So we fully agree with you here. I actually just rewatched that entire discussion. :) And after talking about BPF token at length in the halls of the conference and email discussions on this patch set, it was very useful to relisten (again) to all the finer points that were made back then. Thanks for the reminder and the link.
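To make the proposed modification concrete, the create flow might look like the sketch below. fspick() is the real syscall Andrii asks about and does yield an fd for an existing superblock; the token-creation command number and attr layout are hypothetical stand-ins for the proposal, not existing uapi:

/* Hypothetical sketch of the proposed flow: pick the BPF FS instance,
 * then create a token bound to it. The command number and attr layout
 * are made up for illustration; fspick() is real (Linux 5.2+). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BPF_TOKEN_CREATE_NR 36          /* hypothetical bpf() command */

struct bpf_token_create_attr {          /* hypothetical attr layout */
        __u32 flags;
        __u32 bpffs_fd;                 /* "associated" BPF FS instance */
};

int token_create_bound(const char *bpffs_path)
{
        struct bpf_token_create_attr attr = { 0 };
        int fs_fd, token_fd;

        /* An O_PATH open of the mount point could serve the same purpose. */
        fs_fd = syscall(SYS_fspick, AT_FDCWD, bpffs_path, FSPICK_CLOEXEC);
        if (fs_fd < 0)
                return -1;

        attr.bpffs_fd = fs_fd;          /* token now tied to this mount */
        token_fd = syscall(SYS_bpf, BPF_TOKEN_CREATE_NR, &attr, sizeof(attr));
        close(fs_fd);
        return token_fd;        /* pinnable only inside that BPF FS */
}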
On Wed, Jun 14, 2023 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > >> > >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > >> > >> > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > >> >> > >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > >> >> > >> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote: > >> >> >> > >> >> >> Andrii Nakryiko <andrii@kernel.org> writes: > >> >> >> > >> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate > >> >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g., > >> >> >> > systemd or any other container manager) to a *trusted* unprivileged > >> >> >> > application. Trust is the key here. This functionality is not about allowing > >> >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is > >> >> >> > completely up to the discretion of respective privileged application that > >> >> >> > would create a BPF token. > >> >> >> > >> >> >> I am not convinced that this token-based approach is a good way to solve > >> >> >> this: having the delegation mechanism be one where you can basically > >> >> >> only grant a perpetual delegation with no way to retract it, no way to > >> >> >> check what exactly it's being used for, and that is transitive (can be > >> >> >> passed on to others with no restrictions) seems like a recipe for > >> >> >> disaster. I believe this was basically the point Casey was making as > >> >> >> well in response to v1. > >> >> > > >> >> > Most of this can be added, if we really need to. Ability to revoke BPF > >> >> > token is easy to implement (though of course it will apply only for > >> >> > subsequent operations). We can allocate ID for BPF token just like we > >> >> > do for BPF prog/map/link and let tools iterate and fetch information > >> >> > about it. As for controlling who's passing what and where, I don't > >> >> > think the situation is different for any other FD-based mechanism. You > >> >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS > >> >> > or BPF FS, and that application can keep doing the same to other > >> >> > processes. > >> >> > >> >> No, but every other fd-based mechanism is limited in scope. E.g., if you > >> >> pass a map fd that's one specific map that can be passed around, with a > >> >> token it's all operations (of a specific type) which is way broader. > >> > > >> > It's not black and white. Once you have a BPF program FD, you can > >> > attach it many times, for example, and cause regressions. Sure, here > >> > we are talking about creating multiple BPF maps or loading multiple > >> > BPF programs, so it's wider in scope, but still, it's not that > >> > fundamentally different. > >> > >> Right, but the difference is that a single BPF program is a known > >> entity, so even if the application you pass the fd to can attach it > >> multiple times, it can't make it do new things (e.g., bpf_probe_read() > >> stuff it is not supposed to). Whereas with bpf_token you have no such > >> guarantee. > > > > Sure, I'm not claiming BPF token is just like passing BPF program FD > > around. My point is that anything in the kernel that is representable > > by FD can be passed around to an unintended process through > > SCM_RIGHTS. 
And if you want to have tighter control over who's passing > > what, you'd probably need LSM. But it's not a requirement. > > > > With BPF token it is important to trust the application you are > > passing BPF token to. This is not a mechanism to just freely pass > > around the ability to do BPF. You do it only to applications you > > control. > > Trust is not binary, though. "Do I trust this application to perform > this specific action" is different from "do I trust this application to > perform any action in the future". A security mechanism should grant the > minimum required privileges required to perform the operation; this > token thing encourages (defaults to) broader grants, which is worrysome. BPF token defaults to not allowing anything, unless you explicitly allow commands/progs/maps. If you don't set allow_cmds, you literally get a useless BPF token that grants you nothing. > > > With user namespaces, if we could grant CAP_BPF and co to use BPF, > > we'd do that. But we can't. BPF token at least gives us this > > opportunity. > > If the use case is to punch holes in the user namespace isolation I feel > like that is better solved at the user namespace level than the BPF > subsystem level... > > -Toke > > > (Ran out of time and I'm about to leave for PTO, so dropping the RPC > discussion for now) >
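In code terms, the default-deny model described above might look like this sketch; allow_cmds is the knob named in the reply, but the exact attr layout shown is illustrative, not final uapi:

/* Sketch of the default-deny semantics described above: a token grants
 * nothing unless specific bpf() commands are allowlisted at creation. */
#include <linux/bpf.h>
#include <linux/types.h>

struct token_create_attr {              /* illustrative layout */
        __u64 allow_cmds;               /* bitmask of permitted bpf() commands */
};

void fill_allow_cmds(struct token_create_attr *attr)
{
        /* Left at 0, the resulting token grants nothing at all. */
        attr->allow_cmds = 0;

        /* A minimal grant: map creation and program loading only. */
        attr->allow_cmds = (1ULL << BPF_MAP_CREATE) | (1ULL << BPF_PROG_LOAD);
}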
On Fri, Jun 9, 2023, at 12:08 PM, Andrii Nakryiko wrote: > On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote: >> >> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote: >> > This patch set introduces new BPF object, BPF token, which allows to delegate >> > a subset of BPF functionality from privileged system-wide daemon (e.g., >> > systemd or any other container manager) to a *trusted* unprivileged >> > application. Trust is the key here. This functionality is not about allowing >> > unconditional unprivileged BPF usage. Establishing trust, though, is >> > completely up to the discretion of respective privileged application that >> > would create a BPF token. >> > >> >> I skimmed the description and the LSFMM slides. >> >> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such). It went nowhere. >> >> Where does BPF token fit in? Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container? > > Yes?.. In the sense that it is possible to create BPF programs and BPF > maps from inside the container (with BPF token). Right now under user > namespace it's impossible no matter what you do. I have no problem with creating BPF maps inside a container, but I think the maps should *be in the container*. My series wasn’t about unprivileged BPF per se. It was about updating the existing BPF permission model so that it made sense in a context in which it had multiple users that didn’t trust each other. > >> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. > > BPF is still a privileged thing. You can't just say that any > unprivileged application should be able to use BPF. That's why BPF > token is about trusting unpriv application in a controlled environment > (production) to not do something crazy. It can be enforced further > through LSM usage, but in a lot of cases, when dealing with internal > production applications it's enough to have a proper application > design and rely on code review process to avoid any negative effects. We really shouldn’t be creating new kinds of privileged containers that do uncontained things. If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. > > So privileged daemon (container manager) will be configured with the > knowledge of which services/containers are allowed to use BPF, and > will grant BPF token only to those that were explicitly allowlisted.
On Mon, Jun 19, 2023 at 10:40 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Fri, Jun 9, 2023, at 12:08 PM, Andrii Nakryiko wrote: > > On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote: > >> > >> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote: > >> > This patch set introduces new BPF object, BPF token, which allows to delegate > >> > a subset of BPF functionality from privileged system-wide daemon (e.g., > >> > systemd or any other container manager) to a *trusted* unprivileged > >> > application. Trust is the key here. This functionality is not about allowing > >> > unconditional unprivileged BPF usage. Establishing trust, though, is > >> > completely up to the discretion of respective privileged application that > >> > would create a BPF token. > >> > > >> > >> I skimmed the description and the LSFMM slides. > >> > >> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such). It went nowhere. > >> > >> Where does BPF token fit in? Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container? > > > > Yes?.. In the sense that it is possible to create BPF programs and BPF > > maps from inside the container (with BPF token). Right now under user > > namespace it's impossible no matter what you do. > > I have no problem with creating BPF maps inside a container, but I think the maps should *be in the container*. > > My series wasn’t about unprivileged BPF per se. It was about updating the existing BPF permission model so that it made sense in a context in which it had multiple users that didn’t trust each other. I don't think it's possible with BPF, in principle, as I mentioned in the cover letter. Even if some particular types of programs could be "contained" in some sense, in general BPF is too global by its nature (it observes everything in kernel memory, it can influence system-wide behaviors, etc). > > > > >> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. > > > > BPF is still a privileged thing. You can't just say that any > > unprivileged application should be able to use BPF. That's why BPF > > token is about trusting unpriv application in a controlled environment > > (production) to not do something crazy. It can be enforced further > > through LSM usage, but in a lot of cases, when dealing with internal > > production applications it's enough to have a proper application > > design and rely on code review process to avoid any negative effects. > > We really shouldn’t be creating new kinds of privileged containers that do uncontained things. > > If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. Please see Hao's reply ([0]) about his and Google's (not so rosy) experiences with building and using such BPF proxy. We (Meta) internally didn't go this route at all and strongly prefer not to. There are lots of downsides and complications to having a BPF proxy. In the end, this is just shuffling around where the decision about trusting a given application with BPF access is being made. BPF proxy adds lots of unnecessary logistical, operational, and development complexity, but doesn't magically make anything safer. 
[0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ > > > > > So privileged daemon (container manager) will be configured with the > > knowledge of which services/containers are allowed to use BPF, and > > will grant BPF token only to those that were explicitly allowlisted. >
On 22/06/2023 00:48, Andrii Nakryiko wrote: > >>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. >>> BPF is still a privileged thing. You can't just say that any >>> unprivileged application should be able to use BPF. That's why BPF >>> token is about trusting unpriv application in a controlled environment >>> (production) to not do something crazy. It can be enforced further >>> through LSM usage, but in a lot of cases, when dealing with internal >>> production applications it's enough to have a proper application >>> design and rely on code review process to avoid any negative effects. >> We really shouldn’t be creating new kinds of privileged containers that do uncontained things. >> >> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. > Please see Hao's reply ([0]) about his and Google's (not so rosy) > experiences with building and using such BPF proxy. We (Meta) > internally didn't go this route at all and strongly prefer not to. > There are lots of downsides and complications to having a BPF proxy. > In the end, this is just shuffling around where the decision about > trusting a given application with BPF access is being made. BPF proxy > adds lots of unnecessary logistical, operational, and development > complexity, but doesn't magically make anything safer. > > [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ > Apologies for being blunt, but the token approach to me seems to be a workaround for providing the right level/classification for a pod/container in order to say you support unprivileged containers using eBPF. I think if your container needs to do privileged things it should have, and be classified with, the right permissions (privileges) to do what it needs to do. The proxy BPF on behalf of the container approach works for containers that don't need to do privileged BPF operations. I have to say that the `proxy BPF on behalf of the container` meets the needs of unprivileged pods, and at the same time giving CAP_BPF to the applications meets the needs of those pods that need to do privileged/bpf things without any tokens. Ultimately you are trusting these apps in the same way as if you were granting a token. >>> So privileged daemon (container manager) will be configured with the >>> knowledge of which services/containers are allowed to use BPF, and >>> will grant BPF token only to those that were explicitly allowlisted.
On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote: > On 22/06/2023 00:48, Andrii Nakryiko wrote: >> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. >>>> BPF is still a privileged thing. You can't just say that any >>>> unprivileged application should be able to use BPF. That's why BPF >>>> token is about trusting unpriv application in a controlled environment >>>> (production) to not do something crazy. It can be enforced further >>>> through LSM usage, but in a lot of cases, when dealing with internal >>>> production applications it's enough to have a proper application >>>> design and rely on code review process to avoid any negative effects. >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things. >>> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. >> Please see Hao's reply ([0]) about his and Google's (not so rosy) >> experiences with building and using such BPF proxy. We (Meta) >> internally didn't go this route at all and strongly prefer not to. >> There are lots of downsides and complications to having a BPF proxy. >> In the end, this is just shuffling around where the decision about >> trusting a given application with BPF access is being made. BPF proxy >> adds lots of unnecessary logistical, operational, and development >> complexity, but doesn't magically make anything safer. >> >> [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ >> > Apologies for being blunt, but the token approach to me seems to be a > work around providing the right level/classification for a pod/container > in order to say you support unprivileged containers using eBPF. I think > if your container needs to do privileged things it should have and be > classified with the right permissions (privileges) to do what it needs > to do. Bluntness is great. I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is. "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk" That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay. "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network" The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port. You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network. This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules. 
"the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF" My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them. I even *wrote the code*. But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts. Please try harder.
On Thu, Jun 22, 2023 at 1:23 AM Maryam Tahhan <mtahhan@redhat.com> wrote: > > On 22/06/2023 00:48, Andrii Nakryiko wrote: > > > >>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. > >>> BPF is still a privileged thing. You can't just say that any > >>> unprivileged application should be able to use BPF. That's why BPF > >>> token is about trusting unpriv application in a controlled environment > >>> (production) to not do something crazy. It can be enforced further > >>> through LSM usage, but in a lot of cases, when dealing with internal > >>> production applications it's enough to have a proper application > >>> design and rely on code review process to avoid any negative effects. > >> We really shouldn’t be creating new kinds of privileged containers that do uncontained things. > >> > >> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. > > Please see Hao's reply ([0]) about his and Google's (not so rosy) > > experiences with building and using such BPF proxy. We (Meta) > > internally didn't go this route at all and strongly prefer not to. > > There are lots of downsides and complications to having a BPF proxy. > > In the end, this is just shuffling around where the decision about > > trusting a given application with BPF access is being made. BPF proxy > > adds lots of unnecessary logistical, operational, and development > > complexity, but doesn't magically make anything safer. > > > > [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ > > > Apologies for being blunt, but the token approach to me seems to be a > work around providing the right level/classification for a pod/container > in order to say you support unprivileged containers using eBPF. I think > if your container needs to do privileged things it should have and be > classified with the right permissions (privileges) to do what it needs > to do. For one, when user namespaces are involved, there is no BPF use at all, no matter how privileged you want to mark the container. I mentioned this in the cover letter. Now, the claim is that user namespaces are indeed useful and necessary, and yet we also want such user-namespaced applications to be able to use BPF. Currently there is no solution to that. And external BPF service is not a great one, see [0], for real world users' feedback. [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ > > The proxy BPF on behalf of the container approach works for containers > that don't need to do privileged BPF operations. BPF usage *is privileged* in all but some tiny use cases that are ok with heavily limited unprivileged BPF functionality (and even then recommendation is to disable unprivileged BPF altogether). Whether you proxy such privileged BPF usage through an external application or you are granting BPF token to such application is in the same category: someone has to decide to trust the application to perform privileged BPF operations. And the only debatable thing here is whether the application itself should do bpf() syscalls directly and be able to use the entire BPF ecosystem of libraries, tools, techniques, and approaches. Or we go and rewrite the world to use some RPC-based proxy to bpf() syscall? 
And to put it bluntly, the latter is not a realistic (or even good) option. > > I have to say that the `proxy BPF on behalf of the container` meets the > needs of unprivileged pods and at the same time giving CAP_BPF to the I tried to make it very clear in the cover letter, but granting CAP_BPF under user namespace means precisely nothing. CAP_BPF is only useful in the init namespace. > applications meets the needs of these PODs that need to do > privileged/bpf things without any tokens. Ultimately you are trusting > these apps in the same way as if you were granting a token. Yes, absolutely. As I mentioned very explicitly, it's a question of trusting the application. Service vs token is an implementation detail, but one that has huge implications in how applications are built, tested, versioned, deployed, etc. > > > >>> So privileged daemon (container manager) will be configured with the > >>> knowledge of which services/containers are allowed to use BPF, and > >>> will grant BPF token only to those that were explicitly allowlisted. > >
On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote: > Please avoid replying in HTML. > On 22/06/2023 17:49, Andy Lutomirski wrote: > > Apologies for being blunt, but the token approach to me seems to be a > work around providing the right level/classification for a pod/container > in order to say you support unprivileged containers using eBPF. I think > if your container needs to do privileged things it should have and be > classified with the right permissions (privileges) to do what it needs > to do. > > Bluntness is great. > > I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is. > > "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk" > > That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay. > > "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network" > > The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port. You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network. > > This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules. > > "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF" > > My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them. I even *wrote the code*. But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts. > > Please try harder. > > I'm going to be honest, I can't tell if we are in agreement or not :). I'm also going to use pod and container interchangeably throughout my response (bear with me) > > > So just to clarify a few things on my end. When I said "level/classification" I meant privileges --> A container should have the right level of privileges assigned to it for what it's trying to do in the K8s scenario through its pod spec. To me it seems like BPF token is a way to work around the permissions assigned to a container in K8s for example: with bpf_token I'm marking a pod as unprivileged but then under the hood, through another service I'm giving it a token to do more than what was specified in its pod spec. Yeah I have a separate service controlling the tokens but something about it just seems not right (to me). If CAP_BPF is too broad, can we break it down further into something more granular? Something that can be assigned to the container through the pod spec rather than a separate service that seems to be doing things under the hood?
This doesn't even start to solve the problem I know... Disclaimer: I don't know anything about Kubernetes, so don't expect me to reply with correct terminology or detailed understanding of configuration of containers. But on a more generic and conceptual level, it seems like you are making some implementation assumptions and arguing based on that. Like, why can't a container spec have native support for "granted BPF functionality"? Why would BPF token have to be granted through some separate service and not integrated into whatever Kubernetes' "container manager" functionality and just be a natural extension of the spec? As for CAP_BPF being too broad: it is broad, yes. If you have good ideas how to break it down some more -- please propose. But this is all orthogonal, because the blocking problem is fundamental incompatibility of user namespaces (and their implied isolation and sandboxing of workloads) and BPF functionality, which is global by its very nature. The latter is unavoidable in principle. No matter how much you break down CAP_BPF, you can't enforce that BPF program won't interfere with applications in other containers. Or that it won't "spy" on them. It's just not what BPF can enforce in principle. So that comes back down to a question of trust and then controlled delegation of BPF functionality. You trust workload with BPF usage because you reviewed the BPF code, workload, testing, etc? Grant BPF token and let that container use limited subset of BPF. Employ BPF LSM to further restrict it beyond what BPF token can control. You cannot trust an application to not do something harmful? You shouldn't grant it either CAP_BPF in the init namespace or a BPF token in a user namespace. That's it. Pick your poison. But all this cannot be mechanically decided or enforced. There has to be some humans involved in making these decisions. Kernel's job is to provide building blocks to grant and control BPF functionality to the extent that it is technically possible. > > I understand the difficulties with trying to deploy BPF in K8s and the concerns around privilege escalation for the containers. I understand not all use cases are created equally but I think this falls into at least 2 categories: > > - Pods/Containers that need to do privileged BPF ops but not under a CAP_BPF umbrella --> sure we need something for this. > - Pods/Containers that don't need to do any privileged BPF ops but still use BPF --> these are happy with a proxy service loading/unloading the bpf progs, creating maps and pinning them... But even in this scenario we need something to isolate the pinned maps/progs by different apps (why not DAC rules?), even better if the maps are in the container... The above doesn't make much sense to me, sorry. If the application is ok using unprivileged BPF, there is no problem there. They can today already and there is no BPF proxy or BPF token involved. As for "something to isolate the pinned maps/progs by different apps (why not DAC rules?)", there is no such thing, as I've explained already. I can install sched_switch raw_tracepoint BPF program (if I'm allowed to), and that program has system-wide observability. It cannot be bound to an application. You can't just say "trigger this sched_switch program only for scheduler decisions within my container". When you actually start thinking about just that one example, even assuming we add some per-container filter in the kernel to not trigger your program, then what do we do when we switch from process A in container X to process B in container Y?
Does that event belong to container X, or to container Y? How can you prevent a program from reading a task's data that doesn't belong to your container, when both are inputs to this single tracepoint event? Hopefully you can see where I'm going with this. And this is just one random tiny example. We can think up tons of other cases to prove BPF is not isolatable to any sort of "container". > > Anyway - I hope this clarifies my original intent - which is proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems the more we can stick to tried and trusted mechanisms the better. I disagree. BPF proxy complicates logistics, operations, and developer experience, without resolving the issue of determining trust and the need to delegate or proxy BPF functionality.
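To make the sched_switch example above concrete, here is a minimal sketch of such a program (illustrative only, not code from the patch set): the raw tracepoint fires for every context switch on every CPU, and nothing in the program type or attach point scopes the event stream to a container.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Count context switches system-wide. At this tracepoint,
 * ctx->args[1] is the outgoing task and ctx->args[2] the incoming
 * one, whichever containers they belong to; there is no
 * per-namespace filter to apply here. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} switch_count SEC(".maps");

SEC("raw_tracepoint/sched_switch")
int count_switches(struct bpf_raw_tracepoint_args *ctx)
{
	__u32 key = 0;
	__u64 *cnt = bpf_map_lookup_elem(&switch_count, &key);

	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

Attaching even this trivial program requires privilege (or, under this proposal, a token) precisely because its view is system-wide.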
On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote: > > On 22/06/2023 00:48, Andrii Nakryiko wrote: > >> > >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. > >>>> BPF is still a privileged thing. You can't just say that any > >>>> unprivileged application should be able to use BPF. That's why BPF > >>>> token is about trusting unpriv application in a controlled environment > >>>> (production) to not do something crazy. It can be enforced further > >>>> through LSM usage, but in a lot of cases, when dealing with internal > >>>> production applications it's enough to have a proper application > >>>> design and rely on code review process to avoid any negative effects. > >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things. > >>> > >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. > >> Please see Hao's reply ([0]) about his and Google's (not so rosy) > >> experiences with building and using such BPF proxy. We (Meta) > >> internally didn't go this route at all and strongly prefer not to. > >> There are lots of downsides and complications to having a BPF proxy. > >> In the end, this is just shuffling around where the decision about > >> trusting a given application with BPF access is being made. BPF proxy > >> adds lots of unnecessary logistical, operational, and development > >> complexity, but doesn't magically make anything safer. > >> > >> [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ > >> > > Apologies for being blunt, but the token approach to me seems to be a > > work around providing the right level/classification for a pod/container > > in order to say you support unprivileged containers using eBPF. I think > > if your container needs to do privileged things it should have and be > > classified with the right permissions (privileges) to do what it needs > > to do. > > Bluntness is great. > > I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is. BPF is not "anything else", it's important to understand that BPF is inherently not compartmentalizable. And it's vast and generic in its capabilities. This changes everything. So your analogies are misleading. > > "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk" > > That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay. > > "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network" > > The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.
You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network. > > This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules. > > "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF" > > My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them. Can you apply DAC rules to which kernel events BPF program can be run on? Can you apply DAC rules to which in-kernel data structures a BPF program can look at and make sure that it doesn't access a task/socket/etc that "belongs" to some other container/user/etc? Can we limit XDP or AF_XDP BPF programs from seeing and controlling network traffic that will be eventually routed to a container that XDP program "should not" have access to? Without making everything so slow that it's useless? > I even *wrote the code*. Did you submit it upstream for review and wide discussion? Did you test it and integrate it with production workloads to prove that your solution is actually a viable real-world solution and not a toy? Writing the code doesn't mean solving the problem. > But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts. I won't speak on behalf of the entire BPF community, but I'm trying to explain that BPF cannot be reasonably sandboxed and has to be privileged due to its global nature. And I haven't yet seen any realistic counter-proposal to change that. And it's not about ownership of the BPF map or BPF program, it's way beyond that.. > > Please try harder. Well, maybe there is something in that "some reason" you mentioned above that you so quickly dismissed?
On Thu, Jun 22, 2023 at 7:40 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote: > > > > Please avoid replying in HTML. > Sorry. [...] > > Disclaimer: I don't know anything about Kubernetes, so don't expect me to > reply with correct terminology or detailed understanding of > configuration of containers. > > But on a more generic and conceptual level, it seems like you are > making some implementation assumptions and arguing based on that. > Firstly, thank you for taking the time to respond and explain. I can see where you are coming from. Yeah, admittedly I did make a few assumptions. I was thrown by the reference to `unprivileged` processes in the cover letter. It seems like this is a way to grant namespaced BPF permissions to a process (my gross oversimplification - sorry). Looking back throughout your responses there's nothing unprivileged here. [...] > Hopefully you can see where I'm going with this. And this is just one > random tiny example. We can think up tons of other cases to prove BPF > is not isolatable to any sort of "container". > > > > > Anyway - I hope this clarifies my original intent - which is proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems the more we can stick to tried and trusted mechanisms the better. > > I disagree. BPF proxy complicates logistics, operations, and developer > experience, without resolving the issue of determining trust and the > need to delegate or proxy BPF functionality. I appreciate your viewpoint. I just don't think that this is a one-solution-fits-every-scenario situation. For example in the case of AF_XDP, I'd like to be able to run my containers without any additional privileges. I've been working on a device plugin for Kubernetes whose job is to provision netdevs with an XDP redirect program (then later there's a CNI that moves the netdev into the pod network namespace). Originally I was using bpf locally in the device plugin (to load the bpf program and get the XSK map fd) and SCM rights to pass the XSK_MAP over UDS but honestly it was relatively cumbersome from an app development POV, very easy to get wrong, and trying to keep up with the latest bpf api changes started to become an issue. If I wanted to add more interesting bpf programs I had to do a full recompile... I've now moved to using bpfd for the loading and unloading of the bpf program on my behalf. It also comes with a bunch of other advantages, including being able to update my trusted bpf program transparently to both the device plugin and my application (I don't have to respin this either when I write/want to add a new bpf prog), but mainly I have a trusted proxy managing bpffs, bpf progs and maps for me. There's still more work to do here... I understand this is a much simplified scenario, and I'm sure I can think of several more where proxy is useful. All I'm trying to say is, I'm not sure there's just a one-size-fits-all solution for these issues. Thanks Maryam
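For reference, the SCM_RIGHTS handoff Maryam describes reduces to a few lines of standard socket code. A minimal sketch (the send_fd() helper name is invented for illustration); it works for any BPF object FD, XSK maps included:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send an already-created BPF map FD over a connected Unix domain
 * socket; the receiver gets its own FD referring to the same map. */
int send_fd(int uds, int map_fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* pass a file descriptor */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &map_fd, sizeof(int));

	return sendmsg(uds, &msg, 0) < 0 ? -1 : 0;
}

The mechanism itself is simple; the cumbersome part Maryam points at is everything around it (wire protocol, lifecycle, keeping the daemon's BPF bits in sync with the app).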
On Thu, Jun 22, 2023 at 2:04 PM Maryam Tahhan <mtahhan@redhat.com> wrote: > > On Thu, Jun 22, 2023 at 7:40 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote: > > > > > > > Please avoid replying in HTML. > > > > Sorry. No worries, the problem is that the mailing list filters out such messages. So if you go to [0] and scroll to the bottom of the page, you'll see that your email is not in the lore archive. People not CC'ed directly will only see what you wrote through my reply quoting your email. [0] https://lore.kernel.org/bpf/CAFdtZitYhOK4TzAJVbFPMfup_homxSSu3Q8zjJCCiHCf22eJvQ@mail.gmail.com/#t > > [...] > > > > > Disclaimer: I don't know anything about Kubernetes, so don't expect me to > > reply with correct terminology or detailed understanding of > > configuration of containers. > > > > But on a more generic and conceptual level, it seems like you are > > making some implementation assumptions and arguing based on that. > > > > Firstly, thank you for taking the time to respond and explain. I can see > where you are coming from. > > Yeah, admittedly I did make a few assumptions. I was thrown by the reference > to `unprivileged` processes in the cover letter. It seems like this is a way to > grant namespaced BPF permissions to a process (my gross > oversimplification - sorry). Yep, with the caveat that BPF functionality itself cannot be namespaced (i.e., contained within the container), so this has to be granted by a fully privileged process/proxy based on trusting the workload to not do anything harmful. > Looking back throughout your responses there's nothing unprivileged here. > > [...] > > > > Hopefully you can see where I'm going with this. And this is just one > > random tiny example. We can think up tons of other cases to prove BPF > > is not isolatable to any sort of "container". > > > > > > > > Anyway - I hope this clarifies my original intent - which is proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems the more we can stick to tried and trusted mechanisms the better. > > > > I disagree. BPF proxy complicates logistics, operations, and developer > > experience, without resolving the issue of determining trust and the > > need to delegate or proxy BPF functionality. > > I appreciate your viewpoint. I just don't think that this is a > one-solution-fits-every-scenario situation. Absolutely. It's also not my intent or goal to kill any sort of BPF proxy. What I'm trying to convey is that the BPF proxy approach has severe downsides, depending on application, deployment practices, etc, etc. It's not always a (good) answer. So I just want to avoid having the dichotomy of "BPF token or BPF proxy, there could be only one". > For example in the case of AF_XDP, I'd like to be > able to run > my containers without any additional privileges. I've been working on a device > plugin for Kubernetes whose job is to provision netdevs with an XDP redirect > program (then later there's a CNI that moves the netdev into the pod network > namespace). Originally I was using bpf locally in the device plugin > (to load the > bpf program and get the XSK map fd) and SCM rights to pass the XSK_MAP over > UDS but honestly it was relatively cumbersome from an app development POV, very > easy to get wrong, and trying to keep up with the latest bpf api > changes started to > become an issue.
If I wanted to add more interesting bpf programs I > had to do a full > recompile... > > I've now moved to using bpfd for the loading and unloading of the bpf > program on my behalf. > It also comes with a bunch of other advantages, including being able to > update my trusted bpf > program transparently to both the device plugin and my application (I > don't have to respin this either > when I write/want to add a new bpf prog), but mainly I have a trusted > proxy managing bpffs, bpf progs and maps for me. There's still more > work to do here... > It's a spectrum, and from my observations networking BPF programs lend themselves more easily to this model of BPF proxy (at least until they become complicated ensembles of networking and tracing BPF programs). Very often networking applications can indeed load BPF program completely independently from user-space parts, keep them "persisted" in kernel, occasionally control them through pinned BPF maps, etc. But the further you go towards tracing applications where the BPF parts are an integral part of the overall user-space application, this model doesn't work very well. It's much simpler to have BPF parts embedded, loaded, versioned, initialized and interacted with from inside the same process. And we have lots of such applications. BPF proxy approach is a massive complication for such use cases with a bunch of downsides. > I understand this is a much simplified scenario, and I'm sure I can > think of several more where > proxy is useful. All I'm trying to say is, I'm not sure there's just a > one-size-fits-all solution for these issues. 100% agree. BPF token won't fit all use cases. And BPF proxy won't fit all use cases either. Both approaches can and should coexist. > > Thanks > Maryam >
On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote: > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote: > > As for CAP_BPF being too broad: it is broad, yes. If you have good ideas how to > break it down some more -- please propose. But this is all orthogonal, > because the blocking problem is fundamental incompatibility of user > namespaces (and their implied isolation and sandboxing of workloads) > and BPF functionality, which is global by its very nature. The latter > is unavoidable in principle. How, exactly, is BPF global by its very nature? The *implementation* has some issues with globalness. Much of it should be fixable. > > No matter how much you break down CAP_BPF, you can't enforce that BPF > program won't interfere with applications in other containers. Or that > it won't "spy" on them. It's just not what BPF can enforce in > principle. The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do. There are bugs -- I get that. There are helper functions that are fundamentally global. But, in the absence of verifier bugs, BPF has actual boundaries to its functionality. > > So that comes back down to a question of trust and then controlled > delegation of BPF functionality. You trust workload with BPF usage > because you reviewed the BPF code, workload, testing, etc? Grant BPF > token and let that container use limited subset of BPF. Employ BPF LSM > to further restrict it beyond what BPF token can control. > > You cannot trust an application to not do something harmful? You > shouldn't grant it either CAP_BPF in the init namespace or a BPF token in a > user namespace. That's it. Pick your poison. I think what's lost here is hardening vs restricting intended functionality. We have access control to restrict intended functionality. We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it. BPF needs hardening -- this is well established. Right now, this is accomplished by restricting it to global root (effectively). It should have access controls, too, but it doesn't. > > But all this cannot be mechanically decided or enforced. There has to > be some humans involved in making these decisions. Kernel's job is to > provide building blocks to grant and control BPF functionality to the > extent that it is technically possible. > Exactly. And it DOES NOT. bpf maps, etc do not have sensible access controls. Things that should not be global are global. I'm saying the kernel should fix THAT. Once it's in a state that it's at least credible to allow BPF in a user namespace, then come up with a way to allow it. > As for "something to isolate the pinned maps/progs by different apps > (why not DAC rules?)", there is no such thing, as I've explained > already. > > I can install sched_switch raw_tracepoint BPF program (if I'm allowed > to), and that program has system-wide observability. It cannot be > bound to an application. Great, a real example! Either: (a) don't run this in a container. Have a service for the container to request the help of this program. (b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container). > then what do we do when we switch from process A in container > X to process B in container Y? Does that event belong to container X, > or to container Y?
I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality. If you haven't defined what your program is even supposed to do in a container, don't run it there. > Hopefully you can see where I'm going with this. And this is just one > random tiny example. We can think up tons of other cases to prove BPF > is not isolatable to any sort of "container". No. You have not come up with an example of why BPF is not isolatable to a container. You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security. Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things. They're both under the BPF umbrella. They're not the same thing. Passing a token into a container that allows that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter. It makes no sense.
On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote: > On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote: >> >> >> >> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote: >> > On 22/06/2023 00:48, Andrii Nakryiko wrote: >> >> >> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. >> >>>> BPF is still a privileged thing. You can't just say that any >> >>>> unprivileged application should be able to use BPF. That's why BPF >> >>>> token is about trusting unpriv application in a controlled environment >> >>>> (production) to not do something crazy. It can be enforced further >> >>>> through LSM usage, but in a lot of cases, when dealing with internal >> >>>> production applications it's enough to have a proper application >> >>>> design and rely on code review process to avoid any negative effects. >> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things. >> >>> >> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. >> >> Please see Hao's reply ([0]) about his and Google's (not so rosy) >> >> experiences with building and using such BPF proxy. We (Meta) >> >> internally didn't go this route at all and strongly prefer not to. >> >> There are lots of downsides and complications to having a BPF proxy. >> >> In the end, this is just shuffling around where the decision about >> >> trusting a given application with BPF access is being made. BPF proxy >> >> adds lots of unnecessary logistical, operational, and development >> >> complexity, but doesn't magically make anything safer. >> >> >> >> [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ >> >> >> > Apologies for being blunt, but the token approach to me seems to be a >> > work around providing the right level/classification for a pod/container >> > in order to say you support unprivileged containers using eBPF. I think >> > if your container needs to do privileged things it should have and be >> > classified with the right permissions (privileges) to do what it needs >> > to do. >> >> Bluntness is great. >> >> I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is. > > BPF is not "anything else", it's important to understand that BPF is > inherently not compartmentalizable. And it's vast and generic in its > capabilities. This changes everything. So your analogies are > misleading. > file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc. They are infinitely extensible. They work in containers. What is so special about BPF? >> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk" >> >> That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay.
>> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network" >> >> The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port. You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network. >> >> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules. >> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF" >> >> My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them. > > Can you apply DAC rules to which kernel events BPF program can be run > on? Can you apply DAC rules to which in-kernel data structures a BPF > program can look at and make sure that it doesn't access a > task/socket/etc that "belongs" to some other container/user/etc? No, of course. If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module. It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module. We don't give containers special tokens that let them load arbitrary modules. We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules. But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers. BPF can learn to do this. > > Can we limit XDP or AF_XDP BPF programs from seeing and controlling > network traffic that will be eventually routed to a container that XDP > program "should not" have access to? Without making everything so slow > that it's useless? Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that. Or a vlan or a macvlan or whatever. (I'm assuming XDP can be scoped like this. I'm not that familiar with the details.) > >> I even *wrote the code*. > > Did you submit it upstream for review and wide discussion? Yes. > Did you > test it and integrate it with production workloads to prove that your > solution is actually a viable real-world solution and not a toy? I did test it. I did not integrate it with production workloads. > Writing the code doesn't mean solving the problem. Of course not. My code was a little step in the right direction. The BPF community was apparently not interested in it. > >> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts. > > I won't speak on behalf of the entire BPF community, but I'm trying to > explain that BPF cannot be reasonably sandboxed and has to be > privileged due to its global nature. And I haven't yet seen any > realistic counter-proposal to change that. And it's not about > ownership of the BPF map or BPF program, it's way beyond that.. 
> It's really really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created. If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model. I'm saying that I think there *can* be a security model. But until the maintainers start to believe that, there won't be one.
On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote: > On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote: > >> Hopefully you can see where I'm going with this. And this is just one >> random tiny example. We can think up tons of other cases to prove BPF >> is not isolatable to any sort of "container". > > No. You have not come up with an example of why BPF is not isolatable > to a container. You have come up with an example of why binding to a > sched_switch raw tracepoint does not make sense in a container without > additional mechanisms to give it well defined functionality and > appropriate security. Thinking about this some more: Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example). The workload is in the container. The tracepoint is global. Kernel memory is global unless something that is trusted and understands the containers is doing the reading. And proxying BPF is a mess. So here are a couple of possible solutions: (a) Improve BPF maps a bit so that BPF maps work well in containers. It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags. (IIRC my patch series was a decent step in this direction.) Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container. So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data. (b) Make a way to pass a pre-approved program into a container. So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container. I think (a) is better. In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container. For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers. And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance. You want *one* XDP program fanning the packets out to the relevant containers. If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation. --Andy
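For the XDP side, the "one program fanning packets out" idea maps directly onto the existing XSKMAP mechanism. A minimal sketch (illustrative only; the queue-index-per-container keying is an assumption for brevity):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* One XDP program on the NIC demuxes packets to per-container AF_XDP
 * sockets registered in an XSKMAP; a control plane would insert each
 * container's socket under its queue index. A real policy could
 * instead parse headers to pick the destination container. */
struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsks SEC(".maps");

SEC("xdp")
int fanout(struct xdp_md *ctx)
{
	/* If no socket is registered for this queue, fall back to the
	 * regular kernel stack (the XDP_PASS default action encoded in
	 * the flags argument is supported since kernel 5.12). */
	return bpf_redirect_map(&xsks, ctx->rx_queue_index, XDP_PASS);
}

char LICENSE[] SEC("license") = "GPL";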
On 6/22/2023 8:28 PM, Andy Lutomirski wrote: > On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote: >> On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote: >>> >>> >>> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote: >>>> On 22/06/2023 00:48, Andrii Nakryiko wrote: >>>>>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. >>>>>>> BPF is still a privileged thing. You can't just say that any >>>>>>> unprivileged application should be able to use BPF. That's why BPF >>>>>>> token is about trusting unpriv application in a controlled environment >>>>>>> (production) to not do something crazy. It can be enforced further >>>>>>> through LSM usage, but in a lot of cases, when dealing with internal >>>>>>> production applications it's enough to have a proper application >>>>>>> design and rely on code review process to avoid any negative effects. >>>>>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things. >>>>>> >>>>>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. >>>>> Please see Hao's reply ([0]) about his and Google's (not so rosy) >>>>> experiences with building and using such BPF proxy. We (Meta) >>>>> internally didn't go this route at all and strongly prefer not to. >>>>> There are lots of downsides and complications to having a BPF proxy. >>>>> In the end, this is just shuffling around where the decision about >>>>> trusting a given application with BPF access is being made. BPF proxy >>>>> adds lots of unnecessary logistical, operational, and development >>>>> complexity, but doesn't magically make anything safer. >>>>> >>>>> [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ >>>>> >>>> Apologies for being blunt, but the token approach to me seems to be a >>>> work around providing the right level/classification for a pod/container >>>> in order to say you support unprivileged containers using eBPF. I think >>>> if your container needs to do privileged things it should have and be >>>> classified with the right permissions (privileges) to do what it needs >>>> to do. >>> Bluntness is great. >>> >>> I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is. >> BPF is not "anything else", it's important to understand that BPF is >> inherently not compartmentalizable. And it's vast and generic in its >> capabilities. This changes everything. So your analogies are >> misleading. >> > file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc. They are infinitely extensible. They work in containers. > > What is so special about BPF? > >>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk" >>> >>> That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay.
>>> >>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network" >>> >>> The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port. You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network. >>> >>> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules. >>> >>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF" >>> >>> My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them. >> Can you apply DAC rules to which kernel events BPF program can be run >> on? Can you apply DAC rules to which in-kernel data structures a BPF >> program can look at and make sure that it doesn't access a >> task/socket/etc that "belongs" to some other container/user/etc? > No, of course. > > If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module. It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module. > > We don't give containers special tokens that let them load arbitrary modules. We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules. > > But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers. BPF can learn to do this. > >> Can we limit XDP or AF_XDP BPF programs from seeing and controlling >> network traffic that will be eventually routed to a container that XDP >> program "should not" have access to? Without making everything so slow >> that it's useless? > Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that. Or a vlan or a macvlan or whatever. (I'm assuming XDP can be scoped like this. I'm not that familiar with the details.) > >>> I even *wrote the code*. >> Did you submit it upstream for review and wide discussion? > Yes. > >> Did you >> test it and integrate it with production workloads to prove that your >> solution is actually a viable real-world solution and not a toy? > I did test it. I did not integrate it with production workloads. > >> Writing the code doesn't mean solving the problem. > Of course not. My code was a little step in the right direction. The BPF community was apparently not interested in it. > >>> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts. >> I won't speak on behalf of the entire BPF community, but I'm trying to >> explain that BPF cannot be reasonably sandboxed and has to be >> privileged due to its global nature. And I haven't yet seen any >> realistic counter-proposal to change that. 
And it's not about >> ownership of the BPF map or BPF program, it's way beyond that.. >> > It's really really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created. Agreed. Complete security denial makes development so much easier. In the 1980's we were told that there was no way UNIX could ever be made secure, especially because of IP networking and window systems. It wasn't easy, what with everybody screaming (often literally) about the performance impact and code complexity of every single change, no matter how small. I'm *not* advocating adopting it, but you could look at the Zephyr security model as a worked example of a system similar to BPF that does have a security model. I understand that there are many ways to argue that this won't work for BPF, or that the model has issues of its own, but have a look. https://docs.zephyrproject.org/latest/security/security-overview.html > > If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model. > > I'm saying that I think there *can* be a security model. But until the maintainers start to believe that, there won't be one.
On 6/16/23 12:48 AM, Andrii Nakryiko wrote: > On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote: >> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote: >>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko >>> <andrii.nakryiko@gmail.com> wrote: >>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote: >>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko >>>>> <andrii.nakryiko@gmail.com> wrote: >>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: >>>>>>> >>>>>>> Hi Andrii, >>>>>>> >>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: >>>>>>>> >>>>>>>> ... >>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc. >>>>>>> >>>>>>> Is there a reason for coupling this only with the userns? >>>>>> >>>>>> There is no coupling. Without userns it is at least possible to grant >>>>>> CAP_BPF and other capabilities from init ns. With user namespace that >>>>>> becomes impossible. >>>>> >>>>> But these are not the same: delegate full cap vs delegate an fd mask? >>>> >>>> What FD mask are we talking about here? I don't recall us talking >>>> about any FD masks, so this one is a bit confusing without more >>>> context. >>> >>> Ah err, sorry yes referring to fd token (which I assumed is a mask of >>> allowed operations or something like that). >>> >>> So I want the possibility to delegate the fd token in the init userns. >>> >>>>> >>>>> One can argue unprivileged in init userns is the same privileged in >>>>> nested userns >>>>> Getting to delegate fd in init userns, then in nested ones seems logical... >>>> >>>> Again, sorry, I'm not following. Can you please elaborate what you mean? >>> >>> I mean can we use the fd token in the init user namespace too? not >>> only in the nested user namespaces but in the first one? Sorry I >>> didn't check the code. >>> > > [...] > >>> >>>>> Having the fd or "token" that gives access rights pinned in two >>>>> separate bpffs mounts seems too much, it crosses namespaces (mount, >>>>> userns etc), environments setup by privileged... >>>> >>>> See above, there is nothing namespaceable about BPF itself, and BPF >>>> token as well. If some production setup benefits from pinning one BPF >>>> token in multiple places, I don't see the problem with that. >>>> >>>>> >>>>> I would just make it per bpffs mount and that's it, nothing more. If a >>>>> program wants to bind mount it somewhere else then it's not a bpf >>>>> problem. >>>> >>>> And if some application wants to pin BPF token, why would that be BPF >>>> subsystem's problem as well? >>> >>> The credentials, capabilities, keyring, different namespaces, etc are >>> all attached to the owning user namespace, if the BPF subsystem goes >>> its own way and creates a token to split up CAP_BPF without following >>> that model, then it's definitely a BPF subsystem problem... I don't >>> recommend that. >>> >>> Feels it's going more of a system-wide approach opening BPF >>> functionality where ultimately it clashes with the argument: delegate >>> a subset of BPF functionality to a *trusted* unprivileged application. >>> My reading of delegation is within a container/service hierarchy >>> nothing more. >> >> You're making the exact arguments that Lennart, Aleksa, and I have been >> making in the LSFMM presentation about this topic. It's even recorded: > > Alright, so (I think) I get a pretty good feel now for what the main > concerns are, and why people are trying to push this to be an FS. 
And > it's not so much that BPF token grants bpf() syscall usage to unpriv > (but trusted) workloads or that BPF itself is not namespaceable. The > main worry is that BPF token, once issued, could be > illegally/uncontrollably passed outside of the container, intentionally or > not. And by having this association with mount namespace (through BPF > FS) we automatically limit the sharing to only containers that have access > to that BPF FS. +1 > So I agree that it makes sense to have this mount namespace > association, but I also would like to keep BPF token to be a separate > entity from BPF FS itself, and have the ability to have multiple > different BPF tokens exposed in a single BPF FS instance. I think the > latter is important. > > So how about this slight modification: when a BPF token is created > using BPF_TOKEN_CREATE command, the user has to provide an FD for > "associated" BPF FS instance (superblock). What that does is allow > BPF token to be created with BPF FS and/or mount namespace association > set in stone. After that BPF token can only be pinned in that BPF FS > instance and cannot leave the boundaries of that mount namespace > (specific details to be worked out, this is new area for me, so I'm > sorry if I'm missing nuances). Given bpffs is not a singleton and there can be multiple bpffs instances in a container, couldn't we make the token a special bpffs mount/mode? Something like a single .token file in that mount (for example) which can be opened and the fd then passed along for prog/map creation? And given the multiple mounts, this also allows potentially for multiple tokens? In other words, this is already set up by the container manager when it sets up mounts rather than later, and the regular bpffs instance is sth separate from all that. Meaning, in your container you get the usual bpffs instance and then one or more special bpffs instances as tokens at different paths (and in future they could unlock different subset of bpf functionality for example). Thanks, Daniel
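For illustration, the userspace side of the flow Andrii sketches might look roughly as below. BPF_TOKEN_CREATE and the bpffs FD come from the proposal itself, but the command number and attribute layout here are assumptions, since the API is still under discussion:

#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

#ifndef BPF_TOKEN_CREATE
#define BPF_TOKEN_CREATE 36	/* placeholder value; the real one would
				 * come from the patched UAPI header */
#endif

/* Hypothetical mirror of the proposed attribute: the token is tied at
 * creation time to one bpffs instance, so it cannot outlive or escape
 * that mount namespace. */
struct token_create_attr {
	__u32 flags;
	__u32 bpffs_fd;
};

int bpf_token_create(const char *bpffs_path)
{
	struct token_create_attr attr = {};
	int fs_fd = open(bpffs_path, O_RDONLY | O_DIRECTORY);

	if (fs_fd < 0)
		return -1;
	attr.bpffs_fd = fs_fd;
	/* Returns a token FD that could then be pinned in (only) this
	 * bpffs instance and passed to bpf() commands as token_fd. */
	return syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
}

Under Daniel's variant, the container manager would instead pre-create the special bpffs mount(s), and the workload would just open the token file it finds there.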
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> applications meets the needs of these PODs that need to do >> privileged/bpf things without any tokens. Ultimately you are trusting >> these apps in the same way as if you were granting a token. > > Yes, absolutely. As I mentioned very explicitly, it's a question of > trusting the application. Service vs token is an implementation detail, but > one that has huge implications in how applications are built, > tested, versioned, deployed, etc. So one thing that I don't really get is why such a "trusted application" needs to be run in a user namespace in the first place? If it's trusted, why not simply run it as a privileged container (without the user namespace) and grant it the right system-level capabilities, instead of going to all this trouble just to punch a hole in the user namespace isolation? -Toke
On 6/23/23 5:10 PM, Andy Lutomirski wrote: > On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote: >> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote: >> >>> Hopefully you can see where I'm going with this. And this is just one >>> random tiny example. We can think up tons of other cases to prove BPF >>> is not isolatable to any sort of "container". >> >> No. You have not come up with an example of why BPF is not isolatable >> to a container. You have come up with an example of why binding to a >> sched_switch raw tracepoint does not make sense in a container without >> additional mechanisms to give it well defined functionality and >> appropriate security. One big blocker that makes BPF not isolatable to a container is CPU hardware bugs. There has been plenty of mitigation effort so that the flexibility cannot be abused as a tool, e.g. as discussed in [0], but ultimately it's a cat and mouse game and vendors are also not really transparent. So actual reasonable discussion can be resumed once CPU vendors get their stuff fixed. [0] https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks > Thinking about this some more: > > Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example). The workload is in the container. The tracepoint is global. Kernel memory is global unless something that is trusted and understands the containers is doing the reading. And proxying BPF is a mess. Agree that proxy is a mess for various reasons stated earlier. > So here are a couple of possible solutions: > > (a) Improve BPF maps a bit so that BPF maps work well in containers. It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags. (IIRC my patch series was a decent step in this direction.) Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container. So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data. I don't think it's very practical, meaning the vast majority of applications out there today are tightly coupled BPF code + user space application, and in a lot of cases programs are dynamically created. This would require somehow splitting up parts of your application to run outside the container in hostns and other parts inside the container... for the sake of the mentioned example it's something fairly static, but real-world applications look different and are much more complex. > (b) Make a way to pass a pre-approved program into a container. So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container. Same as above. Programs are in most cases very tightly coupled to the application itself. I'm not sure if the ask is to redesign/implement all the existing user space infra. > I think (a) is better.
In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container. > > For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers. Worst case, sure, but it's not the point. These containers which would receive the tokens are part of your trusted compute base, so it's up to the specific applications and their surrounding infrastructure what problem they solve where, as approved by operators/platform engs for deployment in your cluster. I don't particularly see that there's a performance problem. Andrii specifically mentioned /trusted unprivileged applications/. > And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance. You want *one* XDP program fanning the packets out to the relevant containers. > > If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation. > > --Andy >
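[For illustration, here is a minimal sketch of the fd-sharing half of option (a): a host-side daemon creates a BPF map and hands the fd to a containerized workload over a Unix domain socket via SCM_RIGHTS, so a program attached outside the container can report into a map the container can read. This is not code from the series; the socket path and map layout are invented for illustration, and it assumes libbpf >= 0.7 for bpf_map_create().]

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <bpf/bpf.h>

/* Send one fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
    char dummy = 'x';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);

    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

int main(void)
{
    /* Map created on the host; the BPF program attached outside the
     * container writes its gathered data here. */
    int map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "demo_stats",
                                sizeof(__u32), sizeof(__u64), 1024, NULL);
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int sk = socket(AF_UNIX, SOCK_STREAM, 0);

    if (map_fd < 0 || sk < 0)
        return 1;
    /* Assumed path, bind-mounted into the container. */
    strcpy(addr.sun_path, "/run/bpf-broker.sock");
    if (connect(sk, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        send_fd(sk, map_fd); /* container now holds the map fd */
    close(sk);
    close(map_fd);
    return 0;
}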
On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote: > On 6/23/23 5:10 PM, Andy Lutomirski wrote: >> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote: >>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote: >>> >>>> Hopefully you can see where I'm going with this. And this is just one >>>> random tiny example. We can think up tons of other cases to prove BPF >>>> is not isolatable to any sort of "container". >>> >>> No. You have not come up with an example of why BPF is not isolatable >>> to a container. You have come up with an example of why binding to a >>> sched_switch raw tracepoint does not make sense in a container without >>> additional mechanisms to give it well defined functionality and >>> appropriate security. > > One big blocker for the case of BPF is not isolatable to a container are > CPU hardware bugs. There has been plenty of mitigation effort so that the > flexibility cannot be abused as a tool e.g. discussed in [0], but ultimately > it's a cat and mouse game and vendors are also not really transparent. So > actual reasonable discussion can be resumed once CPU vendors gets their > stuff fixed. > > [0] > https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks > By this standard, shouldn’t we just give up? Let everyone map /dev/mem readonly and stop pretending we can implement any form of access control. Of course, we don’t do this. We try pretty hard to squash bugs and keep programs from doing an end run around OS security. >> Thinking about this some more: >> >> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example). The workload is in the container. The tracepoint is global. Kernel memory is global unless something that is trusted and understands the containers is doing the reading. And proxying BPF is a mess. > > Agree that proxy is a mess for various reasons stated earlier. > >> So here are a couple of possible solutions: >> >> (a) Improve BPF maps a bit so that BPF maps work well in containers. It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags. (IIRC my patch series was a decent step in this direction,) Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container. So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container", and the daemon arranges for the requesting container to have access to the map it needs to get the data. > > I don't think it's very practical, meaning the vast majority of applications > out there today are tightly coupled BPF code + user space application, and in > a lot of cases programs are dynamically created. This would require somehow > splitting up parts of your application to run outside the container in hostns > and other parts inside the container.. for the sake of the mentioned example > it's something fairly static, but real-world applications look different and > are much more complex. > It sounds like you are describing a situation where there is a workload in a container, where the *entire container* is part of the TCB, but the part of the workload that has the explicit right to read all of kernel memory (e.g. 
bpf_probe_read_kernel) is so tightly coupled to the container that no one outside the container wants to audit it. And yet someone still wants to run it in a userns. This is IMO a rather bizarre situation. If I were operating a large fleet, and I had teams developing software to run in a container, I would not want to grant those containers this right without strict controls, and I don’t mean on/off controls. I would want strict auditing of *what exact BPF code* (including source) was run, and why, and who wrote it, and what the intended results are, and what limits access to the results, etc. After all, we’re talking about the right, BY DESIGN, to access PII, payment card information, medical information, information protected by any jurisdiction’s data control rights, etc. Literally everything. This ability, as described, isn’t “the right to use BPF.” It is the right to *read all secrets*, intentionally. (And modify them, with bpf_probe_write_user, possibly subject to some constraints.) If this series was about passing a “may load kernel modules” token around, I think it would get an extremely chilly reception, even though we have module signatures. I don’t see anything about BPF that makes BPF tokens more reasonable unless a real security model is developed first. >> (b) Make a way to pass a pre-approved program into a container. So a daemon outside loads the program and does some new magic to say "make an fd that can beused to attach this particular program to this particular tracepoint" and pass that into the container. > > Same as above. Programs are in most cases very tightly coupled to the > application > itself. I'm not sure if the ask is to redesign/implement all the > existing user > space infra. > >> I think (a) is better. In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container. >> >> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers. > > Worst case, sure, but it's not the point. These containers which would > receive > the tokens are part of your trusted compute base.. so its up to the > specific > applications and their surrounding infrastructure with regards to what > problem > they solve where and approved by operators/platform engs to deploy in > your cluster. > I don't particularly see that there's a performance problem. Andrii > specifically > mentioned /trusted unprivileged applications/. > >> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance. You want *one* XDP program fanning the packets out to the relevant containers. >> >> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation. >> >> --Andy >>
On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote: > On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote: > > If this series was about passing a “may load kernel modules” token > around, I think it would get an extremely chilly reception, even though > we have module signatures. I don’t see anything about BPF that makes > BPF tokens more reasonable unless a real security model is developed > first. > To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace. I'm saying the mechanism should have explicit access control. It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time. BPF, unlike kernel modules, has a verifier. While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks. (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN, presumably. Other hooks would have their own scoping. Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup. Etc.) If new, more restrictive functions are needed, they could be added. Alternatively, people could try a limited form of BPF proxying. It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of functions it uses and what the proposed attachment is to, and make an educated decision. This would need some API changes (maybe), but it seems eminently doable.
On 6/24/23 5:28 PM, Andy Lutomirski wrote: > On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote: >> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote: >> >> If this series was about passing a “may load kernel modules” token >> around, I think it would get an extremely chilly reception, even though >> we have module signatures. I don’t see anything about BPF that makes >> BPF tokens more reasonable unless a real security model is developed >> first. > > To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace. I'm saying the mechanism should have explicit access control. It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time. > > BPF, unlike kernel modules, has a verifier. While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks. > > (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN, presumably. Other hooks would have their own scoping. Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup. Etc.) > > If new, more restrictive functions are needed, they could be added. Wasn't this the idea of the BPF tokens proposal, meaning you could create them with restricted access as you mentioned - allowing an explicit subset of program types to be loaded, subset of helpers/kfuncs, map types, etc.. Given you pass in this token context upon program load-time (resp. map creation), the verifier is then extended for restricted access. For example, see the bpf_token_allow_{cmd,map_type,prog_type}() in this series. The user namespace relation was part of the use cases, but not strictly part of the mechanism itself in this series. With regards to the scoping, are you saying that the current design with the bitmasks in the token create uapi is not flexible enough? If yes, what concrete alternative do you propose? > Alternatively, people could try a limited form of BPF proxying. It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of functions it uses and what the proposed attachment is to, and make an educated decision. This would need some API changes (maybe), but it seems eminently doable. Thinking about this from a k8s environment angle, I think this wouldn't really be practical for various reasons.. you now need to maintain two implementations for your container images which ship BPF: one which loads programs as today, and another one which talks to this proxy if available, then you also need to standardize and support the various loader libraries for this, you need to deal with yet one more component in your cluster which could fail (compared to talking to kernel directly), and being dependent on new proxy functionality becomes similar to waiting for new kernels to hit mainstream; it could potentially take a very long time until production upgrades. What is being proposed here in this regard is less complex given no extra proxy is involved. I would certainly prefer a kernel-based solution. Thanks, Daniel
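[To make the bpf_token_allow_{cmd,map_type,prog_type}() discussion concrete, a hedged sketch of how a narrowly scoped token might be created and consumed under this series. The BPF_TOKEN_CREATE command and the attr fields (token_create.allowed_cmds, token_create.allowed_map_types, map_token_fd) follow the proposal under discussion and may not match the final uapi; this only compiles against headers from the series.]

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

int main(void)
{
    union bpf_attr attr;
    int token_fd, map_fd;

    /* Privileged side: a token permitting only BPF_MAP_CREATE of
     * ARRAY maps (one bit per command/type, per this series). */
    memset(&attr, 0, sizeof(attr));
    attr.token_create.allowed_cmds = 1ULL << BPF_MAP_CREATE;
    attr.token_create.allowed_map_types = 1ULL << BPF_MAP_TYPE_ARRAY;
    token_fd = sys_bpf(BPF_TOKEN_CREATE, &attr, sizeof(attr));
    if (token_fd < 0)
        return 1;

    /* ... token_fd handed to the workload (SCM_RIGHTS, BPF FS pin) ... */

    /* Unprivileged side: the map-create request carries the token fd,
     * and the kernel checks the request against the token's masks. */
    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_ARRAY;
    attr.key_size = sizeof(__u32);
    attr.value_size = sizeof(__u64);
    attr.max_entries = 64;
    attr.map_token_fd = token_fd;
    map_fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
    return map_fd < 0;
}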
On Thu, Jun 22, 2023 at 6:03 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote: > > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote: > > > > For CAP_BPF too broad. It is broad, yes. If you have good ideas how to > > break it down some more -- please propose. But this is all orthogonal, > > because the blocking problem is fundamental incompatibility of user > > namespaces (and their implied isolation and sandboxing of workloads) > > and BPF functionality, which is global by its very nature. The latter > > is unavoidable in principle. > > How, exactly, is BPF global by its very nature? > > The *implementation* has some issues with globalness. Much of it should be fixable. > bpf_probe_read_kernel() is widely used and required for real-world applications. It's global by its nature and in principle not restrictable. We can say that we'll just disable applications that use bpf_probe_read_kernel(), but the goal is to enable applications that are *practically useful*, not just some restricted set of programs that are provably contained. > > > > No matter how much you break down CAP_BPF, you can't enforce that BPF > > program won't interfere with applications in other containers. Or that > > it won't "spy" on them. It's just not what BPF can enforce in > > principle. > > The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do. There are bugs -- I get that. There are helper functions that are fundamentally global. But, in the absence of verifier bugs, BPF has actual boundaries to its functionality. looking at your other replies, I think you realized yourself that there are valid use cases where it's impossible to statically validate boundaries > > > > > So that comes back down to a question of trust and then controlled > > delegation of BPF functionality. You trust workload with BPF usage > > because you reviewed the BPF code, workload, testing, etc? Grant BPF > > token and let that container use limited subset of BPF. Employ BPF LSM > > to further restrict it beyond what BPF token can control. > > > > You cannot trust an application to not do something harmful? You > > shouldn't grant it either CAP_BPF in init namespace, nor BPF token in > > user namespace. That's it. Pick your poison. > > I think what's lost here is hardening vs restricting intended functionality. > > We have access control to restrict intended functionality. We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it. > > BPF needs hardening -- this is well established. Right now, this is accomplished by restricting it to global root (effectively). It should have access controls, too, but it doesn't. > > > > > But all this cannot be mechanically decided or enforced. There has to > > be some humans involved in making these decisions. Kernel's job is to > > provide building blocks to grant and control BPF functionality to the > > extent that it is technically possible. > > > > Exactly. And it DOES NOT. bpf maps, etc do not have sensible access controls. Things that should not be global are global. I'm saying the kernel should fix THAT. Once it's in a state that it's at least credible to allow BPF in a user namespace, than come up with a way to allow it. > > > As for "something to isolate the pinned maps/progs by different apps > > (why not DAC rules?)", there is no such thing, as I've explained > > already. 
> > I can install sched_switch raw_tracepoint BPF program (if I'm allowed > > to), and that program has system-wide observability. It cannot be > > bound to an application. > Great, a real example! > Either: > (a) don't run this in a container. Have a service for the container to request the help of this program. > (b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container). > > then what do we do when we switch from process A in container > > X to process B in container Y? Is that event belonging to container X? > > Or container Y? > I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality. If you haven't defined what your program is even supposed to do in a container, don't run it there. I think you are missing the point I'm making. A specific BPF program that will use sched_switch is doing the correct and right thing (for whatever that means in a specific case). We as humans designed, implemented, validated, reviewed it and are confident enough (as much as we can be with software) that it does the right thing. It doesn't try to spy on things, doesn't try to disrupt things. We know this as humans thanks to our internal development process. But this is not *provable* in a mechanical sense such that the kernel can validate and enforce this. And yet it's a practically useful application which we'd like to be able to launch from inside the container without rearchitecting and rewriting the entire world and proxying everything through some external root service. > > Hopefully you can see where I'm going with this. And this is just one > > random tiny example. We can think up tons of other cases to prove BPF > > is not isolatable to any sort of "container". > No. You have not come up with an example of why BPF is not isolatable to a container. You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security. > Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things. They're both under the BPF umbrella. They're not the same thing. I'm not conflating things. Thinking about BPF maps and BPF programs in isolation from them being attached somewhere in the kernel and doing actual work is not useful. It's the end-to-end functionality, including attaching and running BPF programs, that matters. Pedantically drawing the line at the BPF program load step and saying "this is BPF and everything else is not BPF" isn't really helpful. No one cares about just loading and validating BPF programs. Developers care about attaching and running them, that's what it all is about. > Passing a token into a container that allow that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter. It makes no sense.
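[For reference, the kind of program being argued about here is tiny to write. A minimal sketch (program and message text invented for illustration) of a raw_tracepoint program on sched_switch, which by construction observes every context switch on the host, whichever container the tasks belong to:]

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

/* Raw tracepoint args for sched_switch are (preempt, prev, next). */
SEC("raw_tracepoint/sched_switch")
int count_switch(struct bpf_raw_tracepoint_args *ctx)
{
    struct task_struct *next = (struct task_struct *)ctx->args[2];
    int pid = BPF_CORE_READ(next, pid);

    /* Sees tasks from *every* namespace; nothing here scopes the
     * events to the container that loaded the program. */
    bpf_printk("switch to pid %d", pid);
    return 0;
}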
On Thu, Jun 22, 2023 at 8:29 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote: > > On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote: > >> > >> > >> > >> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote: > >> > On 22/06/2023 00:48, Andrii Nakryiko wrote: > >> >> > >> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary. > >> >>>> BPF is still a privileged thing. You can't just say that any > >> >>>> unprivileged application should be able to use BPF. That's why BPF > >> >>>> token is about trusting unpriv application in a controlled environment > >> >>>> (production) to not do something crazy. It can be enforced further > >> >>>> through LSM usage, but in a lot of cases, when dealing with internal > >> >>>> production applications it's enough to have a proper application > >> >>>> design and rely on code review process to avoid any negative effects. > >> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things. > >> >>> > >> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container. > >> >> Please see Hao's reply ([0]) about his and Google's (not so rosy) > >> >> experiences with building and using such BPF proxy. We (Meta) > >> >> internally didn't go this route at all and strongly prefer not to. > >> >> There are lots of downsides and complications to having a BPF proxy. > >> >> In the end, this is just shuffling around where the decision about > >> >> trusting a given application with BPF access is being made. BPF proxy > >> >> adds lots of unnecessary logistical, operational, and development > >> >> complexity, but doesn't magically make anything safer. > >> >> > >> >> [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/ > >> >> > >> > Apologies for being blunt, but the token approach to me seems to be a > >> > work around providing the right level/classification for a pod/container > >> > in order to say you support unprivileged containers using eBPF. I think > >> > if your container needs to do privileged things it should have and be > >> > classified with the right permissions (privileges) to do what it needs > >> > to do. > >> > >> Bluntness is great. > >> > >> I think that this whole level/classification thing is utterly wrong. Replace "BPF" with basically anything else, and you'll see how absurd it is. > > > > BPF is not "anything else", it's important to understand that BPF is > > inherently not compratmentalizable. And it's vast and generic in its > > capabilities. This changes everything. So your analogies are > > misleading. > > > > file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc. They are infinitely extensible. They work in containers. > > What is so special about BPF? 
A socket has a well-defined and constrained interface that defines what you can do with it (send and receive bytes, in a controlled fashion), while BPF programs are intentionally allowed to have almost arbitrarily complex control flow *controlled by the user*, can combine dozens if not hundreds of "building blocks" (BPF helpers, kfuncs, various BPF maps, etc), and can be activated at various points deep in the kernel (running that custom user-provided code in kernel space). I'd say that yeah, BPF is on another level as far as genericity goes, compared to other interfaces. And that's BPF's goal and appeal, nothing wrong with it. But I do think BPF and sockets, files, things in /proc, etc are pretty different in terms of how they can be proved and enforced to be sandboxed. > > >> > >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk" > >> > >> That's very 1990's. Maybe 1980's. Of *course* giving access to a filesystem has some inherent security exposure. So we can give containers access to *different* filesystems. Or we can use ACLs. Or MAC policy. Or whatever. We have many solutions, none of which are perfect, and we're doing okay. > >> > >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network" > >> > >> The network is a big deal. For some reason, it's cool these days to treat TCP as highly privileged. You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port. You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network. > >> > >> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules. > >> > >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF" > >> > >> My response is: what's wrong with BPF? BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them. > > > > Can you apply DAC rules to which kernel events BPF program can be run > > on? Can you apply DAC rules to which in-kernel data structures a BPF > > program can look at and make sure that it doesn't access a > > task/socket/etc that "belongs" to some other container/user/etc? > > No, of course. > > If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module. It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module. > > We don't give containers special tokens that let them load arbitrary modules. We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules. > > But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers. BPF can learn to do this. > > > > Can we limit XDP or AF_XDP BPF programs from seeing and controlling > > network traffic that will be eventually routed to a container that XDP > > program "should not" have access to? 
Without making everything so slow > > that it's useless? > Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that. Or a vlan or a macvlan or whatever. (I'm assuming XDP can be scoped like this. I'm not that familiar with the details.) > >> I even *wrote the code*. > > Did you submit it upstream for review and wide discussion? > Yes. > > Did you > > test it and integrate it with production workloads to prove that your > > solution is actually a viable real-world solution and not a toy? > I did test it. I did not integrate it with production workloads. > Real-world use cases are the ultimate test of APIs and features. No matter how brilliant and elegant the solution is, if it doesn't work with real-world applications, it's pretty useless. It's not that hard to allow only a very limited and very restrictive subset of BPF to be loaded and attached from containers without privileged permissions. But the point is to find a solution that works for complicated (and sometimes very messy) real applications that were validated by humans (to the best of their abilities), but can't be proven to be contained within some container. > Writing the code doesn't mean solving the problem. > Of course not. My code was a little step in the right direction. The BPF community was apparently not interested in it. > >> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts. > > I won't speak on behalf of the entire BPF community, but I'm trying to > explain that BPF cannot be reasonably sandboxed and has to be > privileged due to its global nature. And I haven't yet seen any > realistic counter-proposal to change that. And it's not about > ownership of the BPF map or BPF program, it's way beyond that.. > It's really really hard to have a useful discussion about a security model when we have, as what appears to be an axiom, that a security model can't be created. > If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model. > I'm saying that I think there *can* be a security model. But until the maintainers start to believe that, there won't be one. See above, whatever security model you have in mind, it should be workable with real-world applications. Building some elegant system that will work for just a (rather small) subset of use cases isn't appealing.
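[The per-interface scoping suggested above is how XDP attachment already works mechanically: the attach point is a single ifindex. A minimal user-space sketch (object path and program name invented) of attaching an XDP program to one specific interface, e.g. a macvlan handed to the container, using libbpf's bpf_xdp_attach():]

#include <net/if.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>

/* Attach the "xdp_pass" program found in obj_path to ifname only. */
int attach_xdp_to(const char *ifname, const char *obj_path)
{
    int ifindex = if_nametoindex(ifname);
    struct bpf_object *obj = bpf_object__open_file(obj_path, NULL);
    struct bpf_program *prog;

    if (!ifindex || !obj || bpf_object__load(obj))
        return -1;
    prog = bpf_object__find_program_by_name(obj, "xdp_pass");
    if (!prog)
        return -1;
    /* The attachment is per-ifindex: granting a container a macvlan
     * and rights over only that ifindex is the scoping idea. */
    return bpf_xdp_attach(ifindex, bpf_program__fd(prog), 0, NULL);
}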
On Fri, Jun 23, 2023 at 3:18 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 6/16/23 12:48 AM, Andrii Nakryiko wrote: > > On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote: > >> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote: > >>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko > >>> <andrii.nakryiko@gmail.com> wrote: > >>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote: > >>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko > >>>>> <andrii.nakryiko@gmail.com> wrote: > >>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote: > >>>>>>> > >>>>>>> Hi Andrii, > >>>>>>> > >>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote: > >>>>>>>> > >>>>>>>> ... > >>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc. > >>>>>>> > >>>>>>> Is there a reason for coupling this only with the userns? > >>>>>> > >>>>>> There is no coupling. Without userns it is at least possible to grant > >>>>>> CAP_BPF and other capabilities from init ns. With user namespace that > >>>>>> becomes impossible. > >>>>> > >>>>> But these are not the same: delegate full cap vs delegate an fd mask? > >>>> > >>>> What FD mask are we talking about here? I don't recall us talking > >>>> about any FD masks, so this one is a bit confusing without more > >>>> context. > >>> > >>> Ah err, sorry yes referring to fd token (which I assumed is a mask of > >>> allowed operations or something like that). > >>> > >>> So I want the possibility to delegate the fd token in the init userns. > >>> > >>>>> > >>>>> One can argue unprivileged in init userns is the same privileged in > >>>>> nested userns > >>>>> Getting to delegate fd in init userns, then in nested ones seems logical... > >>>> > >>>> Again, sorry, I'm not following. Can you please elaborate what you mean? > >>> > >>> I mean can we use the fd token in the init user namespace too? not > >>> only in the nested user namespaces but in the first one? Sorry I > >>> didn't check the code. > >>> > > > > [...] > > > >>> > >>>>> Having the fd or "token" that gives access rights pinned in two > >>>>> separate bpffs mounts seems too much, it crosses namespaces (mount, > >>>>> userns etc), environments setup by privileged... > >>>> > >>>> See above, there is nothing namespaceable about BPF itself, and BPF > >>>> token as well. If some production setup benefits from pinning one BPF > >>>> token in multiple places, I don't see the problem with that. > >>>> > >>>>> > >>>>> I would just make it per bpffs mount and that's it, nothing more. If a > >>>>> program wants to bind mount it somewhere else then it's not a bpf > >>>>> problem. > >>>> > >>>> And if some application wants to pin BPF token, why would that be BPF > >>>> subsystem's problem as well? > >>> > >>> The credentials, capabilities, keyring, different namespaces, etc are > >>> all attached to the owning user namespace, if the BPF subsystem goes > >>> its own way and creates a token to split up CAP_BPF without following > >>> that model, then it's definitely a BPF subsystem problem... I don't > >>> recommend that. > >>> > >>> Feels it's going more of a system-wide approach opening BPF > >>> functionality where ultimately it clashes with the argument: delegate > >>> a subset of BPF functionality to a *trusted* unprivileged application. > >>> My reading of delegation is within a container/service hierarchy > >>> nothing more. 
> >> You're making the exact arguments that Lennart, Aleksa, and I have been > >> making in the LSFMM presentation about this topic. It's even recorded: > > > > Alright, so (I think) I get a pretty good feel now for what the main > > concerns are, and why people are trying to push this to be an FS. And > > it's not so much that BPF token grants bpf() syscall usage to unpriv > > (but trusted) workloads or that BPF itself is not namespaceable. The > > main worry is that BPF token, once issued, could be > > illegally/uncontrollably passed outside of the container, intentionally or > > not. And by having this association with mount namespace (through BPF > > FS) we automatically limit the sharing to only the container that has access > > to that BPF FS. > > +1 > > So I agree that it makes sense to have this mount namespace > > association, but I also would like to keep BPF token to be a separate > > entity from BPF FS itself, and have the ability to have multiple > > different BPF tokens exposed in a single BPF FS instance. I think the > > latter is important. > > So how about this slight modification: when a BPF token is created > > using BPF_TOKEN_CREATE command, the user has to provide an FD for > > "associated" BPF FS instance (superblock). What that does is allow > > BPF token to be created with BPF FS and/or mount namespace association > > set in stone. After that BPF token can only be pinned in that BPF FS > > instance and cannot leave the boundaries of that mount namespace > > (specific details to be worked out, this is new area for me, so I'm > > sorry if I'm missing nuances). > Given bpffs is not a singleton and there can be multiple bpffs instances > in a container, couldn't we make the token a special bpffs mount/mode? > Something like single .token file in that mount (for example) which can > be opened and the fd then passed along for prog/map creation? And given > the multiple mounts, this also allows potentially for multiple tokens? > In other words, this is already set up by the container manager when it > sets up mounts rather than later, and the regular bpffs instance is sth > separate from all that. Meaning, in your container you get the usual > bpffs instance and then one or more special bpffs instances as tokens > at different paths (and in future they could unlock different subset of > bpf functionality for example). Just from a technical point of view we could do that. But I see a lot of value in keeping BPF token creation as part of BPF syscall and its API. And the main issue, I believe, was not allowing BPF token to escape the intended container, which should be more than covered by BPF_TOKEN_CREATE pinning a token into the provided BPF FS instance and not allowing it to be repinned after that. > > Thanks, > Daniel
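[A hedged sketch of the modification proposed above, with the token created against a specific BPF FS instance and pinned there so it cannot leave that mount namespace. The attr field (token_create.bpffs_fd) and the pin path are assumptions layered on the discussion, not settled uapi:]

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

int main(void)
{
    union bpf_attr attr;
    int bpffs_fd, token_fd;

    /* BPF FS instance set up by the container manager for this
     * container (path is an assumption). */
    bpffs_fd = open("/sys/fs/bpf/container-a", O_RDONLY);
    if (bpffs_fd < 0)
        return 1;

    /* The token is tied to this superblock at creation time. */
    memset(&attr, 0, sizeof(attr));
    attr.token_create.bpffs_fd = bpffs_fd;
    token_fd = syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
    if (token_fd < 0)
        return 1;

    /* Pinning only works into the associated instance; repinning
     * elsewhere is refused, so the token cannot escape the mount
     * namespace it was created for. */
    memset(&attr, 0, sizeof(attr));
    attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/container-a/token";
    attr.bpf_fd = token_fd;
    return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr)) != 0;
}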
On Fri, Jun 23, 2023 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > >> applications meets the needs of these PODs that need to do > >> privileged/bpf things without any tokens. Ultimately you are trusting > >> these apps in the same way as if you were granting a token. > > > > Yes, absolutely. As I mentioned very explicitly, it's the question of > > trusting application. Service vs token is implementation details, but > > the one that has huge implications in how applications are built, > > tested, versioned, deployed, etc. > > So one thing that I don't really get is why such a "trusted application" > needs to be run in a user namespace in the first place? If it's trusted, > why not simply run it as a privileged container (without the user > namespace) and grant it the right system-level capabilities, instead of > going to all this trouble just to punch a hole in the user namespace > isolation? Because it's still useful to provide isolation that user namespace provides in all other aspects besides BPF usage. The fact that it's a trusted application doesn't mean that bugs don't happen, or that some action that was not intended might be attempted (due to a bug, some deep unintended library "feature", or just because someone didn't anticipate some interaction). Trusted here means we believe our BPF usage is not going to spy on sensitive data, or attempt to disrupt other workloads, because of design and code reviews, and we intend to maintain that property. But people are still involved, of course, and bugs do happen. We'd like to get as much protection as possible, and that's what the user namespace is offering. For BPF-side of things, we have to trust the process because there is no technical solution. Running outside the user namespace we also don't have any guarantees about BPF. We just have even less protection in all other aspects outside of BPF. We are trying to improve our story with user namespace to mitigate what's mitigatable. > > -Toke >
On Sat, Jun 24, 2023 at 7:00 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote: > > On 6/23/23 5:10 PM, Andy Lutomirski wrote: > >> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote: > >>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote: > >>> > >>>> Hopefully you can see where I'm going with this. And this is just one > >>>> random tiny example. We can think up tons of other cases to prove BPF > >>>> is not isolatable to any sort of "container". > >>> > >>> No. You have not come up with an example of why BPF is not isolatable > >>> to a container. You have come up with an example of why binding to a > >>> sched_switch raw tracepoint does not make sense in a container without > >>> additional mechanisms to give it well defined functionality and > >>> appropriate security. > > > > One big blocker for the case of BPF is not isolatable to a container are > > CPU hardware bugs. There has been plenty of mitigation effort so that the > > flexibility cannot be abused as a tool e.g. discussed in [0], but ultimately > > it's a cat and mouse game and vendors are also not really transparent. So > > actual reasonable discussion can be resumed once CPU vendors gets their > > stuff fixed. > > > > [0] > > https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks > > > > By this standard, shouldn’t we just give up? Let everyone map /dev/mem readonly and stop pretending we can implement any form of access control. > > Of course, we don’t do this. We try pretty hard to squash bugs and keep programs from doing an end run around OS security. > > >> Thinking about this some more: > >> > >> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example). The workload is in the container. The tracepoint is global. Kernel memory is global unless something that is trusted and understands the containers is doing the reading. And proxying BPF is a mess. > > > > Agree that proxy is a mess for various reasons stated earlier. > > > >> So here are a couple of possible solutions: > >> > >> (a) Improve BPF maps a bit so that BPF maps work well in containers. It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags. (IIRC my patch series was a decent step in this direction,) Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container. So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container", and the daemon arranges for the requesting container to have access to the map it needs to get the data. > > > > I don't think it's very practical, meaning the vast majority of applications > > out there today are tightly coupled BPF code + user space application, and in > > a lot of cases programs are dynamically created. This would require somehow > > splitting up parts of your application to run outside the container in hostns > > and other parts inside the container.. for the sake of the mentioned example > > it's something fairly static, but real-world applications look different and > > are much more complex. 
> > It sounds like you are describing a situation where there is a workload in a container, where the *entire container* is part of the TCB, but the part of the workload that has the explicit right to read all of kernel memory (e.g. bpf_probe_read_kernel) is so tightly coupled to the container that no one outside the container wants to audit it. > > And yet someone still wants to run it in a userns. > Yes, to get all the other benefits of userns. Yes, BPF isolation cannot be enforced and we rely on a human-driven process to decide whether it's ok to run BPF inside each specific container. But why can't we also get all the other benefits of userns outside of BPF usage? BPF parts are critical for such applications, but they also normally have a huge user-space part, and use large common libraries, so there is a lot of benefit to having as much userns-provided isolation as possible. > This is IMO a rather bizarre situation. > > If I were operating a large fleet, and I had teams developing software to run in a container, I would not want to grant those containers this right without strict controls, and I don’t mean on/off controls. I would want strict auditing of *what exact BPF code* (including source) was run, and why, and who wrote it, and what the intended results are, and what limits access to the results, etc. After all, we’re talking about the right, BY DESIGN, to access PII, payment card information, medical information, information protected by any jurisdiction’s data control rights, etc. Literally everything. This ability, as described, isn’t “the right to use BPF.” It is the right to *read all secrets*, intentionally. (And modify them, with bpf_probe_write_user, possibly subject to some constraints.) What makes you think this is not how it's actually done in practice already (except right now we don't have BPF token, so it's all-or-nothing, userns or not, root or not, which is overall worse than what we'll get with BPF token + userns)? Audit, code review, proper development practices. Then discussions and reviews between team running container manager and team with BPF-based workload to make decisions whether it's safe to allow BPF access (and to what degree) and how teams will maintain privacy and safety obligations. > > If this series was about passing a “may load kernel modules” token around, I think it would get an extremely chilly reception, even though we have module signatures. I don’t see anything about BPF that makes BPF tokens more reasonable unless a real security model is developed first. If we had dozens of teams developing and loading/unloading their custom kernel modules all the time, it might not have sounded so ridiculous? > > >> (b) Make a way to pass a pre-approved program into a container. So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container. > > > > Same as above. Programs are in most cases very tightly coupled to the > > application > > itself. I'm not sure if the ask is to redesign/implement all the > > existing user > > space infra. > > > >> I think (a) is better. In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container. 
> >> > >> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers. > > > > Worst case, sure, but it's not the point. These containers which would > > receive > > the tokens are part of your trusted compute base.. so its up to the > > specific > > applications and their surrounding infrastructure with regards to what > > problem > > they solve where and approved by operators/platform engs to deploy in > > your cluster. > > I don't particularly see that there's a performance problem. Andrii > > specifically > > mentioned /trusted unprivileged applications/. Yep, performance is not why this is being done. > > > >> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance. You want *one* XDP program fanning the packets out to the relevant containers. > >> > >> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation. > >> > >> --Andy > >>
On Sat, Jun 24, 2023 at 5:28 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote: > > On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote: > > > > > If this series was about passing a “may load kernel modules” token > > around, I think it would get an extremely chilly reception, even though > > we have module signatures. I don’t see anything about BPF that makes > > BPF tokens more reasonable unless a real security model is developed > > first. > > > > To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace. I'm saying the mechanism should have explicit access control. It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time. > > BPF, unlike kernel modules, has a verifier. While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks. > > (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN, presumably. Other hooks would have their own scoping. Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup. Etc.) > > If new, more restrictive functions are needed, they could be added. > This seems to align with BPF fd/token delegation. I asked in another thread if more context/policies could be provided from user space when configuring the fd and the answer: it can be on top as a follow up... The user namespace is just one single use case of many, also confirmed in this reply [0]. Getting it to work in init userns should be the first logical step anyway, then once you have an fd you can delegate it or pass it around to children that create nested user namespaces, etc as it is currently done within container managers when they set up the environments including the uid mapping... and of course there should be some sort of mechanism to ensure that the delegated fd comes, say, from a parent user namespace before using it and deny any cross-namespace usage... > Alternatively, people could try a limited form of BPF proxying. It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of functions it uses and what the proposed attachment is to, and make an educated decision. This would need some API changes (maybe), but it seems eminently doable. > Even a *limited* BPF proxying seems more in the opposite direction of what you are suggesting above? If I have an fd or the bpffs mount with a token properly set up by the manager I can directly use it inside my containers, load small bpf programs without talking to another external API of another container... I assume the manager passed me the rights or already pre-approved the operation... 
Of course there is also the case of approving the attachment of bpf programs without passing an fd/token, which I assume is your point (or, in other words, denying it), and that makes perfect sense indeed. Then yes: an outside daemon could do this; systemd / container managers etc, with the help of LSMs, could *deny* attachment of BPF programs without any external API changes (they already support LSMs). IIRC there is already a hook as part of the bpf() syscall to restrict some program types, so future cases of bpf token should add in-kernel LSM + bpf-lsm hooks, ensure they are properly called with the full context, and restrict further... So for the "limited form of BPF proxying... to approve attachment...", I think fd delegation of the bpffs mount (which requires privileges to set up), with in-kernel LSM hooks on top to tighten this up, is the way to go. [0] https://lore.kernel.org/bpf/CAEf4BzbjGBY2=XGmTBWX3Vrgkc7h0FRQMTbB-SeKEf28h6OhAQ@mail.gmail.com/
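[The deny-via-LSM idea maps onto the existing security hook that covers every bpf() syscall invocation. A minimal BPF LSM sketch; the policy here is deliberately trivial and invented, a real one would key off task or cgroup identity and the attach target:]

#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Runs on the LSM hook for the bpf() syscall. */
SEC("lsm/bpf")
int BPF_PROG(restrict_bpf, int cmd, union bpf_attr *attr, unsigned int size)
{
    /* Deny program attachment outright as a placeholder policy. */
    if (cmd == BPF_PROG_ATTACH || cmd == BPF_LINK_CREATE)
        return -EPERM;
    return 0;
}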
On Mon, Jun 26, 2023, at 8:23 AM, Daniel Borkmann wrote: > On 6/24/23 5:28 PM, Andy Lutomirski wrote: >> On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote: >>> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote: >>> >>> If this series was about passing a “may load kernel modules” token >>> around, I think it would get an extremely chilly reception, even though >>> we have module signatures. I don’t see anything about BPF that makes >>> BPF tokens more reasonable unless a real security model is developed >>> first. >> >> To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace. I'm saying the mechanism should have explicit access control. It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time. >> >> BPF, unlike kernel modules, has a verifier. While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks. >> >> (The hooks wouldn't just be a list, presumably -- premission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN, presumably. Other hooks would have their own scoping. Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup. Etc.) >> >> If new, more restrictive functions are needed, they could be added. > > Wasn't this the idea of the BPF tokens proposal, meaning you could > create them with > restricted access as you mentioned - allowing an explicit subset of > program types to > be loaded, subset of helpers/kfuncs, map types, etc.. Given you pass in > this token > context upon program load-time (resp. map creation), the verifier is > then extended > for restricted access. For example, see the > bpf_token_allow_{cmd,map_type,prog_type}() > in this series. The user namespace relation was part of the use cases, > but not strictly > part of the mechanism itself in this series. Hmm. It's very coarse grained. Also, the bpf() attach API seems to be largely (completely?) missing what I would expect to be basic access controls on the things being attached to. For example, the whole cgroup_bpf_prog_attach() path seems to be entirely missing any checks as to whether its caller has any particular permission over the cgroup in question. It doesn't even check whether the cgroup is being accessed from the current userns (i.e. whether the fd refers to a struct file with f_path.mnt belonging to the current userns). So the API in this patchset has no way to restrict permission to attach to cgroups to only apply to cgroups belonging to the container. > > With regards to the scoping, are you saying that the current design > with the bitmasks > in the token create uapi is not flexible enough? If yes, what concrete > alternative do > you propose? > >> Alternatively, people could try a limited form of BPF proxying. It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of function it uses and what the proposed attachment is to, and make an educated decision. This would need some API changes (maybe), but it seems eminently doable. > > Thinking about this from an k8s environment angle, I think this > wouldn't really be > practical for various reasons.. 
you now need to maintain two > implementations for your > container images which ship BPF: one which loads programs as today, and > another one > which talks to this proxy if available, This seems fairly trivially solvable. Agree on an API, say using UNIX sockets to /var/run/bpfd/whatever.socket. (Or maybe /var/lib? I’m not sure there’s universal agreement on where things like this go.) The exact same API works uncontained (bpfd running, probably socket-activated) from a binary in the system and as a bind-mount from outside. I don’t know k8s well at all, but it looks like hostPath can do exactly this. Off the top of my head, I don’t know whether systemd’s .socket can be configured the right way so the same configuration would work contained and uncontained. One could certainly work around *that* by having two different paths tried in succession, but that seems a bit silly. This actually seems easier than supplying bpf tokens to a container. > then you also need to > standardize and support > the various loader libraries for this, you need to deal with yet one > more component > in your cluster which could fail (compared to talking to kernel > directly), and being > dependent on new proxy functionality becomes similar to waiting > for new kernels > to hit mainstream, it could potentially take a very long time until > production upgrades. > What is being proposed here in this regard is less complex given no > extra proxy is > involved. I would certainly prefer a kernel-based solution. A userspace solution makes it easy to apply some kind of flexible approval and audit policy to the BPF program. I can imagine all kinds of ways that a fleet operator might want to control what can run, and trying to stick it in the kernel seems rather complex and awkward to customize. I suppose a bpf token could be set up to call out to its creator for permission to load a program, which would involve a different set of tradeoffs.
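[A container-side sketch of the socket exchange described above. The bpfd daemon, its socket path, and the one-line request protocol are all invented for illustration; only the SCM_RIGHTS fd-passing is standard. The client asks for a pre-approved program and receives the loaded program fd back:]

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>

/* Receive one fd over a connected AF_UNIX socket. */
static int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    struct cmsghdr *c;
    int fd;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;
    c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(c), sizeof(fd));
    return fd;
}

int main(void)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int sk = socket(AF_UNIX, SOCK_STREAM, 0);
    const char req[] = "load:sched_stats"; /* invented protocol */
    int prog_fd;

    if (sk < 0)
        return 1;
    strcpy(addr.sun_path, "/var/run/bpfd/whatever.socket");
    if (connect(sk, (struct sockaddr *)&addr, sizeof(addr)))
        return 1;
    write(sk, req, sizeof(req));
    /* bpfd audits/approves, loads the program, returns its fd. */
    prog_fd = recv_fd(sk);
    return prog_fd < 0;
}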
On Mon, Jun 26, 2023, at 3:08 PM, Andrii Nakryiko wrote: > On Fri, Jun 23, 2023 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> >> applications meets the needs of these PODs that need to do >> >> privileged/bpf things without any tokens. Ultimately you are trusting >> >> these apps in the same way as if you were granting a token. >> > >> > Yes, absolutely. As I mentioned very explicitly, it's the question of >> > trusting application. Service vs token is implementation details, but >> > the one that has huge implications in how applications are built, >> > tested, versioned, deployed, etc. >> >> So one thing that I don't really get is why such a "trusted application" >> needs to be run in a user namespace in the first place? If it's trusted, >> why not simply run it as a privileged container (without the user >> namespace) and grant it the right system-level capabilities, instead of >> going to all this trouble just to punch a hole in the user namespace >> isolation? > > Because it's still useful to provide isolation that user namespace > provides in all other aspects besides BPF usage. > > The fact that it's a trusted application doesn't mean that bugs don't > happen, or that some action that was not intended might be attempted > (due to a bug, some deep unintended library "feature", or just because > someone didn't anticipate some interaction). > > Trusted here means we believe our BPF usage is not going to spy on > sensitive data, or attempt to disrupt other workloads, because of > design and code reviews, and we intend to maintain that property. But > people are still involved, of course, and bugs do happen. We'd like to > get as much protection as possible, and that's what the user namespace > is offering. > I'm wondering if your approach makes sense for Meta but maybe not outside Meta. I think Meta is a bit unusual in that it operates a huge fleet, but the developers of the software in that fleet are a fairly tight group. (I'm speculating here. I don't know much about what goes on inside Meta, obviously.) Concretely, you say "we believe our BPF usage is not going to spy on sensitive data". Who is this "we"? The kernel developers? The people developing the BPF programs? The people setting policy for the fleet? The people creating container images that want to use BPF and run within the fleet? Are these all the same "we"? For a company with actual outside tenants, or a company that needs to comply with various privacy rules for some, but not all, of its applications, there are a lot of "we"s involved. Some group develops software (or this is outsourced -- the BPF maintainership is essentially within Meta, after all). Some group administers the fleet. Some group develops BPF programs (or downloads them from outside and hopefully vets them). Some group builds container images that want to use those programs. Some group deploys these images via kubernetes or whatever. Some group prepares reports that say that certain services offered comply with PCI or HIPAA or FedRAMP or GDPR or whatever. They're not all the same people. Obviously bugs exist and mistakes happen. But, at the end of the day, someone is going to read a BPF program (or a kernel module, or whatever) and take some degree of responsibility for saying "I read this thing, and I approve its use in a certain context". And then *that permission* should be granted. 
With your patchset as it is, the permission granted is not "run this program I approved" but rather "read all kernel memory". And I don't think that will fly with a lot of potential users. > For BPF-side of things, we have to trust the process because there is > no technical solution. Running outside the user namespace we also > don't have any guarantees about BPF. We just have even less protection > in all other aspects outside of BPF. We are trying to improve our > story with user namespace to mitigate what's mitigatable. But there *are* technical solutions. At least two broad types, as I've been trying to say. 1. Stronger and more flexible controls as to which specific programs can be loaded and run. The people doing the trusting may very well want to trust specific things (and audit which things they've trusted, etc.) 2. Stronger and more flexible controls as to what programs can do. Right now, bpf() can attach to essentially any cgroup or tracepoint if it can attach to any at all. Programs can access all kernel memory (because alternatives to bpf_probe_read_kernel() aren't really available, and there is no incentive right now to add them, because there isn't even a way AFAIK to turn off bpf_probe_read_kernel()). Progress on either one of these could go a long way.
On Tue, Jul 4, 2023, at 1:48 PM, Andy Lutomirski wrote: > On Mon, Jun 26, 2023, at 8:23 AM, Daniel Borkmann wrote: >> On 6/24/23 5:28 PM, Andy Lutomirski wrote: >> >> Wasn't this the idea of the BPF tokens proposal, meaning you could >> create them with >> restricted access as you mentioned - allowing an explicit subset of >> program types to >> be loaded, subset of helpers/kfuncs, map types, etc.. Given you pass in >> this token >> context upon program load-time (resp. map creation), the verifier is >> then extended >> for restricted access. For example, see the >> bpf_token_allow_{cmd,map_type,prog_type}() >> in this series. The user namespace relation was part of the use cases, >> but not strictly >> part of the mechanism itself in this series. > > Hmm. It's very coarse grained. > > Also, the bpf() attach API seems to be largely (completely?) missing > what I would expect to be basic access controls on the things being > attached to. For example, the whole cgroup_bpf_prog_attach() path > seems to be entirely missing any checks as to whether its caller has > any particular permission over the cgroup in question. It doesn't even > check whether the cgroup is being accessed from the current userns > (i.e. whether the fd refers to a struct file with f_path.mnt belonging > to the current userns). So the API in this patchset has no way to > restrict permission to attach to cgroups to only apply to cgroups > belonging to the container. > Forgot to mention: there's also no way to limit the functions that can be called. While it's currently a bit of a pipe dream to do much useful work without bpf_probe_read_kernel(), it's at least conceptually possible to accomplish quite a bit without it, but there's no way to make that part of the policy.