[v2,bpf-next,00/18] BPF token

Message ID 20230607235352.1723243-1-andrii@kernel.org (mailing list archive)

Message

Andrii Nakryiko June 7, 2023, 11:53 p.m. UTC
This patch set introduces a new BPF object, the BPF token, which allows
delegating a subset of BPF functionality from a privileged system-wide
daemon (e.g., systemd or any other container manager) to a *trusted*
unprivileged application. Trust is the key here. This functionality is not
about allowing unconditional unprivileged BPF usage. Establishing trust,
though, is completely up to the discretion of the respective privileged
application that would create a BPF token.

The main motivation for the BPF token is a desire to enable containerized
BPF applications to be used together with user namespaces. This is currently
impossible, as CAP_BPF, required for BPF subsystem usage, cannot be
namespaced or sandboxed, as a general rule. E.g., tracing BPF programs,
thanks to BPF helpers like bpf_probe_read_kernel() and bpf_probe_read_user(),
can safely read arbitrary memory, and it's impossible to ensure that they
only read memory of processes belonging to any given namespace. This means
that it's impossible to have a namespace-aware CAP_BPF capability, and so
another mechanism for allowing safe usage of BPF functionality is necessary.
The BPF token, and its delegation to a trusted unprivileged application, is
such a mechanism. The kernel makes no assumption about what "trusted"
constitutes in any particular case; it's up to specific privileged
applications and their surrounding infrastructure to decide that. What the
kernel provides is a set of APIs to create and tune a BPF token, and to pass
it to privileged BPF commands that create new BPF objects like BPF programs,
BPF maps, etc.
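
To make the flow concrete, here is a rough sketch of how the pieces are
meant to fit together at the bpf() syscall level. The attr field names
below are purely illustrative and assume patched UAPI headers; the actual
field layout is defined by the patches themselves:

/* privileged daemon (e.g., container manager): create a restricted
 * token and hand it off to a trusted workload */
union bpf_attr tok_attr = {};
int token_fd = syscall(__NR_bpf, BPF_TOKEN_CREATE, &tok_attr,
		       sizeof(tok_attr));
/* ... token_fd is passed along via BPF FS pinning or SCM_RIGHTS ... */

/* trusted unprivileged workload: pass the token back explicitly on
 * each relevant command, e.g., map creation */
union bpf_attr map_attr = {};
map_attr.map_type = BPF_MAP_TYPE_ARRAY;
map_attr.key_size = 4;
map_attr.value_size = 8;
map_attr.max_entries = 16;
map_attr.map_token_fd = token_fd; /* illustrative field name */
int map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &map_attr,
		     sizeof(map_attr));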

A previous attempt at addressing this very same problem ([0]) used an
authoritative LSM approach, but was conclusively rejected by upstream LSM
maintainers. The BPF token concept changes nothing about the LSM approach,
and can be combined with LSM hooks for very fine-grained security policy.
Some ideas about making the BPF token more convenient to use with LSM (in
particular custom BPF LSM programs) were briefly described in a recent
LSF/MM/BPF 2023 presentation ([1]): e.g., an ability to specify
user-provided data (context), which in combination with BPF LSM would allow
implementing very dynamic and fine-grained custom security policies on top
of the BPF token. In the interest of minimizing API surface area
discussions, this is going to be added in follow-up patches, as it's not
essential to the fundamental concept of a delegatable BPF token.

It should be noted that the BPF token is conceptually quite similar to the
idea of a /dev/bpf device file, proposed by Song a while ago ([2]). The
biggest difference is the idea of using a virtual anon_inode file to hold a
BPF token and allowing multiple independent instances of it, each with its
own set of restrictions. BPF pinning solves the problem of exposing such a
BPF token through the file system (BPF FS, in this case) for cases where
transferring FDs over Unix domain sockets is not convenient. Also,
crucially, the BPF token approach does not use any special stateful
task-scoped flags. Instead, the bpf() syscall accepts a token_fd parameter
explicitly for each relevant BPF command. This addresses the main concerns
brought up during the /dev/bpf discussion, and fits better with the overall
BPF subsystem design.

This patch set adds the bare minimum of functionality needed to make the
BPF token useful and to allow discussing the API and functionality.
Currently only low-level libbpf APIs support passing a BPF token around,
which allows testing kernel functionality, but is for the most part
insufficient for real-world applications, which typically use high-level
libbpf APIs based on the `struct bpf_object` type. This was done with the
intent to limit the size of the patch set and concentrate mostly on
kernel-side changes. All the necessary plumbing for libbpf will be sent as
a separate follow-up patch set once kernel support makes it upstream.
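
For illustration, the low-level libbpf flow added by this series looks
roughly like the following. This is a sketch based on the patch titles
below; the bpf_token_create() arguments and the opts-style token_fd knob
are assumptions, so exact signatures may differ:

LIBBPF_OPTS(bpf_map_create_opts, opts);
int token_fd, map_fd;

/* new low-level wrapper for the BPF_TOKEN_CREATE command; its exact
 * arguments are defined in the corresponding patch */
token_fd = bpf_token_create(/* ... */);

opts.token_fd = token_fd; /* assumed opts field, per the patches */
map_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "token_map",
			sizeof(int), sizeof(long), 16, &opts);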

Another part that should happen once kernel-side BPF token support is
established is a set of conventions between applications (e.g., systemd),
tools (e.g., bpftool), and libraries (e.g., libbpf) for sharing BPF tokens
through BPF FS at well-defined locations, to allow applications to take
advantage of this in an automatic fashion without explicit code changes on
the BPF application's side. But I'd like to postpone this discussion until
after the BPF token concept lands.

  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/

v1->v2:
  - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
  - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).

Andrii Nakryiko (18):
  bpf: introduce BPF token object
  libbpf: add bpf_token_create() API
  selftests/bpf: add BPF_TOKEN_CREATE test
  bpf: move unprivileged checks into map_create() and bpf_prog_load()
  bpf: inline map creation logic in map_create() function
  bpf: centralize permissions checks for all BPF map types
  bpf: add BPF token support to BPF_MAP_CREATE command
  libbpf: add BPF token support to bpf_map_create() API
  selftests/bpf: add BPF token-enabled test for BPF_MAP_CREATE command
  bpf: add BPF token support to BPF_BTF_LOAD command
  libbpf: add BPF token support to bpf_btf_load() API
  selftests/bpf: add BPF token-enabled BPF_BTF_LOAD selftest
  bpf: keep BPF_PROG_LOAD permission checks clear of validations
  bpf: add BPF token support to BPF_PROG_LOAD command
  bpf: take into account BPF token when fetching helper protos
  bpf: consistently use BPF token throughout BPF verifier logic
  libbpf: add BPF token support to bpf_prog_load() API
  selftests/bpf: add BPF token-enabled BPF_PROG_LOAD tests

 drivers/media/rc/bpf-lirc.c                   |   2 +-
 include/linux/bpf.h                           |  70 ++-
 include/linux/filter.h                        |   2 +-
 include/uapi/linux/bpf.h                      |  37 ++
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/arraymap.c                         |   2 +-
 kernel/bpf/bloom_filter.c                     |   3 -
 kernel/bpf/bpf_local_storage.c                |   3 -
 kernel/bpf/bpf_struct_ops.c                   |   3 -
 kernel/bpf/cgroup.c                           |   6 +-
 kernel/bpf/core.c                             |   3 +-
 kernel/bpf/cpumap.c                           |   4 -
 kernel/bpf/devmap.c                           |   3 -
 kernel/bpf/hashtab.c                          |   6 -
 kernel/bpf/helpers.c                          |   6 +-
 kernel/bpf/inode.c                            |  26 ++
 kernel/bpf/lpm_trie.c                         |   3 -
 kernel/bpf/queue_stack_maps.c                 |   4 -
 kernel/bpf/reuseport_array.c                  |   3 -
 kernel/bpf/stackmap.c                         |   3 -
 kernel/bpf/syscall.c                          | 401 ++++++++++++++----
 kernel/bpf/token.c                            | 136 ++++++
 kernel/bpf/verifier.c                         |  13 +-
 kernel/trace/bpf_trace.c                      |   2 +-
 net/core/filter.c                             |  36 +-
 net/core/sock_map.c                           |   4 -
 net/ipv4/bpf_tcp_ca.c                         |   2 +-
 net/netfilter/nf_bpf_link.c                   |   2 +-
 net/xdp/xskmap.c                              |   4 -
 tools/include/uapi/linux/bpf.h                |  39 ++
 tools/lib/bpf/bpf.c                           |  32 +-
 tools/lib/bpf/bpf.h                           |  24 +-
 tools/lib/bpf/libbpf.map                      |   1 +
 .../selftests/bpf/prog_tests/libbpf_probes.c  |   4 +
 .../selftests/bpf/prog_tests/libbpf_str.c     |   6 +
 .../testing/selftests/bpf/prog_tests/token.c  | 260 ++++++++++++
 .../bpf/prog_tests/unpriv_bpf_disabled.c      |   6 +-
 37 files changed, 975 insertions(+), 188 deletions(-)
 create mode 100644 kernel/bpf/token.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c

Comments

Stanislav Fomichev June 8, 2023, 6:49 p.m. UTC | #1
On 06/07, Andrii Nakryiko wrote:
> This patch set introduces a new BPF object, the BPF token, which allows
> delegating a subset of BPF functionality from a privileged system-wide
> daemon (e.g., systemd or any other container manager) to a *trusted*
> unprivileged application. Trust is the key here. This functionality is not
> about allowing unconditional unprivileged BPF usage. Establishing trust,
> though, is completely up to the discretion of the respective privileged
> application that would create a BPF token.
>
> [...]
 
I went through v2, everything makes sense, the only thing that is
slightly confusing to me is the bpf_token_capable() call.
The name somehow implies that the token is capable of something,
whereas in reality the function does "return token || capable(x)".

IMO, it would be less confusing if we do something like the following,
explicitly, instead of calling a function:

if (token || {bpf_,perfmon_,}capable(x)) ...

(or rename to something like bpf_token_or_capable(x))

Up to you on whether to take any action on that. OTOH, once you
grasp what bpf_token_capable really does, it's not really a problem.
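
For context, the semantics being discussed boil down to roughly this
(a simplified sketch based on the description above; the real helper
lives in the new kernel/bpf/token.c):

static bool bpf_token_capable(const struct bpf_token *token, int cap)
{
	/* a token, if one was provided, is assumed to carry the
	 * relevant privilege; otherwise fall back to the regular
	 * capability check */
	return token || capable(cap);
}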
Andrii Nakryiko June 8, 2023, 10:17 p.m. UTC | #2
On Thu, Jun 8, 2023 at 11:49 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On 06/07, Andrii Nakryiko wrote:
> > [...]
>
> I went through v2, everything makes sense, the only thing that is
> slightly confusing to me is the bpf_token_capable() call.
> The name somehow implies that the token is capable of something,
> whereas in reality the function does "return token || capable(x)".

heh, "bpf_token_" part is sort of like namespace/object prefix. The
intent here was to have a token-aware capable check. And yes, if we
get a token during prog/map/etc construction, the assumption is that
it provides all relevant permissions.

>
> IMO, it would be less confusing if we do something like the following,
> explicitly, instead of calling a function:
>
> if (token || {bpf_,perfmon_,}capable(x)) ...
>
> (or rename to something like bpf_token_or_capable(x))

I'd rather not open-code `if (token || ...)` checks everywhere, but I
can rename to `bpf_token_or_capable()` if people prefer. I erred on
the side of succinctness, but if it's confusing, then best to rename?

>
> Up to you on whether to take any action on that. OTOH, once you
> grasp what bpf_token_capable really does, it's not really a problem.

Cool, thanks for taking a look!
Toke Høiland-Jørgensen June 9, 2023, 11:17 a.m. UTC | #3
Andrii Nakryiko <andrii@kernel.org> writes:

> This patch set introduces a new BPF object, the BPF token, which allows
> delegating a subset of BPF functionality from a privileged system-wide
> daemon (e.g., systemd or any other container manager) to a *trusted*
> unprivileged application. Trust is the key here. This functionality is not
> about allowing unconditional unprivileged BPF usage. Establishing trust,
> though, is completely up to the discretion of the respective privileged
> application that would create a BPF token.

I am not convinced that this token-based approach is a good way to solve
this: having the delegation mechanism be one where you can basically
only grant a perpetual delegation with no way to retract it, no way to
check what exactly it's being used for, and that is transitive (can be
passed on to others with no restrictions) seems like a recipe for
disaster. I believe this was basically the point Casey was making as
well in response to v1.

If the goal is to enable a privileged application (such as a container
manager) to grant another unprivileged application the permission to
perform certain bpf() operations, why not just proxy the operations
themselves over some RPC mechanism? That way the granting application
can perform authentication checks on every operation and ensure its
origins are sound at the time it is being made. Instead of just writing
a blank check (in the form of a token) and hoping the receiver of it is
not compromised...

-Toke
Andrii Nakryiko June 9, 2023, 6:21 p.m. UTC | #4
On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Andrii Nakryiko <andrii@kernel.org> writes:
>
> > [...]
>
> I am not convinced that this token-based approach is a good way to solve
> this: having the delegation mechanism be one where you can basically
> only grant a perpetual delegation with no way to retract it, no way to
> check what exactly it's being used for, and that is transitive (can be
> passed on to others with no restrictions) seems like a recipe for
> disaster. I believe this was basically the point Casey was making as
> well in response to v1.

Most of this can be added, if we really need to. Ability to revoke BPF
token is easy to implement (though of course it will apply only for
subsequent operations). We can allocate ID for BPF token just like we
do for BPF prog/map/link and let tools iterate and fetch information
about it. As for controlling who's passing what and where, I don't
think the situation is different for any other FD-based mechanism. You
might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
or BPF FS, and that application can keep doing the same to other
processes.

Ultimately, today we have root permissions for applications that
need BPF. That's already very dangerous. But just because something
might be misused or abused doesn't prevent us from making good
practical use of it, right?

Also, there is LSM on top of all of this to override and control how
the BPF subsystem is used, regardless of BPF token. It can override
any of the privilege mechanisms: capabilities, BPF token, whatnot.

>
> If the goal is to enable a privileged application (such as a container
> manager) to grant another unprivileged application the permission to
> perform certain bpf() operations, why not just proxy the operations
> themselves over some RPC mechanism? That way the granting application

It's explicitly what we *do not* want to do, as it is a major problem
and logistical complication. Every single application will have to be
rewritten to use such a special daemon/service and its API, which is
completely different from bpf() syscall API. It invalidates the use of
all the libbpf (and other bpf libraries') APIs, BPF skeleton is
incompatible with this. It's a nightmare. I've got feedback from
people in another company that do have BPF service with just a tiny
subset of BPF functionality delegated to such service, and it's a pain
and definitely not a preferred way to do things.

Just think about having to mirror a big chunk of bpf() syscall as an
RPC. So no, BPF proxy is definitely not a good solution.


> can perform authentication checks on every operation and ensure its
> origins are sound at the time it is being made. Instead of just writing
> a blank check (in the form of a token) and hoping the receiver of it is
> not compromised...

All this could and should be done through LSM in much more decoupled
and transparent (to application) way. BPF token doesn't prevent this.
It actually helps with this, because organizations can actually
dictate that operations that do not provide BPF token are
automatically rejected, and those that do provide BPF token can be
further checked and granted or rejected based on specific BPF token
instance.

>
> -Toke
Andy Lutomirski June 9, 2023, 6:32 p.m. UTC | #5
On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
> This patch set introduces a new BPF object, the BPF token, which allows
> delegating a subset of BPF functionality from a privileged system-wide
> daemon (e.g., systemd or any other container manager) to a *trusted*
> unprivileged application. Trust is the key here. This functionality is not
> about allowing unconditional unprivileged BPF usage. Establishing trust,
> though, is completely up to the discretion of the respective privileged
> application that would create a BPF token.
>

I skimmed the description and the LSFMM slides.

Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.

Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?  Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
Andrii Nakryiko June 9, 2023, 7:08 p.m. UTC | #6
On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
> > [...]
>
> I skimmed the description and the LSFMM slides.
>
> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.
>
> Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?

Yes?.. In the sense that it is possible to create BPF programs and BPF
maps from inside the container (with a BPF token). Right now, under a
user namespace, it's impossible no matter what you do.

> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.

BPF is still a privileged thing. You can't just say that any
unprivileged application should be able to use BPF. That's why the BPF
token is about trusting an unprivileged application in a controlled
environment (production) not to do something crazy. It can be enforced
further through LSM usage, but in a lot of cases, when dealing with
internal production applications, it's enough to have a proper
application design and rely on the code review process to avoid any
negative effects.

So a privileged daemon (container manager) will be configured with the
knowledge of which services/containers are allowed to use BPF, and
will grant a BPF token only to those that were explicitly allowlisted.
Toke Høiland-Jørgensen June 9, 2023, 9:21 p.m. UTC | #7
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii@kernel.org> writes:
>>
>> > [...]
>>
>> I am not convinced that this token-based approach is a good way to solve
>> this: having the delegation mechanism be one where you can basically
>> only grant a perpetual delegation with no way to retract it, no way to
>> check what exactly it's being used for, and that is transitive (can be
>> passed on to others with no restrictions) seems like a recipe for
>> disaster. I believe this was basically the point Casey was making as
>> well in response to v1.
>
> Most of this can be added, if we really need to. Ability to revoke BPF
> token is easy to implement (though of course it will apply only for
> subsequent operations). We can allocate ID for BPF token just like we
> do for BPF prog/map/link and let tools iterate and fetch information
> about it. As for controlling who's passing what and where, I don't
> think the situation is different for any other FD-based mechanism. You
> might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
> or BPF FS, and that application can keep doing the same to other
> processes.

No, but every other fd-based mechanism is limited in scope. E.g., if you
pass a map fd, that's one specific map that can be passed around; with a
token it's all operations (of a specific type), which is way broader.

> > Ultimately, today we have root permissions for applications that
> > need BPF. That's already very dangerous. But just because something
> > might be misused or abused doesn't prevent us from making good
> > practical use of it, right?

That's not a given. It's always a trade-off, and if the mechanism is
likely to open up the system to additional risk that's not a good
trade-off even if it helps in some case. I basically worry that this is
the case here.

> > Also, there is LSM on top of all of this to override and control how
> > the BPF subsystem is used, regardless of BPF token. It can override
> > any of the privilege mechanisms: capabilities, BPF token, whatnot.

If this mechanism needs an LSM to be used safely, that's not incredibly
confidence-inspiring. Security mechanisms should fail safe, which this
one does not.

I'm also worried that an LSM policy is the only way to disable the
ability to create a token; with this in the kernel, I suddenly have to
trust not only that all applications with BPF privileges will not load
malicious code, but also that they won't (accidentally or maliciously)
confer extra privileges on someone else. Seems a bit broad to have this
ability (to issue tokens) available to everyone with access to the bpf()
syscall, when (IIUC) it's only a single daemon in the system that would
legitimately do this in the deployment you're envisioning.

>> If the goal is to enable a privileged application (such as a container
>> manager) to grant another unprivileged application the permission to
>> perform certain bpf() operations, why not just proxy the operations
>> themselves over some RPC mechanism? That way the granting application
>
> It's explicitly what we *do not* want to do, as it is a major problem
> and logistical complication. Every single application will have to be
> rewritten to use such a special daemon/service and its API, which is
> completely different from bpf() syscall API. It invalidates the use of
> all the libbpf (and other bpf libraries') APIs, BPF skeleton is
> incompatible with this. It's a nightmare. I've got feedback from
> people in another company that do have BPF service with just a tiny
> subset of BPF functionality delegated to such service, and it's a pain
> and definitely not a preferred way to do things.

But weren't you proposing that libbpf should be able to transparently
look for tokens and load them without any application changes? Why can't
libbpf be taught to use an RPC socket in a similar fashion? It basically
boils down to something like:

static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
			  unsigned int size)
{
	struct stat st;

	if (stat("/run/bpf.sock", &st) == 0) {
		/* open_socket()/write_to()/read_response() are
		 * placeholders for the actual RPC plumbing */
		int sock = open_socket("/run/bpf.sock");

		write_to(sock, cmd, attr, size);
		return read_response(sock);
	} else {
		return syscall(__NR_bpf, cmd, attr, size);
	}
}

> Just think about having to mirror a big chunk of bpf() syscall as an
> RPC. So no, BPF proxy is definitely not a good solution.

The daemon at the other side of the socket in the example above doesn't
*have* to be taught all the semantics of the syscall, it can just look
at the command name and make a decision based on that and the identity
of the socket peer, then just pass the whole thing to the kernel if the
permission check passes.
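
For concreteness, the daemon side could be a loop along these lines (a
sketch only, assuming a SOCK_SEQPACKET Unix socket carrying one
(cmd, attr) pair per message, SO_PEERCRED for peer identity, and an
allowed() policy lookup that is not shown):

struct { int cmd; union bpf_attr attr; } req;
struct ucred peer;
socklen_t len = sizeof(peer);
int ret = -1;

getsockopt(client_fd, SOL_SOCKET, SO_PEERCRED, &peer, &len);
if (recv(client_fd, &req, sizeof(req), 0) == sizeof(req) &&
    allowed(peer.uid, peer.pid, req.cmd))
	ret = syscall(__NR_bpf, req.cmd, &req.attr, sizeof(req.attr));
/* commands that return an fd would also need to pass it back via
 * SCM_RIGHTS, which is part of the complication discussed below */
send(client_fd, &ret, sizeof(ret), 0);
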

>> can perform authentication checks on every operation and ensure its
>> origins are sound at the time it is being made. Instead of just writing
>> a blank check (in the form of a token) and hoping the receiver of it is
>> not compromised...
>
> All this could and should be done through LSM in much more decoupled
> and transparent (to application) way. BPF token doesn't prevent this.
> It actually helps with this, because organizations can actually
> dictate that operations that do not provide BPF token are
> automatically rejected, and those that do provide BPF token can be
> further checked and granted or rejected based on specific BPF token
> instance.

See above re: needing an LSM policy to make this safe...

-Toke
Andrii Nakryiko June 9, 2023, 10:03 p.m. UTC | #8
On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Andrii Nakryiko <andrii@kernel.org> writes:
> >>
> >> > [...]
> >>
> >> I am not convinced that this token-based approach is a good way to solve
> >> this: having the delegation mechanism be one where you can basically
> >> only grant a perpetual delegation with no way to retract it, no way to
> >> check what exactly it's being used for, and that is transitive (can be
> >> passed on to others with no restrictions) seems like a recipe for
> >> disaster. I believe this was basically the point Casey was making as
> >> well in response to v1.
> >
> > Most of this can be added, if we really need to. Ability to revoke BPF
> > token is easy to implement (though of course it will apply only for
> > subsequent operations). We can allocate ID for BPF token just like we
> > do for BPF prog/map/link and let tools iterate and fetch information
> > about it. As for controlling who's passing what and where, I don't
> > think the situation is different for any other FD-based mechanism. You
> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
> > or BPF FS, and that application can keep doing the same to other
> > processes.
>
> No, but every other fd-based mechanism is limited in scope. E.g., if you
> pass a map fd, that's one specific map that can be passed around; with a
> token it's all operations (of a specific type), which is way broader.

It's not black and white. Once you have a BPF program FD, you can
attach it many times, for example, and cause regressions. Sure, here
we are talking about creating multiple BPF maps or loading multiple
BPF programs, so it's wider in scope, but still, it's not that
fundamentally different.

>
> > > Ultimately, today we have root permissions for applications that
> > > need BPF. That's already very dangerous. But just because something
> > > might be misused or abused doesn't prevent us from making good
> > > practical use of it, right?
>
> That's not a given. It's always a trade-off, and if the mechanism is
> likely to open up the system to additional risk that's not a good
> trade-off even if it helps in some case. I basically worry that this is
> the case here.
>
> > > Also, there is LSM on top of all of this to override and control how
> > > the BPF subsystem is used, regardless of BPF token. It can override
> > > any of the privilege mechanisms: capabilities, BPF token, whatnot.
>
> If this mechanism needs an LSM to be used safely, that's not incredibly
> confidence-inspiring. Security mechanisms should fail safe, which this
> one does not.

I proposed to add authoritative LSM hooks that would selectively allow
some BPF operations on a case-by-case basis. This was rejected, with
the claim that the best approach is to give the process the privilege
to do whatever it needs to do and then restrict it with LSM.

Ok, if not for user namespaces, that would mean giving an application
CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN and then restricting
it with LSM. Except with user namespaces that doesn't work. So that's
where the BPF token comes in, but it does this more safely, by
allowing the granted subset of BPF operations to be coarsely tuned.
And then LSM should be used to further restrict it.

>
> I'm also worried that an LSM policy is the only way to disable the
> ability to create a token; with this in the kernel, I suddenly have to
> trust not only that all applications with BPF privileges will not load
> malicious code, but also that they won't (accidentally or maliciously)
> confer extra privileges on someone else. Seems a bit broad to have this
> ability (to issue tokens) available to everyone with access to the bpf()
> syscall, when (IIUC) it's only a single daemon in the system that would
> legitimately do this in the deployment you're envisioning.

Note, any process with real CAP_SYS_ADMIN. Let's not forget that.

But would you feel better if BPF_TOKEN_CREATE was guarded behind
sysctl or Kconfig?

Ultimately, worrying is fine, but there are real problems that need to
be solved. And not doing anything isn't a great option.

>
> >> If the goal is to enable a privileged application (such as a container
> >> manager) to grant another unprivileged application the permission to
> >> perform certain bpf() operations, why not just proxy the operations
> >> themselves over some RPC mechanism? That way the granting application
> >
> > It's explicitly what we *do not* want to do, as it is a major problem
> > and logistical complication. Every single application will have to be
> > rewritten to use such a special daemon/service and its API, which is
> > completely different from bpf() syscall API. It invalidates the use of
> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is
> > incompatible with this. It's a nightmare. I've got feedback from
> > people in another company that do have BPF service with just a tiny
> > subset of BPF functionality delegated to such service, and it's a pain
> > and definitely not a preferred way to do things.
>
> But weren't you proposing that libbpf should be able to transparently
> look for tokens and load them without any application changes? Why can't
> libbpf be taught to use an RPC socket in a similar fashion? It basically
> boils down to something like:
>
> [...]
>

Well, for one, Meta will use its own Thrift-based RPC protocol.
Google might use something internal using gRPC, someone else
would want to utilize systemd, yet others will use yet another
implementation. RPC introduces more failure modes. While with a syscall
we know that an operation either succeeded or failed, with RPC we'll have
to deal with "maybe" if there was some communication error.

Let's not trivialize adding, using, and supporting the RPC version of
bpf() syscall.


> > Just think about having to mirror a big chunk of bpf() syscall as an
> > RPC. So no, BPF proxy is definitely not a good solution.
>
> The daemon at the other side of the socket in the example above doesn't
> *have* to be taught all the semantics of the syscall, it can just look
> at the command name and make a decision based on that and the identity
> of the socket peer, then just pass the whole thing to the kernel if the
> permission check passes.

Let's not trivialize the consequences of adding an RPC protocol to all
this, please. No matter in what form or shape.

>
> >> can perform authentication checks on every operation and ensure its
> >> origins are sound at the time it is being made. Instead of just writing
> >> a blank check (in the form of a token) and hoping the receiver of it is
> >> not compromised...
> >
> > All this could and should be done through LSM in much more decoupled
> > and transparent (to application) way. BPF token doesn't prevent this.
> > It actually helps with this, because organizations can actually
> > dictate that operations that do not provide BPF token are
> > automatically rejected, and those that do provide BPF token can be
> > further checked and granted or rejected based on specific BPF token
> > instance.
>
> See above re: needing an LSM policy to make this safe...

See above. We are talking about a CAP_SYS_ADMIN-enabled process.
It's already not safe by definition.

>
> -Toke
Djalal Harouni June 9, 2023, 10:29 p.m. UTC | #9
Hi Andrii,

On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> [...]
>
> The main motivation for the BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be
> namespaced or sandboxed, as a general rule. E.g., tracing BPF programs,
> thanks to BPF helpers like bpf_probe_read_kernel() and bpf_probe_read_user(),
> can safely read arbitrary memory, and it's impossible to ensure that they
> only read memory of processes belonging to any given namespace. This means
> that it's impossible to have a namespace-aware CAP_BPF capability, and so
> another mechanism for allowing safe usage of BPF functionality is necessary.
> The BPF token, and its delegation to a trusted unprivileged application, is
> such a mechanism. The kernel makes no assumption about what "trusted"
> constitutes in any particular case; it's up to specific privileged
> applications and their surrounding infrastructure to decide that. What the
> kernel provides is a set of APIs to create and tune a BPF token, and to pass
> it to privileged BPF commands that create new BPF objects like BPF programs,
> BPF maps, etc.

Is there a reason for coupling this only with the userns?
Can the "trusted unprivileged" application assumed by systemd be in
the init userns?


> [...]
>
> It should be noted that the BPF token is conceptually quite similar to the
> idea of a /dev/bpf device file, proposed by Song a while ago ([2]). The
> biggest difference is the idea of using a virtual anon_inode file to hold a
> BPF token and allowing multiple independent instances of it, each with its
> own set of restrictions. BPF pinning solves the problem of exposing such a
> BPF token through the file system (BPF FS, in this case) for cases where
> transferring FDs over Unix domain sockets is not convenient. Also,
> crucially, the BPF token approach does not use any special stateful
> task-scoped flags. Instead, the bpf()

What's the use case for transferring over Unix domain sockets?

Will BPF token translation happen if you cross namespaces?

If the token is pinned into different bpffs, will the token share the
same context?
Andrii Nakryiko June 9, 2023, 10:57 p.m. UTC | #10
On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> Hi Andrii,
>
> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > [...]
>
> Is there a reason for coupling this only with the userns?

There is no coupling. Without userns it is at least possible to grant
CAP_BPF and other capabilities from init ns. With user namespace that
becomes impossible.

> The "trusted unprivileged" assumed by systemd can be in init userns?

It doesn't have to be systemd, but yes, BPF token can be created only
when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
of commands).

>
>
> > [...]
>
> What's the use case for transferring over Unix domain sockets?

I'm not sure I understand the question. A Unix domain socket
(specifically its SCM_RIGHTS ancillary message) allows transferring
files between processes, which is one way to pass a BPF object (like a
prog/map/link, and now a token). BPF FS is the other one. In practice
it's usually BPF FS, but there is no presumption about how the file
reference is transferred.
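
For reference, the sending side of that is the standard SCM_RIGHTS
pattern, sketched here for a token fd; it assumes a connected Unix
socket sock_fd, and the receiver mirrors it with recvmsg():

char cbuf[CMSG_SPACE(sizeof(int))], ping = 't';
struct iovec iov = { .iov_base = &ping, .iov_len = 1 };
struct msghdr msg = {
	.msg_iov = &iov, .msg_iovlen = 1,
	.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
};
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS; /* kernel installs a dup'd fd in the peer */
cmsg->cmsg_len = CMSG_LEN(sizeof(int));
memcpy(CMSG_DATA(cmsg), &token_fd, sizeof(int));
sendmsg(sock_fd, &msg, 0);
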

>
> Will BPF token translation happen if you cross namespaces?

What does BPF token translation mean specifically? Currently it's a
very simple kernel object with refcnt and a few flags, so there is
nothing to translate?

>
> If the token is pinned into different bpffs, will the token share the
> same context?

So I was planning to allow a user process creating a BPF token to
specify custom user-provided data (context). That is not in this patch
set; is that what you are asking about?

Regardless, pinning a BPF object in BPF FS basically just bumps a
refcount and exposes that object in a way that can be looked up through
a file system path (using the bpf() syscall's BPF_OBJ_GET command). The
underlying object isn't cloned or copied; it's exactly the same object
with the same shared internal state.
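
For illustration, that pin/lookup round trip through libbpf's thin
wrappers looks like this (the path is made up for the example):

/* producer: expose the token at a well-known BPF FS location */
bpf_obj_pin(token_fd, "/sys/fs/bpf/token");       /* BPF_OBJ_PIN */

/* consumer: look it back up; the returned fd references the very
 * same underlying token object and its shared state */
int fd = bpf_obj_get("/sys/fs/bpf/token");        /* BPF_OBJ_GET */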
Toke Høiland-Jørgensen June 12, 2023, 10:49 a.m. UTC | #11
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >>
>> >> Andrii Nakryiko <andrii@kernel.org> writes:
>> >>
>> >> > [...]
>> >>
>> >> I am not convinced that this token-based approach is a good way to solve
>> >> this: having the delegation mechanism be one where you can basically
>> >> only grant a perpetual delegation with no way to retract it, no way to
>> >> check what exactly it's being used for, and that is transitive (can be
>> >> passed on to others with no restrictions) seems like a recipe for
>> >> disaster. I believe this was basically the point Casey was making as
>> >> well in response to v1.
>> >
>> > Most of this can be added, if we really need to. Ability to revoke BPF
>> > token is easy to implement (though of course it will apply only for
>> > subsequent operations). We can allocate ID for BPF token just like we
>> > do for BPF prog/map/link and let tools iterate and fetch information
>> > about it. As for controlling who's passing what and where, I don't
>> > think the situation is different for any other FD-based mechanism. You
>> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
>> > or BPF FS, and that application can keep doing the same to other
>> > processes.
>>
>> No, but every other fd-based mechanism is limited in scope. E.g., if you
>> pass a map fd, that's one specific map that can be passed around; with a
>> token it's all operations (of a specific type), which is way broader.
>
> It's not black and white. Once you have a BPF program FD, you can
> attach it many times, for example, and cause regressions. Sure, here
> we are talking about creating multiple BPF maps or loading multiple
> BPF programs, so it's wider in scope, but still, it's not that
> fundamentally different.

Right, but the difference is that a single BPF program is a known
entity, so even if the application you pass the fd to can attach it
multiple times, it can't make it do new things (e.g., bpf_probe_read()
stuff it is not supposed to). Whereas with bpf_token you have no such
guarantee.

>>
>> > Ultimately, today we have root permissions for applications that
>> > need BPF. That's already very dangerous. But just because something
>> > might be misused or abused doesn't prevent us from making good
>> > practical use of it, right?
>>
>> That's not a given. It's always a trade-off, and if the mechanism is
>> likely to open up the system to additional risk that's not a good
>> trade-off even if it helps in some case. I basically worry that this is
>> the case here.
>>
>> > Also, there is LSM on top of all of this to override and control how
>> > the BPF subsystem is used, regardless of BPF token. It can override
>> > any of the privilege mechanisms: capabilities, BPF token, whatnot.
>>
>> If this mechanism needs an LSM to be used safely, that's not incredibly
>> confidence-inspiring. Security mechanisms should fail safe, which this
>> one does not.
>
> I proposed to add authoritative LSM hooks that would selectively allow
> some BPF operations on a case-by-case basis. This was rejected, with
> the claim that the best approach is to give the process the privilege
> to do whatever it needs to do and then restrict it with LSM.
>
> Ok, if not for user namespaces, that would mean giving an application
> CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN and then restricting
> it with LSM. Except with user namespaces that doesn't work. So that's
> where the BPF token comes in, but it does this more safely, by
> allowing the granted subset of BPF operations to be coarsely tuned.
> And then LSM should be used to further restrict it.

Right, I do understand the use case; my worry is that we're creating a
privilege escalation model that is really broad if it is *not* coupled
with an LSM to restrict it. Which will be the default outside of
controlled environments that really know what they are doing.

So I dunno, maybe some way to restrict the token so it only grants
privilege if there is *also* an explicit LSM verdict on it? I guess
that's still too close to an authoritative LSM hook for it to pass? I
do think the "explicit grant" model of an authoritative LSM is a better
fit for this kind of thing...

>> I'm also worried that an LSM policy is the only way to disable the
>> ability to create a token; with this in the kernel, I suddenly have to
>> trust not only that all applications with BPF privileges will not load
>> malicious code, but also that they won't (accidentally or maliciously)
convey extra privileges to someone else. Seems a bit broad to have this
>> ability (to issue tokens) available to everyone with access to the bpf()
>> syscall, when (IIUC) it's only a single daemon in the system that would
>> legitimately do this in the deployment you're envisioning.
>
> Note, any process with real CAP_SYS_ADMIN. Let's not forget that.
>
> But would you feel better if BPF_TOKEN_CREATE was guarded behind
> sysctl or Kconfig?

Hmm, yeah, some way to make sure it's off by default would be
preferable, IMO.
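
Even something as simple as this would do (all names are made up here,
just sketching an off-by-default guard):

static int sysctl_bpf_token_create_enable; /* default 0 = disabled */

static int bpf_token_create_allowed(void)
{
        if (!READ_ONCE(sysctl_bpf_token_create_enable))
                return -EPERM; /* off unless an admin opted in */
        if (!capable(CAP_SYS_ADMIN))
                return -EPERM; /* still init-ns CAP_SYS_ADMIN only */
        return 0;
}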

> Ultimately, worrying is fine, but there are real problems that need to
> be solved. And not doing anything isn't a great option.

Right, it would be good if some of the security folks could chime in
with their view of how this is best achieved without running into any of
the "bad ideas" they are opposed to.

>> >> If the goal is to enable a privileged application (such as a container
>> >> manager) to grant another unprivileged application the permission to
>> >> perform certain bpf() operations, why not just proxy the operations
>> >> themselves over some RPC mechanism? That way the granting application
>> >
>> > It's explicitly what we *do not* want to do, as it is a major problem
>> > and logistical complication. Every single application will have to be
>> > rewritten to use such a special daemon/service and its API, which is
>> > completely different from bpf() syscall API. It invalidates the use of
>> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is
>> > incompatible with this. It's a nightmare. I've got feedback from
>> > people in another company that do have BPF service with just a tiny
>> > subset of BPF functionality delegated to such service, and it's a pain
>> > and definitely not a preferred way to do things.
>>
>> But weren't you proposing that libbpf should be able to transparently
>> look for tokens and load them without any application changes? Why can't
>> libbpf be taught to use an RPC socket in a similar fashion? It basically
>> boils down to something like:
>>
>> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
>>                           unsigned int size)
>> {
>>         if (!access("/run/bpf.sock", F_OK)) {
>>                 int sock = open_socket("/run/bpf.sock");
>>                 write_to(sock, cmd, attr, size);
>>                 return read_response(sock);
>>         } else {
>>                 return syscall(__NR_bpf, cmd, attr, size);
>>         }
>> }
>>
>
> Well, for one, Meta will use its own Thrift-based RPC protocol.
> Google might use something internal for them using GRPC, someone else
> would want to utilize systemd, yet others will use yet another
> implementation. RPC introduces more failure modes. While with syscall
> we know that operation either succeeded or failed, with RPC we'll have
> to deal with "maybe", if it was some communication error.
>
> Let's not trivialize adding, using, and supporting the RPC version of
> bpf() syscall.

I am not trying to trivialise it; I am well aware that it is more
complicated in practice than just adding a wrapper like the above. I am
just arguing with your point that "all applications need to change, so
we can't do RPC". Any mechanism we add along these lines will require
application changes, including the BPF token. And if the way we're going
to avoid that is by baking the support into libbpf, then that can be
done regardless of the mechanism we choose.

Or to put it another way: as you say it may be more *complicated* to add
an RPC-based path to libbpf, but it's not fundamentally impossible, it's
just another technical problem to be solved. And if that added
complexity buys us better security properties, maybe that is a good
trade-off. At least we shouldn't dismiss it out of hand.

-Toke
Djalal Harouni June 12, 2023, 12:02 p.m. UTC | #12
On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> >
> > Hi Andrii,
> >
> > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > ...
> > > creating new BPF objects like BPF programs, BPF maps, etc.
> >
> > Is there a reason for coupling this only with the userns?
>
> There is no coupling. Without userns it is at least possible to grant
> CAP_BPF and other capabilities from init ns. With user namespace that
> becomes impossible.

But these are not the same: delegate full cap vs delegate an fd mask?

One can argue that unprivileged in the init userns is the same as
privileged in a nested userns.
Getting to delegate the fd in the init userns, then in nested ones, seems logical...

> > The "trusted unprivileged" assumed by systemd can be in init userns?
>
> It doesn't have to be systemd, but yes, BPF token can be created only
> when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> of commands).

I'm more into getting fd delegation to work also in the first init userns...

I can't understand why it's not possible or doable?

> >
> >
> > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > approach, but can be combined with LSM hooks for very fine-grained security
> > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > (context), which in combination with BPF LSM would allow implementing a very
> > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > interest of minimizing API surface area discussions this is going to be
> > > added in follow up patches, as it's not essential to the fundamental concept
> > > of delegatable BPF token.
> > >
> > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > allowing multiple independent instances of them, each with its own set of
> > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> >
> > What's the use case for transfering over unix domain sockets?
>
> I'm not sure I understand the question. Unix domain socket
> (specifically its SCM_RIGHTS ancillary message) allows to transfer
> files between processes, which is one way to pass BPF object (like
> prog/map/link, and now token). BPF FS is the other one. In practice
> it's usually BPF FS, but there is no presumption about how file
> reference is transferred.

Got it.

IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
userns, no?

I assume that's what allows setting things up in a hierarchical way...

If I set up the environment to lock things down the line, I find it
strange if a received fd would allow me to do more things than what
was planned when I created the environment: namespaces, mounts, etc

I think you have to add the owning userns context to the fd or
"token", and on the receiving part if the current userns is the same
or a nested one of the current userns hierarchy then allow bpf
operation, otherwise fail with -EACCESS or something similar...


> >
> > Will BPF token translation happen if you cross the different namespaces?
>
> What does BPF token translation mean specifically? Currently it's a
> very simple kernel object with refcnt and a few flags, so there is
> nothing to translate?

Please see above comment about the owning userns context

> >
> > If the token is pinned into different bpffs, will the token share the
> > same context?
>
> So I was planning to allow a user process creating a BPF token to
> specify custom user-provided data (context). This is not in this patch
> set, but is it what you are asking about?

Exactly, define what you can access inside the container... this would
align with Andy's suggestion "making BPF behave sensibly in that
container seems like it should also be necessary." I do agree on this.

Again I think LSM and bpf+lsm should have the final word on this too...


> Regardless, pinning BPF object in BPF FS is just basically bumping a
> refcnt and exposes that object in a way that can be looked up through
> file system path (using bpf() syscall's BPF_OBJ_GET command).
> Underlying object isn't cloned or copied, it's exactly the same object
> with the same shared internal state.

This is the part I also find strange. I can understand pinning a bpf
program, map, etc, but an fd that gives some access rights should be
part of the filesystem from the start; I don't get the extra pinning.
Also it seems bpffs is per superblock mount, so why not allow
privileged to mount bpffs with the corresponding information, then
privileged can open the fd, set it up and pass it down the line when
executing the main program? Or even allow unprivileged to open it on
bpffs with some restrictive conditions?

Then it would be the business of the privileged to bind mount bpffs in
some other places, share it, etc

Having the fd or "token" that gives access rights pinned in two
separate bpffs mounts seems too much, it crosses namespaces (mount,
userns etc), environments setup by privileged...

I would just make it per bpffs mount and that's it, nothing more. If a
program wants to bind mount it somewhere else then it's not a bpf
problem.
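
Roughly like this, from the privileged side (the option string below is
invented, just to show the shape of the idea):

/* privileged setup: encode what is delegated at mount time,
 * instead of pinning a separate token object afterwards */
mount("bpffs", "/run/bpf", "bpf", 0,
      "delegate_cmds=prog_load:map_create");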
Dave Tucker June 12, 2023, 12:44 p.m. UTC | #13
> On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote:
> 
> This patch set introduces new BPF object, BPF token, which allows to delegate
> a subset of BPF functionality from privileged system-wide daemon (e.g.,
> systemd or any other container manager) to a *trusted* unprivileged
> application. Trust is the key here. This functionality is not about allowing
> unconditional unprivileged BPF usage. Establishing trust, though, is
> completely up to the discretion of respective privileged application that
> would create a BPF token.


Hello! Author of a bpfd[1] here.

> The main motivation for BPF token is a desire to enable containerized
> BPF applications to be used together with user namespaces. This is currently
> impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> arbitrary memory, and it's impossible to ensure that they only read memory of
> processes belonging to any given namespace. This means that it's impossible to
> have namespace-aware CAP_BPF capability, and as such another mechanism to
> allow safe usage of BPF functionality is necessary. BPF token and delegation
> of it to a trusted unprivileged applications is such mechanism. Kernel makes
> no assumption about what "trusted" constitutes in any particular case, and
> it's up to specific privileged applications and their surrounding
> infrastructure to decide that. What kernel provides is a set of APIs to create
> and tune BPF token, and pass it around to privileged BPF commands that are
> creating new BPF objects like BPF programs, BPF maps, etc.

You could do that… but the problem is created due to the pattern of having a
single binary that is responsible for:

- Loading and attaching the BPF program in question
- Interacting with maps

Let’s set aside some of the other fun concerns of eBPF in containers:
 - Requiring mounting of vmlinux, bpffs, traces etc…
 - How fs permissions on host translate into permissions in containers

While your proposal lets you grant a subset of CAP_BPF to some other process,
which I imagine could also be done with SELinux, it doesn’t stop you from needing
other required permissions for attaching tracing programs in such an
environment. 

For example, say container A wants to attach a uprobe to a process in container B.
Container A needs to be able to nsenter into container B’s pidns in order for attachment
to succeed… but then what I can do with CAP_BPF is the least of my concerns since
I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
much scarier than CAP_BPF in the first place.

If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
then with recent kernels your container workload should be fine to run entirely unprivileged,
or worst case with only CAP_BPF since all you need to do is read/write maps.

Policy control - which process can request to load programs that monitor which other
processes - would happen within this system daemon and you wouldn’t need tokens.

Since it’s easy enough to do this in userspace, I’d be strongly against adding more
complexity into BPF to support this use case.

> Previous attempt at addressing this very same problem ([0]) attempted to
> utilize authoritative LSM approach, but was conclusively rejected by upstream
> LSM maintainers. BPF token concept is not changing anything about LSM
> approach, but can be combined with LSM hooks for very fine-grained security
> policy. Some ideas about making BPF token more convenient to use with LSM (in
> particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> 2023 presentation ([1]). E.g., an ability to specify user-provided data
> (context), which in combination with BPF LSM would allow implementing a very
> dynamic and fine-granular custom security policies on top of BPF token. In the
> interest of minimizing API surface area discussions this is going to be
> added in follow up patches, as it's not essential to the fundamental concept
> of delegatable BPF token.
> 
> It should be noted that BPF token is conceptually quite similar to the idea of
> /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> difference is the idea of using virtual anon_inode file to hold BPF token and
> allowing multiple independent instances of them, each with its own set of
> restrictions. BPF pinning solves the problem of exposing such BPF token
> through file system (BPF FS, in this case) for cases where transferring FDs
> over Unix domain sockets is not convenient. And also, crucially, BPF token
> approach is not using any special stateful task-scoped flags. Instead, bpf()
> syscall accepts token_fd parameters explicitly for each relevant BPF command.
> This addresses main concerns brought up during the /dev/bpf discussion, and
> fits better with overall BPF subsystem design.
> 
> This patch set adds a basic minimum of functionality to make BPF token useful
> and to discuss API and functionality. Currently only low-level libbpf APIs
> support passing BPF token around, allowing to test kernel functionality, but
> for the most part is not sufficient for real-world applications, which
> typically use high-level libbpf APIs based on `struct bpf_object` type. This
> was done with the intent to limit the size of patch set and concentrate on
> mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> as a separate follow up patch set kernel support makes it upstream.
> 
> Another part that should happen once kernel-side BPF token is established, is
> a set of conventions between applications (e.g., systemd), tools (e.g.,
> bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> at well-defined locations to allow applications take advantage of this in
> automatic fashion without explicit code changes on BPF application's side.
> But I'd like to postpone this discussion to after BPF token concept lands.
> 
>  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
>  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
>  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> 

- Dave

[1]: https://github.com/bpfd-dev/bpfd
Djalal Harouni June 12, 2023, 2:31 p.m. UTC | #14
On Mon, Jun 12, 2023 at 2:02 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
...
> > I'm not sure I understand the question. Unix domain socket
> > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > files between processes, which is one way to pass BPF object (like
> > prog/map/link, and now token). BPF FS is the other one. In practice
> > it's usually BPF FS, but there is no presumption about how file
> > reference is transferred.
>
> Got it.
>
> IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> userns, no?
>
> I assume that's what allows setting things up in a hierarchical way...
>
> If I set up the environment to lock things down the line, I find it
> strange if a received fd would allow me to do more things than what
> was planned when I created the environment: namespaces, mounts, etc
>
> I think you have to add the owning userns context to the fd or
> "token", and on the receiving part if the current userns is the same
> or a nested one of the current userns hierarchy then allow bpf
> operation, otherwise fail with -EACCESS or something similar...

Andrii, to make it clear: take the owning userns that is the
owner/creator of the bpffs mount (better this one, since it prevents
the "inherit the fd and do bad things with it" cases...), let's call
it userns A, and the receiving process is in userns B. When
transferring the fd, if userns B == userns A, or if A is an ancestor
of B, then allow operations with the fd token; otherwise just deny it...

At least that's how I see things now, but maybe there are corner cases...
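
In pseudo-kernel-C, something like this (the token->owner_userns field
is hypothetical, just to illustrate the ancestry walk):

static bool bpf_token_userns_allowed(const struct bpf_token *token)
{
        struct user_namespace *ns = current_user_ns();

        /* allow if the receiver is in the owning userns A or in
         * any userns B nested somewhere below A */
        for (; ns; ns = ns->parent)
                if (ns == token->owner_userns)
                        return true;
        return false; /* caller fails with -EACCES */
}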
Djalal Harouni June 12, 2023, 3:52 p.m. UTC | #15
On Mon, Jun 12, 2023 at 2:45 PM Dave Tucker <datucker@redhat.com> wrote:
>
>
>
> > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
>
>
> Hello! Author of a bpfd[1] here.
>
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> You could do that… but the problem is created due to the pattern of having a
> single binary that is responsible for:
>
> - Loading and attaching the BPF program in question
> - Interacting with maps
>
> Let’s set aside some of the other fun concerns of eBPF in containers:
>  - Requiring mounting of vmlinux, bpffs, traces etc…
>  - How fs permissions on host translate into permissions in containers
>
> While your proposal lets you grant a subset of CAP_BPF to some other process,
> which I imagine could also be done with SELinux, it doesn’t stop you from needing
>
> other required permissions for attaching tracing programs in such an
> environment.
>
> For example, say container A wants to attach a uprobe to a process in container B.
> Container A needs to be able to nsenter into container B’s pidns in order for attachment
> to succeed… but then what I can do with CAP_BPF is the least of my concerns since
> I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
> much scarier than CAP_BPF in the first place.
>
> If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
> then with recent kernels your container workload should be fine to run entirely unprivileged,
> or worst case with only CAP_BPF since all you need to do is read/write maps.
>
> Policy control - which process can request to load programs that monitor which other
> processes - would happen within this system daemon and you wouldn’t need tokens.
>
> Since it’s easy enough to do this in userspace, I’d be strongly against adding more
> complexity into BPF to support this use case.

For some cases the complexity could go the other way: bpf programs by
design are small programs that can be loaded/unloaded dynamically and
work on their own... easily adaptable to dynamic workloads... not all
bpf use cases are the same...

Stuffing *everything* together and performing round trips between the
main container and the container transferring, loading and attaching
bpf programs makes you question what the advantage is.
Andrii Nakryiko June 12, 2023, 10:08 p.m. UTC | #16
On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >>
> >> >> Andrii Nakryiko <andrii@kernel.org> writes:
> >> >>
> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> >> > application. Trust is the key here. This functionality is not about allowing
> >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> >> > completely up to the discretion of respective privileged application that
> >> >> > would create a BPF token.
> >> >>
> >> >> I am not convinced that this token-based approach is a good way to solve
> >> >> this: having the delegation mechanism be one where you can basically
> >> >> only grant a perpetual delegation with no way to retract it, no way to
> >> >> check what exactly it's being used for, and that is transitive (can be
> >> >> passed on to others with no restrictions) seems like a recipe for
> >> >> disaster. I believe this was basically the point Casey was making as
> >> >> well in response to v1.
> >> >
> >> > Most of this can be added, if we really need to. Ability to revoke BPF
> >> > token is easy to implement (though of course it will apply only for
> >> > subsequent operations). We can allocate ID for BPF token just like we
> >> > do for BPF prog/map/link and let tools iterate and fetch information
> >> > about it. As for controlling who's passing what and where, I don't
> >> > think the situation is different for any other FD-based mechanism. You
> >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
> >> > or BPF FS, and that application can keep doing the same to other
> >> > processes.
> >>
> >> No, but every other fd-based mechanism is limited in scope. E.g., if you
> >> pass a map fd that's one specific map that can be passed around, with a
> >> token it's all operations (of a specific type) which is way broader.
> >
> > It's not black and white. Once you have a BPF program FD, you can
> > attach it many times, for example, and cause regressions. Sure, here
> > we are talking about creating multiple BPF maps or loading multiple
> > BPF programs, so it's wider in scope, but still, it's not that
> > fundamentally different.
>
> Right, but the difference is that a single BPF program is a known
> entity, so even if the application you pass the fd to can attach it
> multiple times, it can't make it do new things (e.g., bpf_probe_read()
> stuff it is not supposed to). Whereas with bpf_token you have no such
> guarantee.

Sure, I'm not claiming BPF token is just like passing BPF program FD
around. My point is that anything in the kernel that is representable
by FD can be passed around to an unintended process through
SCM_RIGHTS. And if you want to have tighter control over who's passing
what, you'd probably need LSM. But it's not a requirement.

With BPF token it is important to trust the application you are
passing BPF token to. This is not a mechanism to just freely pass
around the ability to do BPF. You do it only to applications you
control.

You can initiate BPF token from under CAP_SYS_ADMIN only. If you give
CAP_SYS_ADMIN to some application that might pass BPF token to some
random application, you should probably revisit the whole approach.
You can do a lot of harm with that CAP_SYS_ADMIN beyond the BPF
subsystem.

On the other hand, the more correct comparison would be whether to
give some unprivileged application a BPF token versus giving it
CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN (or the necessary
subset of it). With BPF token you can narrow down to what exact types
of programs and maps it can use, if at all. BPF token applies to BPF
subsystem only. With caps, you are giving that application way more
power than you'd like, but that's ok in practice, because a) you need
that application to do something useful with BPF, so you take that
risk, and b) you normally would control that application, so you are
mitigating this risk even without any LSM or something like that on
top.

We do the latter all the time because we have to. BPF token gives us
a more well-scoped alternative.

With user namespaces, if we could grant CAP_BPF and co to use BPF,
we'd do that. But we can't. BPF token at least gives us this
opportunity.

So while I understand your concerns in principle, I think they are a
bit overblown in practice.

>
> >>
> >> > Ultimately, currently we have root permissions for applications that
> >> > need BPF. That's already very dangerous. But just because something
> >> > might be misused or abused doesn't prevent us from making a good
> >> > practical use of it, right?
> >>
> >> That's not a given. It's always a trade-off, and if the mechanism is
> >> likely to open up the system to additional risk that's not a good
> >> trade-off even if it helps in some case. I basically worry that this is
> >> the case here.
> >>
> >> > Also, there is LSM on top of all of this to override and control how
> >> > the BPF subsystem is used, regardless of BPF token. It can override
> >> > any of the privileges mechanism, capabilities, BPF token, whatnot.
> >>
> >> If this mechanism needs an LSM to be used safely, that's not incredibly
> >> confidence-inspiring. Security mechanisms should fail safe, which this
> >> one does not.
> >
> > I proposed to add authoritative LSM hooks that would selectively allow
> > some of BPF operations on a case-by-case basis. This was rejected,
> > claiming that the best approach is to give process privilege to do
> > whatever it needs to do and then restrict it with LSM.
> >
> > Ok, if not for user namespaces, that would mean giving application
> > CAP_BPF+CAP_PERFMON+CAP_NET_ADMIN+CAP_SYS_ADMIN, and then restrict it
> > with LSM. Except with user namespace that doesn't work. So that's
> > where BPF token comes in, but allows it to do it more safely by
> > allowing to coarsely tune what subset of BPF operations is granted.
> > And then LSM should be used to further restrict it.
>
> Right, I do understand the use case; my worry is that we're creating a
> privilege escalation model that is really broad if it is *not* coupled
> with an LSM to restrict it. Which will be the default outside of
> controlled environments that really know what they are doing.

Look, you are worried that you gave some process root permissions and
that process delegated a small portion of that (BPF token) to an
unprivileged process, which abuses it somehow. Beyond the question of
"why did you grant root permissions to something you can't trust to do
the right thing", isn't there a more dangerous stuff (I don't know,
setuid, chmod/chown, etc) that root process can perform to grant
unprivileged process unintended and uncontrolled privileges?

Why is BPF token the one singled out that would have to require a
mandatory LSM to be installed?

>
> So I dunno, maybe some way to restrict the token so it only grants
> privilege if there is *also* an explicit LSM verdict on it? I guess
> that's still too close to an authoritative LSM hook for it to pass? I
> do think the "explicit grant" model of an authoritative LSM is a better
> fit for this kind of thing...
>

I proposed an authoritative LSM; it was pretty plainly rejected, and
the model of "grant a lot + restrict with LSM" was suggested.

> >> I'm also worried that an LSM policy is the only way to disable the
> >> ability to create a token; with this in the kernel, I suddenly have to
> >> trust not only that all applications with BPF privileges will not load
> >> malicious code, but also that they won't (accidentally or maliciously)
> >> convey extra privileges to someone else. Seems a bit broad to have this
> >> ability (to issue tokens) available to everyone with access to the bpf()
> >> syscall, when (IIUC) it's only a single daemon in the system that would
> >> legitimately do this in the deployment you're envisioning.
> >
> > Note, any process with real CAP_SYS_ADMIN. Let's not forget that.
> >
> > But would you feel better if BPF_TOKEN_CREATE was guarded behind
> > sysctl or Kconfig?
>
> Hmm, yeah, some way to make sure it's off by default would be
> preferable, IMO.
>
> > Ultimately, worrying is fine, but there are real problems that need to
> > be solved. And not doing anything isn't a great option.
>
> Right, it would be good if some of the security folks could chime in
> with their view of how this is best achieved without running into any of
> the "bad ideas" they are opposed to.

agreed

>
> >> >> If the goal is to enable a privileged application (such as a container
> >> >> manager) to grant another unprivileged application the permission to
> >> >> perform certain bpf() operations, why not just proxy the operations
> >> >> themselves over some RPC mechanism? That way the granting application
> >> >
> >> > It's explicitly what we *do not* want to do, as it is a major problem
> >> > and logistical complication. Every single application will have to be
> >> > rewritten to use such a special daemon/service and its API, which is
> >> > completely different from bpf() syscall API. It invalidates the use of
> >> > all the libbpf (and other bpf libraries') APIs, BPF skeleton is
> >> > incompatible with this. It's a nightmare. I've got feedback from
> >> > people in another company that do have BPF service with just a tiny
> >> > subset of BPF functionality delegated to such service, and it's a pain
> >> > and definitely not a preferred way to do things.
> >>
> >> But weren't you proposing that libbpf should be able to transparently
> >> look for tokens and load them without any application changes? Why can't
> >> libbpf be taught to use an RPC socket in a similar fashion? It basically
> >> boils down to something like:
> >>
> >> static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
> >>                           unsigned int size)
> >> {
> >>         if (!access("/run/bpf.sock", F_OK)) {
> >>                 int sock = open_socket("/run/bpf.sock");
> >>                 write_to(sock, cmd, attr, size);
> >>                 return read_response(sock);
> >>         } else {
> >>                 return syscall(__NR_bpf, cmd, attr, size);
> >>         }
> >> }
> >>
> >
> > Well, for one, Meta will use its own Thrift-based RPC protocol.
> > Google might use something internal for them using GRPC, someone else
> > would want to utilize systemd, yet others will use yet another
> > implementation. RPC introduces more failure modes. While with syscall
> > we know that operation either succeeded or failed, with RPC we'll have
> > to deal with "maybe", if it was some communication error.
> >
> > Let's not trivialize adding, using, and supporting the RPC version of
> > bpf() syscall.
>
> I am not trying to trivialise it; I am well aware that it is more
> complicated in practice than just adding a wrapper like the above. I am
> just arguing with your point that "all applications need to change, so
> we can't do RPC". Any mechanism we add along these lines will require
> application changes, including the BPF token. And if the way we're going

Well, it depends on what kinds of changes we are talking about. E.g.,
in the most explicit case, it would be something like:

int token_fd = bpf_token_get("/sys/fs/bpf/my_granted_token");
if (token_fd < 0)
        token_fd = 0; /* we can bail out or just assume no token */

LIBBPF_OPTS(bpf_object_open_opts, opts, .token_fd = token_fd);

struct my_skel *skel = my_skel__open_opts(&opts);


That's literally it. And if we have some convention that libbpf will
try to open, say, /sys/fs/bpf/.token automatically, there will be zero
code changes. And I'm not oversimplifying this.
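
The automatic fallback inside libbpf would be equally boring, roughly
(path and field name are part of the yet-to-be-agreed convention, not
an existing API):

static int bpf_object__try_default_token(struct bpf_object *obj)
{
        int fd = bpf_obj_get("/sys/fs/bpf/.token"); /* BPF_OBJ_GET */

        if (fd < 0)
                return 0; /* no token pinned, proceed without one */
        obj->token_fd = fd; /* hypothetical internal field */
        return 0;
}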


> to avoid that is by baking the support into libbpf, then that can be
> done regardless of the mechanism we choose.
>
> Or to put it another way: as you say it may be more *complicated* to add
> an RPC-based path to libbpf, but it's not fundamentally impossible, it's
> just another technical problem to be solved. And if that added
> complexity buys us better security properties, maybe that is a good
> trade-off. At least we shouldn't dismiss it out of hand.

You are oversimplifying this. There is a huge difference between
syscall and RPC interfaces.

The former (syscall approach) will error out only on invalid inputs
(or, highly improbably, if the kernel runs out of memory, which means
your app is dead anyway). You don't code against a syscall interface
with the expectation that it can fail at any point and that you should
be able to recover from it.

With RPC you have to bake into your application that any RPC can
fail transiently, for many reasons. The service could be down, restarted,
slow, etc, etc. This changes *everything* in how you develop your
application, how you write code, how you handle errors, how you
monitor stuff. Everything.

It's impossible to just swap out syscall with RPC transparently
without introducing horrible consequences. This is not some technical
difficulty, it's a fundamental impedance mismatch. One of the early
distributed systems mistakes was to pretend that remote procedure
calls could be reliable, that errors are rare, and that RPCs behave
like syscalls or local in-process APIs. It has
been recognized many times over how bad such approaches were. It's
outside of the scope of this discussion to go into more details.
Suffice it to say that libbpf is not going to pretend that syscall and
some RPC are equivalent and can be interchangeable in a transparent
way.

And then, even if we were crazy enough to do the above, there is no
way everyone will settle on one single implementation and/or RPC
protocol and API such that libbpf could implement it in its upstream
version. Big companies most probably will go with their own internal
ones that would give them better integration with internal
infrastructure, better observability, etc. And even in open source
there probably won't be one single implementation everyone will be
happy with.

>
> -Toke
Andrii Nakryiko June 12, 2023, 10:27 p.m. UTC | #17
On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > >
> > > Hi Andrii,
> > >
> > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > > ...
> > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > >
> > > Is there a reason for coupling this only with the userns?
> >
> > There is no coupling. Without userns it is at least possible to grant
> > CAP_BPF and other capabilities from init ns. With user namespace that
> > becomes impossible.
>
> But these are not the same: delegate full cap vs delegate an fd mask?

What FD mask are we talking about here? I don't recall us talking
about any FD masks, so this one is a bit confusing without more
context.

>
> One can argue that unprivileged in the init userns is the same as
> privileged in a nested userns.
> Getting to delegate the fd in the init userns, then in nested ones, seems logical...

Again, sorry, I'm not following. Can you please elaborate what you mean?

>
> > > The "trusted unprivileged" assumed by systemd can be in init userns?
> >
> > It doesn't have to be systemd, but yes, BPF token can be created only
> > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> > of commands).
>
> I'm more into getting fd delegation to work also in the first init userns...
>
> I can't understand why it's not possible or doable?
>

I don't know what you are proposing, as I mentioned above, so it's
hard to answer this question.

> > >
> > >
> > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > (context), which in combination with BPF LSM would allow implementing a very
> > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > interest of minimizing API surface area discussions this is going to be
> > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > of delegatable BPF token.
> > > >
> > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > allowing multiple independent instances of them, each with its own set of
> > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > >
> > > What's the use case for transfering over unix domain sockets?
> >
> > I'm not sure I understand the question. Unix domain socket
> > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > files between processes, which is one way to pass BPF object (like
> > prog/map/link, and now token). BPF FS is the other one. In practice
> > it's usually BPF FS, but there is no presumption about how file
> > reference is transferred.
>
> Got it.
>
> IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> userns, no?
>
> I assume that's what allows setting things up in a hierarchical way...
>
> If I set up the environment to lock things down the line, I find it
> strange if a received fd would allow me to do more things than what
> was planned when I created the environment: namespaces, mounts, etc
>
> I think you have to add the owning userns context to the fd or
> "token", and on the receiving part if the current userns is the same
> or a nested one of the current userns hierarchy then allow bpf
> operation, otherwise fail with -EACCESS or something similar...
>

I think I mentioned problems with namespacing BPF itself. It's just
fundamentally impossible due to the system-wide nature of BPF. So we can
pretend to somehow attach/restrict BPF token to some namespace, but it
still allows BPF programs to peek at any kernel state or user-space
process.

So I'd rather we not pretend we can do something that we actually
cannot enforce.

>
> > >
> > > Will BPF token translation happen if you cross the different namespaces?
> >
> > What does BPF token translation mean specifically? Currently it's a
> > very simple kernel object with refcnt and a few flags, so there is
> > nothing to translate?
>
> Please see above comment about the owning userns context
>
> > >
> > > If the token is pinned into different bpffs, will the token share the
> > > same context?
> >
> > So I was planning to allow a user process creating a BPF token to
> > specify custom user-provided data (context). This is not in this patch
> > set, but is it what you are asking about?
>
> Exactly, define what you can access inside the container... this would
> align with Andy's suggestion "making BPF behave sensibly in that
> container seems like it should also be necessary." I do agree on this.
>

I don't know what Andy's suggestion actually is (as I honestly can't
make out what your proposal is, sorry; you guys are not making it easy
on me by being pretty vague and nonspecific). But see above about
pretending to contain BPF within a container. There is no such thing.
BPF is system-wide.

> Again I think LSM and bpf+lsm should have the final word on this too...
>

Yes, I also think that having LSM on top is beneficial. But not a
strict requirement and more or less orthogonal.

>
> > Regardless, pinning BPF object in BPF FS is just basically bumping a
> > refcnt and exposes that object in a way that can be looked up through
> > file system path (using bpf() syscall's BPF_OBJ_GET command).
> > Underlying object isn't cloned or copied, it's exactly the same object
> > with the same shared internal state.
>
> This is the part I also find strange, I can understand pinning a bpf
> program, map, etc, but an fd that gives some access rights should be
> part of the filesystem from the start, I don't get the extra pinning.

BPF pinning of BPF token is optional. Everything still works without
any BPF FS mount at all. It's an FD; BPF FS is just one of the means
to pass an FD to another process. I actually don't see why coupling BPF
FS and BPF token is simpler.

Now, BPF token is a kernel object, with its own state. It has an FD
associated with it. It can be passed around and provided as an
argument to bpf() syscall. In that sense it's just like BPF
prog/map/link, just another BPF object.
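
In sketch form (the attr field names follow this patch set's proposal
and may still change):

union bpf_attr attr = {};

/* privileged side: mint a token, optionally restricting what it allows */
int token_fd = syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));

/* unprivileged side, after receiving token_fd via BPF FS or
 * SCM_RIGHTS: pass it to the commands that accept it */
attr = (union bpf_attr){};
attr.map_type = BPF_MAP_TYPE_ARRAY;
attr.key_size = 4;
attr.value_size = 8;
attr.max_entries = 1;
attr.map_token_fd = token_fd;

int map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));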

> Also it seems bpffs is per superblock mount so why not allow
> privileged to mount bpffs with the corresponding information, then
> privileged can open the fd, set it up and pass it down the line when
> executing the main program?  or even allow unprivileged to open it on
> bpffs with some restrictive conditions?
>
> Then it would be the business of the privileged to bind mount bpffs in
> some other places, share it, etc

How is this fundamentally different from BPF token pinning by a
*privileged* process? Except that we are not conflating BPF FS as a way to
pin/get many different BPF objects with BPF token itself. In both
cases it's up to privileged process to set up sharing of BPF token
appropriately.

>
> Having the fd or "token" that gives access rights pinned in two
> separate bpffs mounts seems too much, it crosses namespaces (mount,
> userns etc), environments setup by privileged...

See above, there is nothing namespaceable about BPF itself, and BPF
token as well. If some production setup benefits from pinning one BPF
token in multiple places, I don't see the problem with that.

>
> I would just make it per bpffs mount and that's it, nothing more. If a
> program wants to bind mount it somewhere else then it's not a bpf
> problem.

And if some application wants to pin BPF token, why would that be BPF
subsystem's problem as well?
Andrii Nakryiko June 12, 2023, 11:04 p.m. UTC | #18
On Mon, Jun 12, 2023 at 5:45 AM Dave Tucker <datucker@redhat.com> wrote:
>
>
>
> > On 8 Jun 2023, at 00:53, Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > This patch set introduces new BPF object, BPF token, which allows to delegate
> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> > systemd or any other container manager) to a *trusted* unprivileged
> > application. Trust is the key here. This functionality is not about allowing
> > unconditional unprivileged BPF usage. Establishing trust, though, is
> > completely up to the discretion of respective privileged application that
> > would create a BPF token.
>
>
> Hello! Author of a bpfd[1] here.
>
> > The main motivation for BPF token is a desire to enable containerized
> > BPF applications to be used together with user namespaces. This is currently
> > impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
> > or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
> > helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
> > arbitrary memory, and it's impossible to ensure that they only read memory of
> > processes belonging to any given namespace. This means that it's impossible to
> > have namespace-aware CAP_BPF capability, and as such another mechanism to
> > allow safe usage of BPF functionality is necessary. BPF token and delegation
> > of it to a trusted unprivileged applications is such mechanism. Kernel makes
> > no assumption about what "trusted" constitutes in any particular case, and
> > it's up to specific privileged applications and their surrounding
> > infrastructure to decide that. What kernel provides is a set of APIs to create
> > and tune BPF token, and pass it around to privileged BPF commands that are
> > creating new BPF objects like BPF programs, BPF maps, etc.
>
> You could do that… but the problem is created due to the pattern of having a
> single binary that is responsible for:
>
> - Loading and attaching the BPF program in question
> - Interacting with maps

It is a very desirable property to couple and deploy a user process and
its BPF programs/maps together and manage their lifecycle directly.
All of Meta's production applications are using this model. This
allows for a simple and reliable versioning story. This allows using
BPF skeleton and BPF global variables naturally. It makes it simple
and easy to develop, debug, version, deploy, monitor BPF applications.

It also couples BPF program attachment (link) with the lifetime of the
user space process. So if it crashes or restarts without clean
detachment, we don't end up with orphaned BPF programs and maps. We've
had pretty bad issues due to such orphaned programs, and that's why
the whole BPF link concept was formalized.

So it's actually a desirable approach in a real-world production setup.
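
Concretely (my_skel is an illustrative skeleton name):

struct my_skel *skel = my_skel__open_and_load();

/* attach creates bpf_link FDs owned by this process */
my_skel__attach(skel);

/* ... do work; if we crash right here, the kernel releases the
 * link FDs and detaches the programs, so nothing is orphaned */

my_skel__destroy(skel); /* clean path: detach and free everything */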

>
> Let’s set aside some of the other fun concerns of eBPF in containers:
>  - Requiring mounting of vmlinux, bpffs, traces etc…
>  - How fs permissions on host translate into permissions in containers
>
> While your proposal lets you grant a subset of CAP_BPF to some other process,
> which I imagine could also be done with SELinux, it doesn’t stop you from needing
> other required permissions for attaching tracing programs in such an
> environment.

In some cases yes, there are other parts of the kernel that would
require some more work before they can be used. But a lot of things are
possible within the bpf() syscall already, including tracing stuff.

>
> For example, say container A wants to attach a uprobe to a process in container B.
> Container A needs to be able to nsenter into container B’s pidns in order for attachment
> to succeed… but then what I can do with CAP_BPF is the least of my concerns since
> I’d wager I’d need to mount `/proc` from the host in container A + have elevated privileges
> much scarier than CAP_BPF in the first place.

You'd wager, or you know for sure? I haven't tried, so I won't make any claims.

I do know, though, that our system-wide profiling agent (not running
under a user namespace) can attach to and profile namespaced
applications running inside containers without any nsenter.

But again, uprobe'ing some other container is just one of the possible use
cases. Even if some scenarios would require more stuff beyond the BPF
token, it doesn't invalidate the need and usefulness of the BPF token.

>
> If you move “Loading and attaching” away to somewhere else (i.e a daemon like bpfd)
> then with recent kernels your container workload should be fine to run entirely unprivileged,
> or worst case with only CAP_BPF since all you need to do is read/write maps.

Except we explicitly want to avoid the need for some external entity
loading BPF programs on our behalf, like I explained in replies to
Toke.

>
> Policy control - which process can request to load programs that monitor which other
> processes - would happen within this system daemon and you wouldn’t need tokens.

And we can do the same through controlling which containers/services
are issued BPF tokens. And in addition to that could employ LSM for
more dynamic and fine-granular control.

Doing this through a centralized daemon is one way of doing it. But
it's not universally the better way.

>
> Since it’s easy enough to do this in userspace, I’d be strongly against adding more
> complexity into BPF to support this usecase.

I appreciate you trying to get more customers for bpfd; there is
nothing wrong with that. But this approach has major (good and bad)
implications and is not the most appropriate solution in a lot of
cases and setups.

As for complexity: if you look at the code, you'll see that it's a
completely optional feature as far as BPF UAPI goes, so your customers
won't need to care about BPF token's existence if they are happy using
the bpfd solution.

>
> > Previous attempt at addressing this very same problem ([0]) attempted to
> > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > LSM maintainers. BPF token concept is not changing anything about LSM
> > approach, but can be combined with LSM hooks for very fine-grained security
> > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > (context), which in combination with BPF LSM would allow implementing a very
> > dynamic and fine-granular custom security policies on top of BPF token. In the
> > interest of minimizing API surface area discussions this is going to be
> > added in follow up patches, as it's not essential to the fundamental concept
> > of delegatable BPF token.
> >
> > It should be noted that BPF token is conceptually quite similar to the idea of
> > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > difference is the idea of using virtual anon_inode file to hold BPF token and
> > allowing multiple independent instances of them, each with its own set of
> > restrictions. BPF pinning solves the problem of exposing such BPF token
> > through file system (BPF FS, in this case) for cases where transferring FDs
> > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > syscall accepts token_fd parameters explicitly for each relevant BPF command.
> > This addresses main concerns brought up during the /dev/bpf discussion, and
> > fits better with overall BPF subsystem design.
> >
> > This patch set adds a basic minimum of functionality to make BPF token useful
> > and to discuss API and functionality. Currently only low-level libbpf APIs
> > support passing BPF token around, allowing to test kernel functionality, but
> > for the most part is not sufficient for real-world applications, which
> > typically use high-level libbpf APIs based on `struct bpf_object` type. This
> > was done with the intent to limit the size of patch set and concentrate on
> > mostly kernel-side changes. All the necessary plumbing for libbpf will be sent
> > as a separate follow up patch set kernel support makes it upstream.
> >
> > Another part that should happen once kernel-side BPF token is established, is
> > a set of conventions between applications (e.g., systemd), tools (e.g.,
> > bpftool), and libraries (e.g., libbpf) about sharing BPF tokens through BPF FS
> > at well-defined locations to allow applications take advantage of this in
> > automatic fashion without explicit code changes on BPF application's side.
> > But I'd like to postpone this discussion to after BPF token concept lands.
> >
> >  [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
> >  [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
> >  [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
> >
>
> - Dave
>
> [1]: https://github.com/bpfd-dev/bpfd
>
Hao Luo June 13, 2023, 9:48 p.m. UTC | #19
On Mon, Jun 12, 2023 at 3:08 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >
<...>
> > to avoid that is by baking the support into libbpf, then that can be
> > done regardless of the mechanism we choose.
> >
> > Or to put it another way: as you say it may be more *complicated* to add
> > an RPC-based path to libbpf, but it's not fundamentally impossible, it's
> > just another technical problem to be solved. And if that added
> > complexity buys us better security properties, maybe that is a good
> > trade-off. At least we shouldn't dismiss it out of hand.
>
> You are oversimplifying this. There is a huge difference between
> syscall and RPC interfaces.
>
> The former (syscall approach) will error out only on invalid inputs
> (or, highly improbably, if the kernel runs out of memory, which means
> your app is dead anyway). You don't code against a syscall interface
> with the expectation that it can fail at any point and that you should
> be able to recover from it.
>
> With RPC you have to bake into your application that any RPC can
> fail transiently, for many reasons. The service could be down, restarted,
> slow, etc, etc. This changes *everything* in how you develop your
> application, how you write code, how you handle errors, how you
> monitor stuff. Everything.
>
> It's impossible to just swap out syscall with RPC transparently
> without introducing horrible consequences. This is not some technical
> difficulty, it's a fundamental impedance mismatch. One of the early
> distributed systems mistakes was to pretend that remote procedure
> calls could be reliable, that errors are rare, and that RPCs behave
> like syscalls or local in-process APIs. It has
> been recognized many times over how bad such approaches were. It's
> outside of the scope of this discussion to go into more details.
> Suffice it to say that libbpf is not going to pretend that syscall and
> some RPC are equivalent and can be interchangeable in a transparent
> way.
>
> And then, even if we were crazy enough to do the above, there is no
> way everyone will settle on one single implementation and/or RPC
> protocol and API such that libbpf could implement it in its upstream
> version. Big companies most probably will go with their own internal
> ones that would give them better integration with internal
> infrastructure, better observability, etc. And even in open source
> there probably won't be one single implementation everyone will be
> happy with.
>

Hello Toke and Andrii,

I agree with Andrii here. At Google, we have several years of
experience building and using a BPF RPC service, to which we delegate
BPF operations. From that experience, the RPC approach is quite
limiting and becomes impractical for many BPF use cases.

For programs that do not require much user interaction, it works just
fine: the service loads and attaches the programs, that's all. The
problem is programs that require a lot of user interaction, for
example the ones doing observability, which often read maps or poll on
the BPF ringbuf. Overhead and reliability of RPC is one concern.
Another problem is BPF operations based on mmap, for example directly
updating/reading BPF global variables as used in skeletons. We still
haven't figured out how to fully support BPF skeletons, and we also
haven't figured out how to support the BPF ringbuf over RPC. There are
also problems maintaining this service to keep up with new features
in libbpf.
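
To make the mmap point concrete, here is a minimal syscall-level
sketch (illustrative only, error handling trimmed): once a
BPF_F_MMAPABLE array map is mapped into the process, subsequent
accesses are plain loads and stores, so there is no syscall boundary
left for an RPC service to mediate.

/* minimal sketch: direct mmap access to a BPF map bypasses any proxy */
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	union bpf_attr attr;
	uint64_t *val;
	int map_fd;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_ARRAY;
	attr.key_size = sizeof(uint32_t);
	attr.value_size = sizeof(uint64_t);
	attr.max_entries = 1;
	attr.map_flags = BPF_F_MMAPABLE;

	/* the only bpf() syscall in this flow */
	map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	if (map_fd < 0)
		return 1;

	val = mmap(NULL, sizeof(*val), PROT_READ | PROT_WRITE, MAP_SHARED,
		   map_fd, 0);
	if (val == MAP_FAILED)
		return 1;

	/* plain memory store: no syscall, nothing for an RPC service to see */
	*val = 42;
	return 0;
}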

Anyway, the syscall interface is heavily baked into libbpf and the
bpf kernel interfaces today, and there are many BPF use cases where
delegating all BPF operations to a service can't work well. IMHO, to
achieve a good balance between flexibility and security, some
abstraction that conveys controlled trust from privileged to
unprivileged is necessary. The idea of a BPF token makes sense to me.
With a token, the libbpf interface requires only minimal change: an
unprivileged user can call libbpf and the bpf syscall natively, which
wins on efficiency and means less maintenance burden for libbpf
developers.
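
To illustrate how minimal that change is at the syscall level, the
delegated path would look roughly like the sketch below. This assumes
a map_token_fd-style field based on the cover letter's description of
token_fd parameters per command; the exact uapi layout is whatever the
patch set defines.

/* hypothetical sketch: token-assisted map creation; the map_token_fd
 * field name is an assumption, not settled uapi; token_fd was
 * obtained earlier from the privileged daemon */
union bpf_attr attr;
int map_fd;

memset(&attr, 0, sizeof(attr));
attr.map_type = BPF_MAP_TYPE_HASH;
attr.key_size = 4;
attr.value_size = 8;
attr.max_entries = 128;
attr.map_token_fd = token_fd;

/* the same native syscall as today, no RPC hop in the data path */
map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));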

Thanks,
Hao
Djalal Harouni June 14, 2023, 12:23 a.m. UTC | #20
On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> >
> > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > > >
> > > > Hi Andrii,
> > > >
> > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > > ...
> > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > >
> > > > Is there a reason for coupling this only with the userns?
> > >
> > > There is no coupling. Without userns it is at least possible to grant
> > > CAP_BPF and other capabilities from init ns. With user namespace that
> > > becomes impossible.
> >
> > But these are not the same: delegate full cap vs delegate an fd mask?
>
> What FD mask are we talking about here? I don't recall us talking
> about any FD masks, so this one is a bit confusing without more
> context.

Ah err, sorry yes referring to fd token (which I assumed is a mask of
allowed operations or something like that).

So I want the possibility to delegate the fd token in the init userns.

> >
> > One can argue unprivileged in the init userns is the same as privileged in
> > a nested userns.
> > Getting to delegate the fd in the init userns, then in nested ones, seems logical...
>
> Again, sorry, I'm not following. Can you please elaborate what you mean?

I mean, can we use the fd token in the init user namespace too? Not
only in the nested user namespaces but in the first one? Sorry, I
didn't check the code.


> >
> > > > The "trusted unprivileged" assumed by systemd can be in init userns?
> > >
> > > It doesn't have to be systemd, but yes, BPF token can be created only
> > > when you have CAP_SYS_ADMIN in init ns. It's in line with restrictions
> > > on a bunch of other bpf() syscall commands (like GET_FD_BY_ID family
> > > of commands).
> >
> > I'm more into getting fd delegation work also in the first init userns...
> >
> > I can't understand why it's not possible or doable?
> >
>
> I don't know what you are proposing, as I mentioned above, so it's
> hard to answer this question.
>


> > > >
> > > >
> > > > > Previous attempt at addressing this very same problem ([0]) attempted to
> > > > > utilize authoritative LSM approach, but was conclusively rejected by upstream
> > > > > LSM maintainers. BPF token concept is not changing anything about LSM
> > > > > approach, but can be combined with LSM hooks for very fine-grained security
> > > > > policy. Some ideas about making BPF token more convenient to use with LSM (in
> > > > > particular custom BPF LSM programs) was briefly described in recent LSF/MM/BPF
> > > > > 2023 presentation ([1]). E.g., an ability to specify user-provided data
> > > > > (context), which in combination with BPF LSM would allow implementing a very
> > > > > dynamic and fine-granular custom security policies on top of BPF token. In the
> > > > > interest of minimizing API surface area discussions this is going to be
> > > > > added in follow up patches, as it's not essential to the fundamental concept
> > > > > of delegatable BPF token.
> > > > >
> > > > > It should be noted that BPF token is conceptually quite similar to the idea of
> > > > > /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
> > > > > difference is the idea of using virtual anon_inode file to hold BPF token and
> > > > > allowing multiple independent instances of them, each with its own set of
> > > > > restrictions. BPF pinning solves the problem of exposing such BPF token
> > > > > through file system (BPF FS, in this case) for cases where transferring FDs
> > > > > over Unix domain sockets is not convenient. And also, crucially, BPF token
> > > > > approach is not using any special stateful task-scoped flags. Instead, bpf()
> > > >
> > > > What's the use case for transferring over unix domain sockets?
> > >
> > > I'm not sure I understand the question. Unix domain socket
> > > (specifically its SCM_RIGHTS ancillary message) allows to transfer
> > > files between processes, which is one way to pass BPF object (like
> > > prog/map/link, and now token). BPF FS is the other one. In practice
> > > it's usually BPF FS, but there is no presumption about how file
> > > reference is transferred.
> >
> > Got it.
> >
> > IIRC SCM_RIGHTS and SCM_CREDENTIALS are translated into the receiving
> > userns, no?
> >
> > I assume that's what allows setting things up in a hierarchical way...
> >
> > If I set up the environment to lock things down, then down the line I
> > find it strange if a received fd would allow me to do more things than
> > what was planned when I created the environment: namespaces, mounts,
> > etc.
> >
> > I think you have to add the owning userns context to the fd or
> > "token", and on the receiving part, if the current userns is the same
> > or a nested one of the current userns hierarchy, then allow the bpf
> > operation, otherwise fail with -EACCES or something similar...
> >
>
> I think I mentioned problems with namespacing BPF itself. It's just
> fundamentally impossible due to a system-wide nature of BPF. So we can
> pretend to somehow attach/restrict BPF token to some namespace, but it
> still allows BPF programs to peek at any kernel state or user-space
> process.

I'm not referring to namespacing BPF, but to the same token being able
to fly between containers...
More or less the problems mentioned by Casey:
https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823

I think that a token or the fd should be part of the bpffs and should
not be shared between containers or across namespaces by default
without control... hence the suggested protection:
https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd


> So I'd rather us not pretend we can do something that we actually
> cannot enforce.

Actually it is to protect against accidental token sharing or abuse...
so completely different things.


> >
> > > >
> > > > Will BPF token translation happen if you cross the different namespaces?
> > >
> > > What does BPF token translation mean specifically? Currently it's a
> > > very simple kernel object with refcnt and a few flags, so there is
> > > nothing to translate?
> >
> > Please see above comment about the owning userns context
> >
> > > >
> > > > If the token is pinned into different bpffs, will the token share the
> > > > same context?
> > >
> > > So I was planning to allow a user process creating a BPF token to
> > > specify custom user-provided data (context). This is not in this patch
> > > set, but is it what you are asking about?
> >
> > Exactly, define what you can access inside the container... this would
> > align with Andy's suggestion "making BPF behave sensibly in that
> > container seems like it should also be necessary." I do agree on this.
> >
>
> I don't know what Andy's suggestion actually is (as I honestly can't
> make out what your proposal is, sorry; you guys are not making it easy
> on me by being pretty vague and nonspecific). But see above about
> pretending to contain BPF within a container. There is no such thing.
> BPF is system-wide.

Sorry about that. Quickly put: you may restrict the types of bpf
programs; you may disable or no-op probes if they are running without
a process context, or if the triggered probe is owned by root or by a
specific uid; or if the process is under a specific cgroup hierarchy,
etc... Are the above possible?


> > Again I think LSM and bpf+lsm should have the final word on this too...
> >
>
> Yes, I also think that having LSM on top is beneficial. But not a
> strict requirement and more or less orthogonal.

I do think there should be LSM hooks to tighten this, as LSMs have
more context outside of BPF...


> >
> > > Regardless, pinning BPF object in BPF FS is just basically bumping a
> > > refcnt and exposes that object in a way that can be looked up through
> > > file system path (using bpf() syscall's BPF_OBJ_GET command).
> > > Underlying object isn't cloned or copied, it's exactly the same object
> > > with the same shared internal state.
> >
> > This is the part I also find strange: I can understand pinning a bpf
> > program, map, etc., but an fd that gives some access rights should be
> > part of the filesystem from the start; I don't get the extra pinning.
>
> BPF pinning of BPF token is optional. Everything still works without
> any BPF FS mount at all. It's an FD, BPF FS is just one of the means
> to pass FD to another process. I actually don't see why coupling BPF
> FS and BPF token is simpler.

> > I think it's better the other way around: since bpffs is per
> > superblock and separately mounted, it is already solved; you just get
> > that special fd from the fs and pass it...


> Now, BPF token is a kernel object, with its own state. It has an FD
> associated with it. It can be passed around and provided as an
> argument to bpf() syscall. In that sense it's just like BPF
> prog/map/link, just another BPF object.
>
> > Also it seems bpffs is a per-superblock mount, so why not allow
> > privileged to mount bpffs with the corresponding information; then
> > privileged can open the fd, set it up, and pass it down the line when
> > executing the main program? Or even allow unprivileged to open it on
> > bpffs with some restrictive conditions?
> >
> > Then it would be the business of the privileged to bind mount bpffs in
> > some other places, share it, etc
>
> How is this fundamentally different from BPF token pinning by
> *privileged* process? Except we are not conflating BPF FS as a way to
> pin/get many different BPF objects with BPF token itself. In both
> cases it's up to privileged process to set up sharing of BPF token
> appropriately.

I'm not convinced about the use case of sharing BPF tokens between
containers or services...

Every container or service has its own separate bpffs; what's the
point of pinning a shared token created by a different container,
compared to mounting a separate bpffs with an fd token prepared to be
used for that specific container?

Then the container/service can delegate it to child processes, etc...
but sharing between containers, and crossing user namespaces and mount
namespaces of such containers where bpffs is already separate in that
context? I don't see the point, and it just opens the door to token
misuse...


> >
> > Having the fd or "token" that gives access rights pinned in two
> > separate bpffs mounts seems too much; it crosses namespaces (mount,
> > userns, etc.) and environments set up by privileged...
>
> See above, there is nothing namespaceable about BPF itself, and BPF
> token as well. If some production setup benefits from pinning one BPF
> token in multiple places, I don't see the problem with that.
>
> >
> > I would just make it per bpffs mount and that's it, nothing more. If a
> > program wants to bind mount it somewhere else then it's not a bpf
> > problem.
>
> And if some application wants to pin BPF token, why would that be BPF
> subsystem's problem as well?

The credentials, capabilities, keyring, different namespaces, etc. are
all attached to the owning user namespace. If the BPF subsystem goes
its own way and creates a token to split up CAP_BPF without following
that model, then it's definitely a BPF subsystem problem... I don't
recommend that.

It feels like this is going more toward a system-wide approach to
opening up BPF functionality, which ultimately clashes with the
argument: delegate a subset of BPF functionality to a *trusted*
unprivileged application. My reading of delegation is within a
container/service hierarchy, nothing more.
Christian Brauner June 14, 2023, 9:39 a.m. UTC | #21
On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
[...]
> The credentials, capabilities, keyring, different namespaces, etc. are
> all attached to the owning user namespace. If the BPF subsystem goes
> its own way and creates a token to split up CAP_BPF without following
> that model, then it's definitely a BPF subsystem problem... I don't
> recommend that.
>
> It feels like this is going more toward a system-wide approach to
> opening up BPF functionality, which ultimately clashes with the
> argument: delegate a subset of BPF functionality to a *trusted*
> unprivileged application. My reading of delegation is within a
> container/service hierarchy, nothing more.

You're making the exact arguments that Lennart, Aleksa, and I have been
making in the LSFMM presentation about this topic. It's even recorded:

https://youtu.be/4CCRTWEZLpw?t=1546

So we fully agree with you here.
Toke Høiland-Jørgensen June 14, 2023, 12:06 p.m. UTC | #22
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Jun 12, 2023 at 3:49 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Fri, Jun 9, 2023 at 2:21 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >>
>> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> >>
>> >> > On Fri, Jun 9, 2023 at 4:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >> >>
>> >> >> Andrii Nakryiko <andrii@kernel.org> writes:
>> >> >>
>> >> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> >> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> >> >> > systemd or any other container manager) to a *trusted* unprivileged
>> >> >> > application. Trust is the key here. This functionality is not about allowing
>> >> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> >> >> > completely up to the discretion of respective privileged application that
>> >> >> > would create a BPF token.
>> >> >>
>> >> >> I am not convinced that this token-based approach is a good way to solve
>> >> >> this: having the delegation mechanism be one where you can basically
>> >> >> only grant a perpetual delegation with no way to retract it, no way to
>> >> >> check what exactly it's being used for, and that is transitive (can be
>> >> >> passed on to others with no restrictions) seems like a recipe for
>> >> >> disaster. I believe this was basically the point Casey was making as
>> >> >> well in response to v1.
>> >> >
>> >> > Most of this can be added, if we really need to. Ability to revoke BPF
>> >> > token is easy to implement (though of course it will apply only for
>> >> > subsequent operations). We can allocate ID for BPF token just like we
>> >> > do for BPF prog/map/link and let tools iterate and fetch information
>> >> > about it. As for controlling who's passing what and where, I don't
>> >> > think the situation is different for any other FD-based mechanism. You
>> >> > might as well create a BPF map/prog/link, pass it through SCM_RIGHTS
>> >> > or BPF FS, and that application can keep doing the same to other
>> >> > processes.
>> >>
>> >> No, but every other fd-based mechanism is limited in scope. E.g., if you
>> >> pass a map fd that's one specific map that can be passed around, with a
>> >> token it's all operations (of a specific type) which is way broader.
>> >
>> > It's not black and white. Once you have a BPF program FD, you can
>> > attach it many times, for example, and cause regressions. Sure, here
>> > we are talking about creating multiple BPF maps or loading multiple
>> > BPF programs, so it's wider in scope, but still, it's not that
>> > fundamentally different.
>>
>> Right, but the difference is that a single BPF program is a known
>> entity, so even if the application you pass the fd to can attach it
>> multiple times, it can't make it do new things (e.g., bpf_probe_read()
>> stuff it is not supposed to). Whereas with bpf_token you have no such
>> guarantee.
>
> Sure, I'm not claiming BPF token is just like passing BPF program FD
> around. My point is that anything in the kernel that is representable
> by FD can be passed around to an unintended process through
> SCM_RIGHTS. And if you want to have tighter control over who's passing
> what, you'd probably need LSM. But it's not a requirement.
>
> With BPF token it is important to trust the application you are
> passing BPF token to. This is not a mechanism to just freely pass
> around the ability to do BPF. You do it only to applications you
> control.

Trust is not binary, though. "Do I trust this application to perform
this specific action" is different from "do I trust this application to
perform any action in the future". A security mechanism should grant
the minimum privileges required to perform the operation; this token
thing encourages (defaults to) broader grants, which is worrisome.

> With user namespaces, if we could grant CAP_BPF and co to use BPF,
> we'd do that. But we can't. BPF token at least gives us this
> opportunity.

If the use case is to punch holes in the user namespace isolation I feel
like that is better solved at the user namespace level than the BPF
subsystem level...

-Toke


(Ran out of time and I'm about to leave for PTO, so dropping the RPC
discussion for now)
Andrii Nakryiko June 15, 2023, 10:47 p.m. UTC | #23
On Tue, Jun 13, 2023 at 5:23 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>
> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> > >
> > > On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> > > > >
> > > > > Hi Andrii,
> > > > >
> > > > > On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > >
> > > > > > ...
> > > > > > creating new BPF objects like BPF programs, BPF maps, etc.
> > > > >
> > > > > Is there a reason for coupling this only with the userns?
> > > >
> > > > There is no coupling. Without userns it is at least possible to grant
> > > > CAP_BPF and other capabilities from init ns. With user namespace that
> > > > becomes impossible.
> > >
> > > But these are not the same: delegate full cap vs delegate an fd mask?
> >
> > What FD mask are we talking about here? I don't recall us talking
> > about any FD masks, so this one is a bit confusing without more
> > context.
>
> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> allowed operations or something like that).

Ok, so your "FD masks" aka "fd token" is actually a BPF token as
referred to in this patch set, right? Thanks for clarifying!

>
> So I want the possibility to delegate the fd token in the init userns.
>

So as it is right now, BPF token has no association with userns, so
yes, you can delegate it in init userns. It's just a kernel object
with its own FD, which you pass to bpf() syscall operations.
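
And mechanically, handing that FD to another process is the usual
SCM_RIGHTS dance, nothing token-specific about it. A minimal sketch
(error handling omitted):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* send an FD (e.g. a BPF token FD) over a connected unix socket */
static int send_fd(int sock, int fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* kernel installs a dup'ed FD in the receiver */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}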

> > >
> > > One can argue unprivileged in the init userns is the same as privileged in
> > > a nested userns.
> > > Getting to delegate the fd in the init userns, then in nested ones, seems logical...
> >
> > Again, sorry, I'm not following. Can you please elaborate what you mean?
>
> I mean, can we use the fd token in the init user namespace too? Not
> only in the nested user namespaces but in the first one? Sorry, I
> didn't check the code.

Yes, absolutely.

[...]
>
> I'm not referring to namespacing BPF, but to the same token being able
> to fly between containers...
> More or less the problems mentioned by Casey:
> https://lore.kernel.org/bpf/20230602150011.1657856-19-andrii@kernel.org/T/#m005dfd937e4fff7a8cc35036f0ce38281f01e823
>
> I think that a token or the fd should be part of the bpffs and should
> not be shared between containers or across namespaces by default
> without control... hence the suggested protection:
> https://lore.kernel.org/bpf/CAEf4BzazbMqAh_Nj_geKNLshxT+4NXOCd-LkZ+sRKsbZAJ1tUw@mail.gmail.com/T/#m217d041d9ef9e02b598d5f0e1ff61043aeae57fd
>

Ok, cool, thanks for clarifying! I think we are getting somewhere in
this discussion. It seems like you are not worried about the BPF token
concept per se, but rather that it's not bound to a namespace and thus
can be "leaked" outside of the intended container. Got it. This makes
it more concrete to talk about, but I'll reply in the email to
Christian, to keep my reply in one place.

>
> > So I'd rather us not pretend we can do something that we actually
> > cannot enforce.
>
> Actually it is to protect against accidental token sharing or abuse...
> so completely different things.
>

Ok, got it. I was worried that there is a perception that BPF token
allows to sandbox BPF application somehow (which is not the case), so
wanted to make sure we are not conflating things. With your latest
reply it's clear that the problem that most of the discussion is
revolving around is containing BPF token *sharing* within the
container.


[...]
> Sorry about that. Quickly put: you may restrict the types of bpf
> programs; you may disable or no-op probes if they are running without
> a process context, or if the triggered probe is owned by root or by a
> specific uid; or if the process is under a specific cgroup hierarchy,
> etc... Are the above possible?

Yes, about restricting BPF program types. Definitely "no" for "probes
if they are running without a process context, or if the triggered
probe is owned by root or by a specific uid". "Maybe" for "under a
specific cgroup hierarchy", which we could add in some form, but we
can only control where a BPF program is attached. Still, nothing will
prevent a BPF program from reading random kernel memory. But at least
such BPF programs won't be able to control, say, network traffic of
unintended cgroups. That last part is not implemented in this patch
set, though, and should be discussed separately.

>
>
> > > Again I think LSM and bpf+lsm should have the final word on this too...
> > >
> >
> > Yes, I also think that having LSM on top is beneficial. But not a
> > strict requirement and more or less orthogonal.
>
> I do think there should be LSM hooks to tighten this, as LSMs have
> more context outside of BPF...

Agreed, but it should be added on top as a separate follow up patch set.
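
Something along those lines could already be prototyped today against
the existing generic "bpf" LSM hook (security_bpf), e.g. the sketch
below, with an example policy only; any token-specific hooks would be
part of that follow up and are not shown here.

/* sketch: BPF LSM program gating bpf() syscall commands via the
 * existing security_bpf() hook */
#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/bpf")
int BPF_PROG(gate_bpf_syscall, int cmd, union bpf_attr *attr,
	     unsigned int size)
{
	/* example policy: deny program loads outright, allow the rest */
	if (cmd == BPF_PROG_LOAD)
		return -EPERM;
	return 0;
}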

[...]
> I think it's better the other way around: since bpffs is per
> superblock and separately mounted, it is already solved; you just get
> that special fd from the fs and pass it...
>

Ok, I see your point. I have a slightly different proposal for some
parts of it, but I'll explain in my reply to Christian.

[...]
> I'm not convinced about the use case of sharing BPF tokens between
> containers or services...
>
> Every container or service has its own separate bpffs; what's the
> point of pinning a shared token created by a different container,
> compared to mounting a separate bpffs with an fd token prepared to be
> used for that specific container?
>
> Then the container/service can delegate it to child processes, etc...
> but sharing between containers, and crossing user namespaces and mount
> namespaces of such containers where bpffs is already separate in that
> context? I don't see the point, and it just opens the door to token
> misuse...
>

I don't have a specific use case or need for this. It's more a matter
of principle that an API should not assume or dictate how exactly
user-space is going to use it, so I'd say we shouldn't prevent
whatever crazy scenario doesn't violate common sense.

But I get that lots of people are concerned about BPF token leaking
into unintended neighboring containers, so maybe we should bake in a
mechanism to make this impossible. Again, let's talk in the next
reply.

[...]
Andrii Nakryiko June 15, 2023, 10:48 p.m. UTC | #24
On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
[...]
> > The credentials, capabilities, keyring, different namespaces, etc. are
> > all attached to the owning user namespace. If the BPF subsystem goes
> > its own way and creates a token to split up CAP_BPF without following
> > that model, then it's definitely a BPF subsystem problem... I don't
> > recommend that.
> >
> > It feels like this is going more toward a system-wide approach to
> > opening up BPF functionality, which ultimately clashes with the
> > argument: delegate a subset of BPF functionality to a *trusted*
> > unprivileged application. My reading of delegation is within a
> > container/service hierarchy, nothing more.
>
> You're making the exact arguments that Lennart, Aleksa, and I have been
> making in the LSFMM presentation about this topic. It's even recorded:

Alright, so (I think) I get a pretty good feel now for what the main
concerns are, and why people are trying to push this to be an FS. And
it's not so much that BPF token grants bpf() syscall usage to unpriv
(but trusted) workloads, or that BPF itself is not namespaceable. The
main worry is that a BPF token, once issued, could be
illegally/uncontrollably passed outside of a container, intentionally
or not. And by having this association with a mount namespace (through
BPF FS) we automatically limit the sharing to only containers that
have access to that BPF FS instance.

So I agree that it makes sense to have this mount namespace
association, but I would also like to keep BPF token a separate
entity from BPF FS itself, and keep the ability to have multiple
different BPF tokens exposed in a single BPF FS instance. I think the
latter is important.

So how about this slight modification: when a BPF token is created
using the BPF_TOKEN_CREATE command, the user has to provide an FD for
the "associated" BPF FS instance (superblock). What that does is allow
a BPF token to be created with its BPF FS and/or mount namespace
association set in stone. After that, the BPF token can only be pinned
in that BPF FS instance and cannot leave the boundaries of that mount
namespace (specific details to be worked out; this is a new area for
me, so I'm sorry if I'm missing nuances).
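
In rough uapi terms, I'm thinking of something like the sketch below;
the token_create field names (bpffs_fd in particular) are illustrative
only, since this is exactly the part to be worked out.

/* hypothetical sketch of the proposed BPF_TOKEN_CREATE flow */
union bpf_attr attr;
int token_fd;

memset(&attr, 0, sizeof(attr));
attr.token_create.bpffs_fd = bpffs_fd;		/* ties the token to this BPF FS instance */
attr.token_create.allow_cmds = allow_cmds_mask;	/* delegated subset of bpf() commands */

token_fd = syscall(SYS_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));

/* the resulting token could then be pinned only inside that same
 * BPF FS instance / mount namespace */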

What this slight tweak gives us is that we can still have multiple BPF
token instances within a single BPF FS. A token is still
pinnable/gettable through the common bpf() syscall BPF_OBJ_PIN and
BPF_OBJ_GET commands. You can still have more nuanced file
permissions, and getting a BPF token can be controlled further through
LSM. Also, we still get to use an extensible and familiar (to BPF
users) bpf_attr binary approach. Basically, it is very much native to
the BPF subsystem, but it is mount namespace-bound, as was requested
by proponents of merging BPF token and BPF FS together.

I assume that this BPF FS fd can be fetched using fsopen() or fspick()
syscalls, is that right?
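
I.e., something like the sketch below (raw syscall() since libc
wrappers may be missing); whether such a context FD is the right
handle to hand to BPF_TOKEN_CREATE is exactly what I'm asking.

#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* pick an existing, already-mounted BPF FS instance, e.g. "/sys/fs/bpf" */
int pick_bpffs(const char *path)
{
	return syscall(SYS_fspick, AT_FDCWD, path, FSPICK_CLOEXEC);
}

/* or create a new, not-yet-mounted BPF FS superblock context */
int open_new_bpffs(void)
{
	return syscall(SYS_fsopen, "bpf", FSOPEN_CLOEXEC);
}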

WDYT? Does that sound like it would address all the above concerns?
Please point to any important details I might be missing (as I
mentioned, very unfamiliar territory).

>
> https://youtu.be/4CCRTWEZLpw?t=1546
>
> So we fully agree with you here.

I actually just rewatched that entire discussion. :) And after talking
about BPF token at length in the halls of the conference and in email
discussions on this patch set, it was very useful to relisten (again)
to all the finer points that were made back then. Thanks for the
reminder and the link.
Andrii Nakryiko June 15, 2023, 10:55 p.m. UTC | #25
On Wed, Jun 14, 2023 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
[...]
>
> Trust is not binary, though. "Do I trust this application to perform
> this specific action" is different from "do I trust this application to
> perform any action in the future". A security mechanism should grant
> the minimum privileges required to perform the operation; this token
> thing encourages (defaults to) broader grants, which is worrisome.

BPF token defaults to not allowing anything, unless you explicitly
allow commands/progs/maps. If you don't set allow_cmds, you literally
get a useless BPF token that grants you nothing.
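
In other words, roughly (a sketch; allow_cmds semantics as described
above, but the exact field layout and bit encoding are illustrative):

/* a token created without any allowed commands delegates nothing */
union bpf_attr attr;
int token_fd;

memset(&attr, 0, sizeof(attr));
/* attr.token_create.allow_cmds deliberately left at 0 */
token_fd = syscall(SYS_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
/* token_fd is a valid FD, but using it grants no bpf() commands */

/* explicit opt-in is required for every delegated command */
memset(&attr, 0, sizeof(attr));
attr.token_create.allow_cmds = (1ULL << BPF_MAP_CREATE) |
			       (1ULL << BPF_PROG_LOAD);
token_fd = syscall(SYS_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));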

[...]
Andy Lutomirski June 19, 2023, 5:40 p.m. UTC | #26
On Fri, Jun 9, 2023, at 12:08 PM, Andrii Nakryiko wrote:
> On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
>> > This patch set introduces new BPF object, BPF token, which allows to delegate
>> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
>> > systemd or any other container manager) to a *trusted* unprivileged
>> > application. Trust is the key here. This functionality is not about allowing
>> > unconditional unprivileged BPF usage. Establishing trust, though, is
>> > completely up to the discretion of respective privileged application that
>> > would create a BPF token.
>> >
>>
>> I skimmed the description and the LSFMM slides.
>>
>> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.
>>
>> Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?
>
> Yes?.. In the sense that it is possible to create BPF programs and BPF
> maps from inside the container (with BPF token). Right now under user
> namespace it's impossible no matter what you do.

I have no problem with creating BPF maps inside a container, but I think the maps should *be in the container*.

My series wasn’t about unprivileged BPF per se.  It was about updating the existing BPF permission model so that it made sense in a context in which it had multiple users that didn’t trust each other.

>
>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>
> BPF is still a privileged thing. You can't just say that any
> unprivileged application should be able to use BPF. That's why BPF
> token is about trusting unpriv application in a controlled environment
> (production) to not do something crazy. It can be enforced further
> through LSM usage, but in a lot of cases, when dealing with internal
> production applications it's enough to have a proper application
> design and rely on code review process to avoid any negative effects.

We really shouldn’t be creating new kinds of privileged containers that do uncontained things.

If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.

>
> So privileged daemon (container manager) will be configured with the
> knowledge of which services/containers are allowed to use BPF, and
> will grant BPF token only to those that were explicitly allowlisted.
Andrii Nakryiko June 21, 2023, 11:48 p.m. UTC | #27
On Mon, Jun 19, 2023 at 10:40 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Fri, Jun 9, 2023, at 12:08 PM, Andrii Nakryiko wrote:
> > On Fri, Jun 9, 2023 at 11:32 AM Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >> On Wed, Jun 7, 2023, at 4:53 PM, Andrii Nakryiko wrote:
> >> > This patch set introduces new BPF object, BPF token, which allows to delegate
> >> > a subset of BPF functionality from privileged system-wide daemon (e.g.,
> >> > systemd or any other container manager) to a *trusted* unprivileged
> >> > application. Trust is the key here. This functionality is not about allowing
> >> > unconditional unprivileged BPF usage. Establishing trust, though, is
> >> > completely up to the discretion of respective privileged application that
> >> > would create a BPF token.
> >> >
> >>
> >> I skimmed the description and the LSFMM slides.
> >>
> >> Years ago, I sent out a patch set to start down the path of making the bpf() API make sense when used in less-privileged contexts (regarding access control of BPF objects and such).  It went nowhere.
> >>
> >> Where does BPF token fit in?  Does a kernel with these patches applied actually behave sensibly if you pass a BPF token into a container?
> >
> > Yes?.. In the sense that it is possible to create BPF programs and BPF
> > maps from inside the container (with BPF token). Right now under user
> > namespace it's impossible no matter what you do.
>
> I have no problem with creating BPF maps inside a container, but I think the maps should *be in the container*.
>
> My series wasn’t about unprivileged BPF per se.  It was about updating the existing BPF permission model so that it made sense in a context in which it had multiple users that didn’t trust each other.

I don't think it's possible with BPF, in principle, as I mentioned in
the cover letter. Even if some particular types of programs could be
"contained" in some sense, in general BPF is too global by its nature
(it observes everything in kernel memory, it can influence system-wide
behaviors, etc).

>
> >
> >> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >
> > BPF is still a privileged thing. You can't just say that any
> > unprivileged application should be able to use BPF. That's why BPF
> > token is about trusting unpriv application in a controlled environment
> > (production) to not do something crazy. It can be enforced further
> > through LSM usage, but in a lot of cases, when dealing with internal
> > production applications it's enough to have a proper application
> > design and rely on code review process to avoid any negative effects.
>
> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>
> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.

Please see Hao's reply ([0]) about his and Google's (not so rosy)
experiences with building and using such BPF proxy. We (Meta)
internally didn't go this route at all and strongly prefer not to.
There are lots of downsides and complications to having a BPF proxy.
In the end, this is just shuffling around where the decision about
trusting a given application with BPF access is being made. BPF proxy
adds lots of unnecessary logistical, operational, and development
complexity, but doesn't magically make anything safer.

  [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/

>
> >
> > So privileged daemon (container manager) will be configured with the
> > knowledge of which services/containers are allowed to use BPF, and
> > will grant BPF token only to those that were explicitly allowlisted.
>
Maryam Tahhan June 22, 2023, 8:22 a.m. UTC | #28
On 22/06/2023 00:48, Andrii Nakryiko wrote:
>
>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>>> BPF is still a privileged thing. You can't just say that any
>>> unprivileged application should be able to use BPF. That's why BPF
>>> token is about trusting unpriv application in a controlled environment
>>> (production) to not do something crazy. It can be enforced further
>>> through LSM usage, but in a lot of cases, when dealing with internal
>>> production applications it's enough to have a proper application
>>> design and rely on code review process to avoid any negative effects.
>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>>
>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> Please see Hao's reply ([0]) about his and Google's (not so rosy)
> experiences with building and using such BPF proxy. We (Meta)
> internally didn't go this route at all and strongly prefer not to.
> There are lots of downsides and complications to having a BPF proxy.
> In the end, this is just shuffling around where the decision about
> trusting a given application with BPF access is being made. BPF proxy
> adds lots of unnecessary logistical, operational, and development
> complexity, but doesn't magically make anything safer.
>
>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>
Apologies for being blunt, but the token approach to me seems to be a
workaround providing the right level/classification for a pod/container
in order to say you support unprivileged containers using eBPF. I think
if your container needs to do privileged things, it should have and be
classified with the right permissions (privileges) to do what it needs
to do.

The `proxy BPF on behalf of the container` approach works for containers
that don't need to do privileged BPF operations.

I have to say that the `proxy BPF on behalf of the container` approach
meets the needs of unprivileged pods, while giving CAP_BPF to the
applications meets the needs of those pods that need to do
privileged/BPF things without any tokens. Ultimately you are trusting
these apps in the same way as if you were granting a token.


>>> So privileged daemon (container manager) will be configured with the
>>> knowledge of which services/containers are allowed to use BPF, and
>>> will grant BPF token only to those that were explicitly allowlisted.
Andy Lutomirski June 22, 2023, 4:49 p.m. UTC | #29
On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
> On 22/06/2023 00:48, Andrii Nakryiko wrote:
>>
>>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>>>> BPF is still a privileged thing. You can't just say that any
>>>> unprivileged application should be able to use BPF. That's why BPF
>>>> token is about trusting unpriv application in a controlled environment
>>>> (production) to not do something crazy. It can be enforced further
>>>> through LSM usage, but in a lot of cases, when dealing with internal
>>>> production applications it's enough to have a proper application
>>>> design and rely on code review process to avoid any negative effects.
>>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>>>
>>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
>> Please see Hao's reply ([0]) about his and Google's (not so rosy)
>> experiences with building and using such BPF proxy. We (Meta)
>> internally didn't go this route at all and strongly prefer not to.
>> There are lots of downsides and complications to having a BPF proxy.
>> In the end, this is just shuffling around where the decision about
>> trusting a given application with BPF access is being made. BPF proxy
>> adds lots of unnecessary logistical, operational, and development
>> complexity, but doesn't magically make anything safer.
>>
>>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>>
> Apologies for being blunt, but  the token approach to me seems to be a 
> work around providing the right level/classification for a pod/container 
> in order to say you support unprivileged containers using eBPF. I think 
> if your container needs to do privileged things it should have and be 
> classified with the right permissions (privileges) to do what it needs 
> to do.

Bluntness is great.

I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.

"the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"

That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.

"the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"

The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.

This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.

"the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"

My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.  I even *wrote the code*.  But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
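
Today's closest approximation is file permissions on BPF FS pins, which
only covers pinned objects -- a minimal sketch, with the pin path made
up:

  /* Illustrative: pin a map into BPF FS and apply ordinary ownership
   * and mode bits to the pin; BPF_OBJ_GET checks these file
   * permissions when someone tries to reopen the map. */
  #include <bpf/bpf.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int share_map(int map_fd, uid_t uid, gid_t gid)
  {
      const char *path = "/sys/fs/bpf/mymap";   /* made-up pin path */

      if (bpf_obj_pin(map_fd, path))
          return -1;
      if (chown(path, uid, gid))
          return -1;
      return chmod(path, 0640);                 /* owner rw, group r */
  }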

Please try harder.
Andrii Nakryiko June 22, 2023, 6:20 p.m. UTC | #30
On Thu, Jun 22, 2023 at 1:23 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>
> On 22/06/2023 00:48, Andrii Nakryiko wrote:
> >
> >>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >>> BPF is still a privileged thing. You can't just say that any
> >>> unprivileged application should be able to use BPF. That's why BPF
> >>> token is about trusting unpriv application in a controlled environment
> >>> (production) to not do something crazy. It can be enforced further
> >>> through LSM usage, but in a lot of cases, when dealing with internal
> >>> production applications it's enough to have a proper application
> >>> design and rely on code review process to avoid any negative effects.
> >> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
> >>
> >> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> > Please see Hao's reply ([0]) about his and Google's (not so rosy)
> > experiences with building and using such BPF proxy. We (Meta)
> > internally didn't go this route at all and strongly prefer not to.
> > There are lots of downsides and complications to having a BPF proxy.
> > In the end, this is just shuffling around where the decision about
> > trusting a given application with BPF access is being made. BPF proxy
> > adds lots of unnecessary logistical, operational, and development
> > complexity, but doesn't magically make anything safer.
> >
> >    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
> >
> Apologies for being blunt, but  the token approach to me seems to be a
> work around providing the right level/classification for a pod/container
> in order to say you support unprivileged containers using eBPF. I think
> if your container needs to do privileged things it should have and be
> classified with the right permissions (privileges) to do what it needs
> to do.

For one, when user namespaces are involved, there is no BPF use at
all, no matter how privileged you want to mark the container. I
mentioned this in the cover letter. Now, the claim is that user
namespaces are indeed useful and necessary, and yet we also want such
user-namespaced applications to be able to use BPF.

Currently there is no solution to that. And an external BPF service is
not a great one; see [0] for real-world user feedback.

  [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/


>
> The  proxy BPF on behalf of the container approach works for containers
> that don't need to do privileged BPF operations.

BPF usage *is privileged* in all but some tiny use cases that are OK
with heavily limited unprivileged BPF functionality (and even then the
recommendation is to disable unprivileged BPF altogether). Whether you
proxy such privileged BPF usage through an external application or
grant a BPF token to the application, it's the same category:
someone has to decide to trust the application to perform privileged
BPF operations.

And the only debatable thing here is whether the application itself
should do bpf() syscalls directly and be able to use the entire BPF
ecosystem of libraries, tools, techniques, and approaches. Or do we go
and rewrite the world to use some RPC-based proxy to the bpf() syscall?

And to put it bluntly, the latter is not a realistic (or even good) option.

>
> I have to say that  the `proxy BPF on behalf of the container` meets the
> needs of unprivileged pods and at the same time giving CAP_BPF to the

I tried to make it very clear in the cover letter, but granting
CAP_BPF under a user namespace means precisely nothing. CAP_BPF is
only useful in the init namespace.

> applications meets the needs of these PODs that need to do
> privileged/bpf things without any tokens. Ultimately you are trusting
> these apps in the same way as if you were granting a token.

Yes, absolutely. As I mentioned very explicitly, it's a question of
trusting the application. Service vs. token is an implementation
detail, but one that has huge implications for how applications are
built, tested, versioned, deployed, etc.

>
>
> >>> So privileged daemon (container manager) will be configured with the
> >>> knowledge of which services/containers are allowed to use BPF, and
> >>> will grant BPF token only to those that were explicitly allowlisted.
>
>
Andrii Nakryiko June 22, 2023, 6:40 p.m. UTC | #31
On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>

Please avoid replying in HTML.

> On 22/06/2023 17:49, Andy Lutomirski wrote:
>
> Apologies for being blunt, but  the token approach to me seems to be a
> work around providing the right level/classification for a pod/container
> in order to say you support unprivileged containers using eBPF. I think
> if your container needs to do privileged things it should have and be
> classified with the right permissions (privileges) to do what it needs
> to do.
>
> Bluntness is great.
>
> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>
> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>
> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>
> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>
> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.  I even *wrote the code*.  But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
>
> Please try harder.
>
> I'm going to be honest, I can't tell if we are in agreement or not :). I'm also going to use pod and container interchangeably throughout my response (bear with me).
>
>
> So just to clarify a few things on my end.  When I said "level/classification" I meant privileges --> A container should have the right level of privileges assigned to it for what it's trying to do in the K8s scenario through its pod spec. To me it seems like BPF token is a way to work around the permissions assigned to a container in K8s, for example: with bpf_token I'm marking a pod as unprivileged but then under the hood, through another service, I'm giving it a token to do more than what was specified in its pod spec. Yeah I have a separate service controlling the tokens but something about it just seems not right (to me). If CAP_BPF is too broad, can we break it down further into something more granular? Something that can be assigned to the container through the pod spec rather than a separate service that seems to be doing things under the hood? This doesn't even start to
solve the problem, I know...

Disclaimer: I don't know anything about Kubernetes, so don't expect me
to reply with correct terminology or a detailed understanding of how
containers are configured.

But on a more generic and conceptual level, it seems like you are
making some implementation assumptions and arguing based on that.

Like, why can't the container spec have native support for "granted BPF
functionality"? Why would a BPF token have to be granted through some
separate service rather than be integrated into whatever Kubernetes
"container manager" functionality there is, as a natural extension of
the spec?

As for CAP_BPF being too broad: it is broad, yes. If you have good
ideas on how to break it down some more -- please propose them. But
this is all orthogonal, because the blocking problem is the fundamental
incompatibility of user namespaces (and their implied isolation and
sandboxing of workloads) with BPF functionality, which is global by its
very nature. The latter is unavoidable in principle.

No matter how much you break down CAP_BPF, you can't enforce that a BPF
program won't interfere with applications in other containers. Or that
it won't "spy" on them. It's just not what BPF can enforce in
principle.

So that comes back down to a question of trust and then controlled
delegation of BPF functionality. You trust a workload with BPF usage
because you reviewed the BPF code, workload, testing, etc.? Grant it a
BPF token and let that container use a limited subset of BPF. Employ
BPF LSM to further restrict it beyond what the BPF token can control.

You cannot trust an application not to do something harmful? Then you
shouldn't grant it CAP_BPF in the init namespace, nor a BPF token in a
user namespace. That's it. Pick your poison.

But all this cannot be mechanically decided or enforced. There have to
be some humans involved in making these decisions. The kernel's job is
to provide building blocks to grant and control BPF functionality to
the extent that it is technically possible.


>
> I understand the difficulties with trying to deploy BPF in K8s and the concerns around privilege escalation for the containers. I understand not all use cases are created equally but I think this falls into at least 2 categories:
>
> - Pods/Containers that need to do privileged BPF ops but not under a CAP_BPF umbrella --> sure we need something for this.
> - Pods/Containers that don't need to do any privileged BPF ops but still use BPF --> these are happy with a proxy service loading/unloading the bpf progs, creating maps and pinning them... But even in this scenario we need something to isolate the pinned maps/progs by different apps (why not DAC rules?), even better if the maps are in the container...

The above doesn't make much sense to me, sorry. If the application is
OK using unprivileged BPF, there is no problem there. They can do that
today already, and no BPF proxy or BPF token is involved.

As for "something to isolate the pinned maps/progs by different apps
(why not DAC rules?)", there is no such thing, as I've explained
already.

I can install a sched_switch raw_tracepoint BPF program (if I'm allowed
to), and that program has system-wide observability. It cannot be
bound to an application. You can't just say "trigger this sched_switch
program only for scheduler decisions within my container". When you
actually start thinking about just that one example, even assuming we
add some per-container filter in the kernel to not trigger your
program, then what do we do when we switch from process A in container
X to process B in container Y? Is that event belonging to container X?
Or container Y? How can you prevent a program from reading a task's
data that doesn't belong to your container, when both are inputs to
this single tracepoint event?

Hopefully you can see where I'm going with this. And this is just one
random tiny example. We can think up tons of other cases to prove BPF
is not isolatable to any sort of "container".
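
For concreteness, such a program is just a few lines using standard
libbpf conventions (illustrative; vmlinux.h is generated from kernel
BTF):

  /* sched_switch.bpf.c -- sees every context switch on the host,
   * no matter which container the tasks involved belong to. */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_core_read.h>

  char LICENSE[] SEC("license") = "GPL";

  SEC("raw_tp/sched_switch")
  int handle_switch(struct bpf_raw_tracepoint_args *ctx)
  {
      /* TP_PROTO(bool preempt, struct task_struct *prev,
       *          struct task_struct *next) */
      struct task_struct *prev = (struct task_struct *)ctx->args[1];
      struct task_struct *next = (struct task_struct *)ctx->args[2];

      /* prev/next can be tasks from *any* container; nothing scopes
       * this attachment to one namespace. */
      bpf_printk("switch %d -> %d",
                 BPF_CORE_READ(prev, pid), BPF_CORE_READ(next, pid));
      return 0;
  }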

>
> Anyway - I hope this clarifies my original intent - which is that a proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems, the more we can stick to tried and trusted mechanisms the better.

I disagree. BPF proxy complicates logistics, operations, and developer
experience, without resolving the issue of determining trust and the
need to delegate or proxy BPF functionality.
Andrii Nakryiko June 22, 2023, 7:05 p.m. UTC | #32
On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
> > On 22/06/2023 00:48, Andrii Nakryiko wrote:
> >>
> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >>>> BPF is still a privileged thing. You can't just say that any
> >>>> unprivileged application should be able to use BPF. That's why BPF
> >>>> token is about trusting unpriv application in a controlled environment
> >>>> (production) to not do something crazy. It can be enforced further
> >>>> through LSM usage, but in a lot of cases, when dealing with internal
> >>>> production applications it's enough to have a proper application
> >>>> design and rely on code review process to avoid any negative effects.
> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
> >>>
> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> >> Please see Hao's reply ([0]) about his and Google's (not so rosy)
> >> experiences with building and using such BPF proxy. We (Meta)
> >> internally didn't go this route at all and strongly prefer not to.
> >> There are lots of downsides and complications to having a BPF proxy.
> >> In the end, this is just shuffling around where the decision about
> >> trusting a given application with BPF access is being made. BPF proxy
> >> adds lots of unnecessary logistical, operational, and development
> >> complexity, but doesn't magically make anything safer.
> >>
> >>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
> >>
> > Apologies for being blunt, but  the token approach to me seems to be a
> > work around providing the right level/classification for a pod/container
> > in order to say you support unprivileged containers using eBPF. I think
> > if your container needs to do privileged things it should have and be
> > classified with the right permissions (privileges) to do what it needs
> > to do.
>
> Bluntness is great.
>
> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.

BPF is not "anything else", it's important to understand that BPF is
inherently not compratmentalizable. And it's vast and generic in its
capabilities. This changes everything. So your analogies are
misleading.

>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>
> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>
> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>
> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>
> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>
> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.

Can you apply DAC rules to which kernel events BPF program can be run
on? Can you apply DAC rules to which in-kernel data structures a BPF
program can look at and make sure that it doesn't access a
task/socket/etc that "belongs" to some other container/user/etc?

Can we limit XDP or AF_XDP BPF programs from seeing and controlling
network traffic that will be eventually routed to a container that XDP
program "should not" have access to? Without making everything so slow
that it's useless?

> I even *wrote the code*.

Did you submit it upstream for review and wide discussion? Did you
test it and integrate it with production workloads to prove that your
solution is actually a viable real-world solution and not a toy?
Writing the code doesn't mean solving the problem.

> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.

I won't speak on behalf of the entire BPF community, but I'm trying to
explain that BPF cannot be reasonably sandboxed and has to be
privileged due to its global nature. And I haven't yet seen any
realistic counter-proposal to change that. And it's not about
ownership of the BPF map or BPF program, it's way beyond that...

>
> Please try harder.

Well, maybe there is something in that "some reason" you mentioned
above that you so quickly dismissed?
Maryam Tahhan June 22, 2023, 9:04 p.m. UTC | #33
On Thu, Jun 22, 2023 at 7:40 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
> >
>
> Please avoid replying in HTML.
>

Sorry.

[...]

>
> Disclaimer: I don't know anything about Kubernetes, so don't expect me
> to reply with correct terminology or a detailed understanding of how
> containers are configured.
>
> But on a more generic and conceptual level, it seems like you are
> making some implementation assumptions and arguing based on that.
>

Firstly, thank you for taking the time to respond and explain. I can see
where you are coming from.

Yeah, admittedly I did make a few assumptions. I was thrown by the reference
to `unprivileged` processes in the cover letter. It seems like this is a way to
grant namespaced BPF permissions to a process (my gross
oversimplification - sorry).
Looking back throughout your responses there's nothing unprivileged here.

[...]


> Hopefully you can see where I'm going with this. And this is just one
> random tiny example. We can think up tons of other cases to prove BPF
> is not isolatable to any sort of "container".
>
> >
> > Anyway - I hope this clarifies my original intent - which is proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems the more we can stick to tried and trusted mechanisms the better.
>
> I disagree. BPF proxy complicates logistics, operations, and developer
> experience, without resolving the issue of determining trust and the
> need to delegate or proxy BPF functionality.

I appreciate your viewpoint. I just don't think this is a
one-solution-fits-every-scenario situation. For example, in the case of
AF_XDP, I'd like to be able to run my containers without any additional
privileges. I've been working on a device plugin for Kubernetes whose
job is to provision netdevs with an XDP redirect program (then later
there's a CNI that moves the netdev into the pod network namespace).
Originally I was using bpf locally in the device plugin (to load the
bpf program and get the XSK map fd) and SCM_RIGHTS to pass the XSK map
fd over UDS, but honestly it was relatively cumbersome from an app
development POV, very easy to get wrong, and keeping up with the latest
bpf API changes started to become an issue. If I wanted to add more
interesting bpf programs I had to do a full recompile...
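
(For reference, the SCM_RIGHTS part boils down to something like the
sketch below; 'sock' is assumed to be a connected AF_UNIX socket.)

  /* Illustrative: pass a map fd (e.g., the XSK map) to another
   * process over a Unix domain socket. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  static int send_fd(int sock, int fd)
  {
      char data = 'x';
      struct iovec iov = { .iov_base = &data, .iov_len = 1 };
      union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } u;
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
      return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }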

I've now moved to using bpfd for the loading and unloading of the bpf
program on my behalf. It also comes with a bunch of other advantages,
including being able to update my trusted bpf program transparently to
both the device plugin and my application (I don't have to respin
either of these when I write/want to add a new bpf prog), but mainly I
have a trusted proxy managing bpffs, bpf progs and maps for me. There's
still more work to do here...

I understand this is a much simplified scenario, and I'm sure I can
think of several more where a proxy is useful. All I'm trying to say
is, I'm not sure there's a one-size-fits-all solution for these issues.

Thanks
Maryam
Andrii Nakryiko June 22, 2023, 11:35 p.m. UTC | #34
On Thu, Jun 22, 2023 at 2:04 PM Maryam Tahhan <mtahhan@redhat.com> wrote:
>
> On Thu, Jun 22, 2023 at 7:40 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
> > >
> >
> > Please avoid replying in HTML.
> >
>
> Sorry.

No worries, the problem is that the mailing list filters out such
messages. So if you go to [0] and scroll to the bottom of the page,
you'll see that your email is not in the lore archive. People not
CC'ed directly will only see what you wrote through my reply quoting
your email.

  [0] https://lore.kernel.org/bpf/CAFdtZitYhOK4TzAJVbFPMfup_homxSSu3Q8zjJCCiHCf22eJvQ@mail.gmail.com/#t

>
> [...]
>
> >
> > Disclaimer: I don't know anything about Kubernetes, so don't expect me
> > to reply with correct terminology or a detailed understanding of how
> > containers are configured.
> >
> > But on a more generic and conceptual level, it seems like you are
> > making some implementation assumptions and arguing based on that.
> >
>
> Firstly, thank you for taking the time to respond and explain. I can see
> where you are coming from.
>
> Yeah, admittedly I did make a few assumptions. I was thrown by the reference
> to `unprivileged` processes in the cover letter. It seems like this is a way to
> grant namespaced BPF permissions to a process (my gross
> oversimplification - sorry).

Yep, with the caveat that BPF functionality itself cannot be
namespaced (i.e., contained within the container), so this has to be
granted by a fully privileged process/proxy based on trusting the
workload to not do anything harmful.


> Looking back throughout your responses there's nothing unprivileged here.
>
> [...]
>
>
> > Hopefully you can see where I'm going with this. And this is just one
> > random tiny example. We can think up tons of other cases to prove BPF
> > is not isolatable to any sort of "container".
> >
> > >
> > > Anyway - I hope this clarifies my original intent - which is that a proxy at least starts to solve one part of the puzzle. Whatever approach(es) we take to solve the rest of these problems, the more we can stick to tried and trusted mechanisms the better.
> >
> > I disagree. BPF proxy complicates logistics, operations, and developer
> > experience, without resolving the issue of determining trust and the
> > need to delegate or proxy BPF functionality.
>
> I appreciate your viewpoint. I just don't think that this is a one
> solution fits every
> scenario situation.

Absolutely. It's also not my intent or goal to kill any sort of BPF
proxy. What I'm trying to convey is that the BPF proxy approach has
severe downsides, depending on application, deployment practices, etc,
etc. It's not always a (good) answer. So I just want to avoid having
the dichotomy of "BPF token or BPF proxy, there could be only one".

> For example in the case of AF_XDP, I'd like to be
> able to run
> my containers without any additional privileges. I've been working on a device
> plugin for Kubernetes whose job is to provision netdevs with an XDP redirect
> program (then later there's a CNI that moves the netdev into the pod network
> namespace).  Originally I was using bpf locally in the device plugin
> (to load the
> bpf program and get the XSK map fd) and SCM rights to pass the XSK_MAP over
> UDS but honestly it was relatively cumbersome from an app development POV, very
> easy to get wrong, and trying to keep up with the latest bpf api
> changes started to
> become an issue. If I wanted to add more interesting bpf programs I
> had to do a full
> recompile...
>
> I've now moved to using bpfd, for the loading and unloading of the bpf
> program on my behalf,
> it also comes with a bunch of other advantages including being able to
> update my trusted bpf
> program transparently to both the device plugin my application (I
> don't have to respin this either
> when I write/want to add a new bpf prog), but mainly I have a trusted
> proxy managing bpffs, bpf progs and maps for me. There's still more
> work to do here...
>

It's a spectrum, and from my observations networking BPF programs lend
themselves more easily to this model of BPF proxy (at least until they
become complicated ensembles of networking and tracing BPF programs).
Very often networking applications can indeed load BPF programs
completely independently from user-space parts, keep them "persisted"
in the kernel, occasionally control them through pinned BPF maps, etc.

But the further you go towards tracing applications, where the BPF
parts are an integral part of the overall user-space application, this
model doesn't work very well. It's much simpler to have BPF parts
embedded, loaded, versioned, initialized and interacted with from
inside the same process. And we have lots of such applications. The
BPF proxy approach is a massive complication for such use cases, with
a bunch of downsides.

> I understand this is a much simplified scenario, and I'm sure I can
> think of several more where a proxy is useful. All I'm trying to say
> is, I'm not sure there's a one-size-fits-all solution for these issues.

100% agree. BPF token won't fit all use cases. And BPF proxy won't fit
all use cases either. Both approaches can and should coexist.

>
> Thanks
> Maryam
>
Andy Lutomirski June 23, 2023, 1:02 a.m. UTC | #35
On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
>
> As for CAP_BPF being too broad: it is broad, yes. If you have good
> ideas on how to break it down some more -- please propose them. But
> this is all orthogonal, because the blocking problem is the fundamental
> incompatibility of user namespaces (and their implied isolation and
> sandboxing of workloads) with BPF functionality, which is global by its
> very nature. The latter is unavoidable in principle.

How, exactly, is BPF global by its very nature?

The *implementation* has some issues with globalness.  Much of it should be fixable.

>
> No matter how much you break down CAP_BPF, you can't enforce that BPF
> program won't interfere with applications in other containers. Or that
> it won't "spy" on them. It's just not what BPF can enforce in
> principle.

The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do.  There are bugs -- I get that.  There are helper functions that are fundamentally global.  But, in the absence of verifier bugs, BPF has actual boundaries to its functionality.

>
> So that comes back down to a question of trust and then controlled
> delegation of BPF functionality. You trust a workload with BPF usage
> because you reviewed the BPF code, workload, testing, etc.? Grant it a
> BPF token and let that container use a limited subset of BPF. Employ
> BPF LSM to further restrict it beyond what the BPF token can control.
>
> You cannot trust an application not to do something harmful? Then you
> shouldn't grant it CAP_BPF in the init namespace, nor a BPF token in a
> user namespace. That's it. Pick your poison.

I think what's lost here is hardening vs restricting intended functionality.

We have access control to restrict intended functionality.  We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it.

BPF needs hardening -- this is well established.  Right now, this is accomplished by restricting it to global root (effectively).  It should have access controls, too, but it doesn't.

>
> But all this cannot be mechanically decided or enforced. There have to
> be some humans involved in making these decisions. The kernel's job is
> to provide building blocks to grant and control BPF functionality to
> the extent that it is technically possible.
>

Exactly.  And it DOES NOT.  bpf maps, etc. do not have sensible access controls.  Things that should not be global are global.  I'm saying the kernel should fix THAT.  Once it's in a state where it's at least credible to allow BPF in a user namespace, then come up with a way to allow it.

> As for "something to isolate the pinned maps/progs by different apps
> (why not DAC rules?)", there is no such thing, as I've explained
> already.
>
> I can install a sched_switch raw_tracepoint BPF program (if I'm allowed
> to), and that program has system-wide observability. It cannot be
> bound to an application.

Great, a real example!

Either:

(a) don't run this in a container.  Have a service for the container to request the help of this program.

(b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container).

> then what do we do when we switch from process A in container
> X to process B in container Y? Is that event belonging to container X?
> Or container Y?

I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality.  If you haven't defined what your program is even supposed to do in a container, don't run it there.


> Hopefully you can see where I'm going with this. And this is just one
> random tiny example. We can think up tons of other cases to prove BPF
> is not isolatable to any sort of "container".

No.  You have not come up with an example of why BPF is not isolatable to a container.  You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security.

Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things.  They're both under the BPF umbrella.  They're not the same thing.

Passing a token into a container that allow that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter.  It makes no sense.
Andy Lutomirski June 23, 2023, 3:28 a.m. UTC | #36
On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote:
> On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
>>
>>
>>
>> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
>> > On 22/06/2023 00:48, Andrii Nakryiko wrote:
>> >>
>> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>> >>>> BPF is still a privileged thing. You can't just say that any
>> >>>> unprivileged application should be able to use BPF. That's why BPF
>> >>>> token is about trusting unpriv application in a controlled environment
>> >>>> (production) to not do something crazy. It can be enforced further
>> >>>> through LSM usage, but in a lot of cases, when dealing with internal
>> >>>> production applications it's enough to have a proper application
>> >>>> design and rely on code review process to avoid any negative effects.
>> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>> >>>
>> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
>> >> Please see Hao's reply ([0]) about his and Google's (not so rosy)
>> >> experiences with building and using such BPF proxy. We (Meta)
>> >> internally didn't go this route at all and strongly prefer not to.
>> >> There are lots of downsides and complications to having a BPF proxy.
>> >> In the end, this is just shuffling around where the decision about
>> >> trusting a given application with BPF access is being made. BPF proxy
>> >> adds lots of unnecessary logistical, operational, and development
>> >> complexity, but doesn't magically make anything safer.
>> >>
>> >>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>> >>
>> > Apologies for being blunt, but  the token approach to me seems to be a
>> > work around providing the right level/classification for a pod/container
>> > in order to say you support unprivileged containers using eBPF. I think
>> > if your container needs to do privileged things it should have and be
>> > classified with the right permissions (privileges) to do what it needs
>> > to do.
>>
>> Bluntness is great.
>>
>> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
>
> BPF is not "anything else", it's important to understand that BPF is
> inherently not compartmentalizable. And it's vast and generic in its
> capabilities. This changes everything. So your analogies are
> misleading.
>

file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc.  They are infinitely extensible.  They work in containers.

What is so special about BPF?

>>
>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>>
>> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>>
>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>>
>> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>>
>> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>>
>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>>
>> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.
>
> Can you apply DAC rules to which kernel events BPF program can be run
> on? Can you apply DAC rules to which in-kernel data structures a BPF
> program can look at and make sure that it doesn't access a
> task/socket/etc that "belongs" to some other container/user/etc?

No, of course not.

If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module.  It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module.

We don't give containers special tokens that let them load arbitrary modules.  We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules.

But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers.  BPF can learn to do this.

>
> Can we limit XDP or AF_XDP BPF programs from seeing and controlling
> network traffic that will be eventually routed to a container that XDP
> program "should not" have access to? Without making everything so slow
> that it's useless?

Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that.  Or a vlan or a macvlan or whatever.  (I'm assuming XDP can be scoped like this.  I'm not that familiar with the details.)
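
For reference, XDP attachment is indeed per-netdev, so the scoping
comes down to which interface you attach to -- a libbpf sketch, with
the interface name made up:

  /* Illustrative: attach an XDP program to one specific interface,
   * e.g. a VF that has been handed to a container. */
  #include <bpf/libbpf.h>
  #include <net/if.h>
  #include <linux/if_link.h>

  int attach_to_container_nic(struct bpf_program *prog)
  {
      int ifindex = if_nametoindex("vf0");      /* made-up name */

      if (!ifindex)
          return -1;
      return bpf_xdp_attach(ifindex, bpf_program__fd(prog),
                            XDP_FLAGS_UPDATE_IF_NOEXIST, NULL);
  }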

>
>> I even *wrote the code*.
>
> Did you submit it upstream for review and wide discussion?

Yes.

> Did you
> test it and integrate it with production workloads to prove that your
> solution is actually a viable real-world solution and not a toy?

I did test it.  I did not integrate it with production workloads.

> Writing the code doesn't mean solving the problem.

Of course not.  My code was a little step in the right direction.  The BPF community was apparently not interested in it. 

>
>> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
>
> I won't speak on behalf of the entire BPF community, but I'm trying to
> explain that BPF cannot be reasonably sandboxed and has to be
> privileged due to its global nature. And I haven't yet seen any
> realistic counter-proposal to change that. And it's not about
> ownership of the BPF map or BPF program, it's way beyond that..
>

It's really, really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created.

If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model.

I'm saying that I think there *can* be a security model.  But until the maintainers start to believe that, there won't be one.
Andy Lutomirski June 23, 2023, 3:10 p.m. UTC | #37
On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>
>> Hopefully you can see where I'm going with this. And this is just one
>> random tiny example. We can think up tons of other cases to prove BPF
>> is not isolatable to any sort of "container".
>
> No.  You have not come up with an example of why BPF is not isolatable 
> to a container.  You have come up with an example of why binding to a 
> sched_switch raw tracepoint does not make sense in a container without 
> additional mechanisms to give it well defined functionality and 
> appropriate security.

Thinking about this some more:

Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.

So here are a couple of possible solutions:

(a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.

(b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.

I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
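
Concretely, the daemon side of (a) might look something like this (a
sketch assuming a libbpf-based daemon; "monitor.bpf.o" and the "events"
map are made-up names):

  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>

  int setup_monitor(void)
  {
      struct bpf_object *obj;
      int map_fd;

      /* Create the map the container will consume events from. */
      map_fd = bpf_map_create(BPF_MAP_TYPE_RINGBUF, "events",
                              0, 0, 512 * 1024, NULL);
      if (map_fd < 0)
          return -1;

      /* Load the pre-approved program and point it at that map. */
      obj = bpf_object__open("monitor.bpf.o");
      if (!obj)
          return -1;
      bpf_map__reuse_fd(bpf_object__find_map_by_name(obj, "events"),
                        map_fd);
      if (bpf_object__load(obj))
          return -1;
      if (!bpf_program__attach(bpf_object__next_program(obj, NULL)))
          return -1;

      /* Hand *only* this map fd into the container (SCM_RIGHTS or a
       * BPF FS pin); the program itself stays outside. */
      return map_fd;
  }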

For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.

And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.

If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.

--Andy
Casey Schaufler June 23, 2023, 4:13 p.m. UTC | #38
On 6/22/2023 8:28 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote:
>> On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>>
>>> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
>>>> On 22/06/2023 00:48, Andrii Nakryiko wrote:
>>>>>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
>>>>>>> BPF is still a privileged thing. You can't just say that any
>>>>>>> unprivileged application should be able to use BPF. That's why BPF
>>>>>>> token is about trusting unpriv application in a controlled environment
>>>>>>> (production) to not do something crazy. It can be enforced further
>>>>>>> through LSM usage, but in a lot of cases, when dealing with internal
>>>>>>> production applications it's enough to have a proper application
>>>>>>> design and rely on code review process to avoid any negative effects.
>>>>>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
>>>>>>
>>>>>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
>>>>> Please see Hao's reply ([0]) about his and Google's (not so rosy)
>>>>> experiences with building and using such BPF proxy. We (Meta)
>>>>> internally didn't go this route at all and strongly prefer not to.
>>>>> There are lots of downsides and complications to having a BPF proxy.
>>>>> In the end, this is just shuffling around where the decision about
>>>>> trusting a given application with BPF access is being made. BPF proxy
>>>>> adds lots of unnecessary logistical, operational, and development
>>>>> complexity, but doesn't magically make anything safer.
>>>>>
>>>>>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
>>>>>
>>>> Apologies for being blunt, but the token approach to me seems to be a
>>>> workaround for providing the right level/classification for a pod/container
>>>> in order to say you support unprivileged containers using eBPF. I think
>>>> if your container needs to do privileged things, it should have and be
>>>> classified with the right permissions (privileges) to do what it needs
>>>> to do.
>>> Bluntness is great.
>>>
>>> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
>> BPF is not "anything else", it's important to understand that BPF is
>> inherently not compartmentalizable. And it's vast and generic in its
>> capabilities. This changes everything. So your analogies are
>> misleading.
>>
> file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc.  They are infinitely extensible.  They work in containers.
>
> What is so special about BPF?
>
>>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
>>>
>>> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
>>>
>>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
>>>
>>> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
>>>
>>> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
>>>
>>> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
>>>
>>> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.
>> Can you apply DAC rules to which kernel events a BPF program can be run
>> on? Can you apply DAC rules to which in-kernel data structures a BPF
>> program can look at and make sure that it doesn't access a
>> task/socket/etc that "belongs" to some other container/user/etc?
> No, of course not.
>
> If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module.  It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module.
>
> We don't give containers special tokens that let them load arbitrary modules.  We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules.
>
> But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers.  BPF can learn to do this.
>
>> Can we limit XDP or AF_XDP BPF programs from seeing and controlling
>> network traffic that will be eventually routed to a container that XDP
>> program "should not" have access to? Without making everything so slow
>> that it's useless?
> Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that.  Or a vlan or a macvlan or whatever.  (I'm assuming XDP can be scoped like this.  I'm not that familiar with the details.)
>
>>> I even *wrote the code*.
>> Did you submit it upstream for review and wide discussion?
> Yes.
>
>> Did you
>> test it and integrate it with production workloads to prove that your
>> solution is actually a viable real-world solution and not a toy?
> I did test it.  I did not integrate it with production workloads.
>
>> Writing the code doesn't mean solving the problem.
> Of course not.  My code was a little step in the right direction.  The BPF community was apparently not interested in it. 
>
>>> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
>> I won't speak on behalf of the entire BPF community, but I'm trying to
>> explain that BPF cannot be reasonably sandboxed and has to be
>> privileged due to its global nature. And I haven't yet seen any
>> realistic counter-proposal to change that. And it's not about
>> ownership of the BPF map or BPF program, it's way beyond that..
>>
> It's really really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created.

Agreed. Complete security denial makes development so much easier.
In the 1980's we were told that there was no way UNIX could ever be
made secure, especially because of IP networking and window systems.
It wasn't easy, what with everybody screaming (often literally) about
the performance impact and code complexity of every single change, no
matter how small.

I'm *not* advocating adopting it, but you could look at the Zephyr
security model as a worked example of a system similar to BPF that
does have a security model. I understand that there are many ways to
argue that this won't work for BPF, or that the model has issues of
its own, but have a look.

https://docs.zephyrproject.org/latest/security/security-overview.html

>
> If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model.
>
> I'm saying that I think there *can* be a security model.  But until the maintainers start to believe that, there won't be one.
Daniel Borkmann June 23, 2023, 10:18 p.m. UTC | #39
On 6/16/23 12:48 AM, Andrii Nakryiko wrote:
> On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote:
>> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
>>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
>>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
>>>>> <andrii.nakryiko@gmail.com> wrote:
>>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Andrii,
>>>>>>>
>>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>>>>>>>>
>>>>>>>> ...
>>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc.
>>>>>>>
>>>>>>> Is there a reason for coupling this only with the userns?
>>>>>>
>>>>>> There is no coupling. Without userns it is at least possible to grant
>>>>>> CAP_BPF and other capabilities from init ns. With user namespace that
>>>>>> becomes impossible.
>>>>>
>>>>> But these are not the same: delegate full cap vs delegate an fd mask?
>>>>
>>>> What FD mask are we talking about here? I don't recall us talking
>>>> about any FD masks, so this one is a bit confusing without more
>>>> context.
>>>
>>> Ah err, sorry yes referring to fd token (which I assumed is a mask of
>>> allowed operations or something like that).
>>>
>>> So I want the possibility to delegate the fd token in the init userns.
>>>
>>>>>
>>>>> One can argue unprivileged in init userns is the same as privileged in
>>>>> nested userns.
>>>>> Getting to delegate fd in init userns, then in nested ones seems logical...
>>>>
>>>> Again, sorry, I'm not following. Can you please elaborate what you mean?
>>>
>>> I mean can we use the fd token in the init user namespace too? not
>>> only in the nested user namespaces but in the first one? Sorry I
>>> didn't check the code.
>>>
> 
> [...]
> 
>>>
>>>>> Having the fd or "token" that gives access rights pinned in two
>>>>> separate bpffs mounts seems too much, it crosses namespaces (mount,
>>>>> userns etc), environments setup by privileged...
>>>>
>>>> See above, there is nothing namespaceable about BPF itself, and BPF
>>>> token as well. If some production setup benefits from pinning one BPF
>>>> token in multiple places, I don't see the problem with that.
>>>>
>>>>>
>>>>> I would just make it per bpffs mount and that's it, nothing more. If a
>>>>> program wants to bind mount it somewhere else then it's not a bpf
>>>>> problem.
>>>>
>>>> And if some application wants to pin BPF token, why would that be BPF
>>>> subsystem's problem as well?
>>>
>>> The credentials, capabilities, keyring, different namespaces, etc are
>>> all attached to the owning user namespace, if the BPF subsystem goes
>>> its own way and creates a token to split up CAP_BPF without following
>>> that model, then it's definitely a BPF subsystem problem...  I don't
>>> recommend that.
>>>
>>> Feels like it's going more toward a system-wide approach, opening BPF
>>> functionality, where ultimately it clashes with the argument: delegate
>>> a subset of BPF functionality to a *trusted* unprivileged application.
>>> My reading of delegation is within a container/service hierarchy
>>> nothing more.
>>
>> You're making the exact arguments that Lennart, Aleksa, and I have been
>> making in the LSFMM presentation about this topic. It's even recorded:
> 
> Alright, so (I think) I get a pretty good feel now for what the main
> concerns are, and why people are trying to push this to be an FS. And
> it's not so much that BPF token grants bpf() syscall usage to unpriv
> (but trusted) workloads or that BPF itself is not namespaceable. The
> main worry is that BPF token, once issued, could be
> illegally/uncontrollably passed outside of the container, intentionally or
> not. And by having this association with mount namespace (through BPF
> FS) we automatically limit the sharing to only containers that have access
> to that BPF FS.

+1

> So I agree that it makes sense to have this mount namespace
> association, but I also would like to keep BPF token to be a separate
> entity from BPF FS itself, and have the ability to have multiple
> different BPF tokens exposed in a single BPF FS instance. I think the
> latter is important.
> 
> So how about this slight modification: when a BPF token is created
> using BPF_TOKEN_CREATE command, the user has to provide an FD for
> "associated" BPF FS instance (superblock). What that does is allows
> BPF token to be created with BPF FS and/or mount namespace association
> set in stone. After that BPF token can only be pinned in that BPF FS
> instance and cannot leave the boundaries of that mount namespace
> (specific details to be worked out, this is a new area for me, so I'm
> sorry if I'm missing nuances).

Given bpffs is not a singleton and there can be multiple bpffs instances
in a container, couldn't we make the token a special bpffs mount/mode?
Something like a single .token file in that mount (for example) which can
be opened and the fd then passed along for prog/map creation? And given
the multiple mounts, this also allows potentially for multiple tokens?
In other words, this is already set up by the container manager when it
sets up mounts rather than later, and the regular bpffs instance is something
separate from all that. Meaning, in your container you get the usual
bpffs instance and then one or more special bpffs instances as tokens
at different paths (and in future they could unlock different subsets of
bpf functionality, for example).
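
Container-side usage could then be as simple as the sketch below (hand-wavy:
the mount path, the .token name and a token_fd opts field are all invented
at this point):

#include <bpf/bpf.h>
#include <fcntl.h>
#include <unistd.h>

int create_map_with_token(void)
{
    /* path and file name invented; this is the special bpffs instance */
    int token_fd = open("/sys/fs/bpf-token/.token", O_RDWR);
    /* assumes a token_fd extension to bpf_map_create_opts */
    LIBBPF_OPTS(bpf_map_create_opts, opts, .token_fd = token_fd);
    int map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "m",
                                sizeof(int), sizeof(long), 128, &opts);

    close(token_fd);
    return map_fd;
}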

Thanks,
Daniel
Toke Høiland-Jørgensen June 23, 2023, 11:07 p.m. UTC | #40
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

>> applications meets the needs of these PODs that need to do
>> privileged/bpf things without any tokens. Ultimately you are trusting
>> these apps in the same way as if you were granting a token.
>
> Yes, absolutely. As I mentioned very explicitly, it's the question of
> trusting the application. Service vs token is an implementation detail, but
> one that has huge implications in how applications are built,
> tested, versioned, deployed, etc.

So one thing that I don't really get is why such a "trusted application"
needs to be run in a user namespace in the first place? If it's trusted,
why not simply run it as a privileged container (without the user
namespace) and grant it the right system-level capabilities, instead of
going to all this trouble just to punch a hole in the user namespace
isolation?

-Toke
Daniel Borkmann June 23, 2023, 11:23 p.m. UTC | #41
On 6/23/23 5:10 PM, Andy Lutomirski wrote:
> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>>
>>> Hopefully you can see where I'm going with this. And this is just one
>>> random tiny example. We can think up tons of other cases to prove BPF
>>> is not isolatable to any sort of "container".
>>
>> No.  You have not come up with an example of why BPF is not isolatable
>> to a container.  You have come up with an example of why binding to a
>> sched_switch raw tracepoint does not make sense in a container without
>> additional mechanisms to give it well defined functionality and
>> appropriate security.

One big blocker for the case that BPF is not isolatable to a container is
CPU hardware bugs. There has been plenty of mitigation effort so that the
flexibility cannot be abused as a tool, e.g. as discussed in [0], but ultimately
it's a cat and mouse game and vendors are also not really transparent. So
actual reasonable discussion can be resumed once CPU vendors get their
stuff fixed.

   [0] https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks

> Thinking about this some more:
> 
> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.

Agree that proxy is a mess for various reasons stated earlier.

> So here are a couple of possible solutions:
> 
> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.

I don't think it's very practical, meaning the vast majority of applications
out there today are tightly coupled BPF code + user space application, and in
a lot of cases programs are dynamically created. This would require somehow
splitting up parts of your application to run outside the container in hostns
and other parts inside the container. For the sake of the mentioned example
it's something fairly static, but real-world applications look different and
are much more complex.

> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.

Same as above. Programs are in most cases very tightly coupled to the application
itself. I'm not sure if the ask is to redesign/implement all the existing user
space infra.

> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
> 
> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.

Worst case, sure, but it's not the point. These containers which would receive
the tokens are part of your trusted compute base, so it's up to the specific
applications and their surrounding infrastructure with regard to what problem
they solve where, as approved by operators/platform engs to deploy in your cluster.
I don't particularly see that there's a performance problem. Andrii specifically
mentioned /trusted unprivileged applications/.

> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
> 
> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
> 
> --Andy
>
Andy Lutomirski June 24, 2023, 1:59 p.m. UTC | #42
On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
> On 6/23/23 5:10 PM, Andy Lutomirski wrote:
>> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
>>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
>>>
>>>> Hopefully you can see where I'm going with this. And this is just one
>>>> random tiny example. We can think up tons of other cases to prove BPF
>>>> is not isolatable to any sort of "container".
>>>
>>> No.  You have not come up with an example of why BPF is not isolatable
>>> to a container.  You have come up with an example of why binding to a
>>> sched_switch raw tracepoint does not make sense in a container without
>>> additional mechanisms to give it well defined functionality and
>>> appropriate security.
>
> One big blocker for the case that BPF is not isolatable to a container is
> CPU hardware bugs. There has been plenty of mitigation effort so that the
> flexibility cannot be abused as a tool, e.g. as discussed in [0], but ultimately
> it's a cat and mouse game and vendors are also not really transparent. So
> actual reasonable discussion can be resumed once CPU vendors get their
> stuff fixed.
>
>    [0] 
> https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks
>

By this standard, shouldn’t we just give up?  Let everyone map /dev/mem readonly and stop pretending we can implement any form of access control.

Of course, we don’t do this. We try pretty hard to squash bugs and keep programs from doing an end run around OS security.

>> Thinking about this some more:
>> 
>> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.
>
> Agree that proxy is a mess for various reasons stated earlier.
>
>> So here are a couple of possible solutions:
>> 
>> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.
>
> I don't think it's very practical, meaning the vast majority of applications
> out there today are tightly coupled BPF code + user space application, and in
> a lot of cases programs are dynamically created. This would require somehow
> splitting up parts of your application to run outside the container in hostns
> and other parts inside the container. For the sake of the mentioned example
> it's something fairly static, but real-world applications look different and
> are much more complex.
>

It sounds like you are describing a situation where there is a workload in a container, where the *entire container* is part of the TCB, but the part of the workload that has the explicit right to read all of kernel memory (e.g. bpf_probe_read_kernel) is so tightly coupled to the container that no one outside the container wants to audit it.

And yet someone still wants to run it in a userns.
 
This is IMO a rather bizarre situation.

If I were operating a large fleet, and I had teams developing software to run in a container, I would not want to grant those containers this right without strict controls, and I don’t mean on/off controls. I would want strict auditing of *what exact BPF code* (including source) was run, and why, and who wrote it, and what the intended results are, and what limits access to the results, etc.  After all, we’re talking about the right, BY DESIGN, to access PII, payment card information, medical information, information protected by any jurisdiction’s data control rights, etc. Literally everything.  This ability, as described, isn’t “the right to use BPF.”  It is the right to *read all secrets*, intentionally.  (And modify them, with bpf_probe_write_user, possibly subject to some constraints.)


If this series was about passing a “may load kernel modules” token around, I think it would get an extremely chilly reception, even though we have module signatures.  I don’t see anything about BPF that makes BPF tokens more reasonable unless a real security model is developed first.

>> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.
>
> Same as above. Programs are in most cases very tightly coupled to the
> application itself. I'm not sure if the ask is to redesign/implement
> all the existing user space infra.
>
>> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
>> 
>> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.
>
> Worst case, sure, but it's not the point. These containers which would
> receive the tokens are part of your trusted compute base, so it's up to
> the specific applications and their surrounding infrastructure with
> regard to what problem they solve where, as approved by
> operators/platform engs to deploy in your cluster. I don't particularly
> see that there's a performance problem. Andrii specifically mentioned
> /trusted unprivileged applications/.
>
>> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
>> 
>> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
>> 
>> --Andy
>>
Andy Lutomirski June 24, 2023, 3:28 p.m. UTC | #43
On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:

>
> If this series was about passing a “may load kernel modules” token 
> around, I think it would get an extremely chilly reception, even though 
> we have module signatures.  I don’t see anything about BPF that makes 
> BPF tokens more reasonable unless a real security model is developed 
> first.
>

To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.

BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.

(The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)

If new, more restrictive functions are needed, they could be added.


Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of functions it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.
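
(The parsing step, at least, is cheap to prototype with libbpf today -- e.g. walking the instructions to see which helpers a program calls; the actual policy decision is omitted in this sketch:)

#include <bpf/libbpf.h>
#include <linux/bpf.h>
#include <stdio.h>

/* sketch: enumerate the helper ids a to-be-approved program calls */
static void list_helpers(const struct bpf_program *prog)
{
    const struct bpf_insn *insn = bpf_program__insns(prog);
    size_t i, cnt = bpf_program__insn_cnt(prog);

    for (i = 0; i < cnt; i++) {
        /* src_reg == 0 means a call to a BPF helper, as opposed to
         * a bpf-to-bpf subprogram call or a kfunc call */
        if (insn[i].code == (BPF_JMP | BPF_CALL) &&
            insn[i].src_reg == 0)
            printf("calls helper id %d\n", insn[i].imm);
    }
}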
Daniel Borkmann June 26, 2023, 3:23 p.m. UTC | #44
On 6/24/23 5:28 PM, Andy Lutomirski wrote:
> On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
>> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
>>
>> If this series was about passing a “may load kernel modules” token
>> around, I think it would get an extremely chilly reception, even though
>> we have module signatures.  I don’t see anything about BPF that makes
>> BPF tokens more reasonable unless a real security model is developed
>> first.
> 
> To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.
> 
> BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.
> 
> (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to networks over which one has CAP_NET_ADMIN.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)
> 
> If new, more restrictive functions are needed, they could be added.

Wasn't this the idea of the BPF tokens proposal, meaning you could create them with
restricted access as you mentioned -- allowing an explicit subset of program types to
be loaded, a subset of helpers/kfuncs, map types, etc. Given you pass in this token
context upon program load-time (resp. map creation), the verifier is then extended
for restricted access. For example, see the bpf_token_allow_{cmd,map_type,prog_type}()
in this series. The user namespace relation was part of the use cases, but not strictly
part of the mechanism itself in this series.
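
Roughly, from the user space side (sketch; this only builds against the uapi
from this series and the exact attr field names may differ):

#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sketch: create a token that only permits a narrow slice of bpf();
 * BPF_TOKEN_CREATE and token_create exist only with this series applied */
int create_restricted_token(void)
{
    union bpf_attr attr = {};

    /* allow only these two commands ... */
    attr.token_create.allowed_cmds =
        (1ULL << BPF_MAP_CREATE) | (1ULL << BPF_PROG_LOAD);
    /* ... and only these map/program types */
    attr.token_create.allowed_map_types = 1ULL << BPF_MAP_TYPE_ARRAY;
    attr.token_create.allowed_prog_types =
        1ULL << BPF_PROG_TYPE_SOCKET_FILTER;

    return syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));
}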

With regards to the scoping, are you saying that the current design with the bitmasks
in the token create uapi is not flexible enough? If yes, what concrete alternative do
you propose?

> Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of functions it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.

Thinking about this from a k8s environment angle, I think this wouldn't really be
practical for various reasons: you now need to maintain two implementations for your
container images which ship BPF -- one which loads programs as today, and another
one which talks to this proxy if available; then you also need to standardize and
support the various loader libraries for this; you need to deal with yet one more
component in your cluster which could fail (compared to talking to the kernel
directly); and being dependent on new proxy functionality becomes similar to waiting
for new kernels to hit mainstream -- it could potentially take a very long time until
production upgrades. What is being proposed here in this regard is less complex given
no extra proxy is involved. I would certainly prefer a kernel-based solution.

Thanks,
Daniel
Andrii Nakryiko June 26, 2023, 10:08 p.m. UTC | #45
On Thu, Jun 22, 2023 at 6:03 PM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> > On Thu, Jun 22, 2023 at 10:38 AM Maryam Tahhan <mtahhan@redhat.com> wrote:
> >
> > As for CAP_BPF being too broad: it is broad, yes. If you have good ideas how to
> > break it down some more -- please propose. But this is all orthogonal,
> > because the blocking problem is fundamental incompatibility of user
> > namespaces (and their implied isolation and sandboxing of workloads)
> > and BPF functionality, which is global by its very nature. The latter
> > is unavoidable in principle.
>
> How, exactly, is BPF global by its very nature?
>
> The *implementation* has some issues with globalness.  Much of it should be fixable.
>

bpf_probe_read_kernel() is widely used and required for real-world
applications. It's global by its nature and in principle not
restrictable. We can say that we'll just disable applications that use
bpf_probe_read_kernel(), but the goal is to enable applications that
are *practically useful*, not just some restricted set of programs
that are provably contained.
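
To make that concrete, this is about all it takes (minimal sketch, assuming
vmlinux.h and a CO-RE toolchain):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/do_sys_openat2")
int BPF_KPROBE(probe_open)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    char comm[16];

    /* reads arbitrary kernel memory; nothing here is scoped to the
     * netns/userns/cgroup of whoever loaded this program */
    bpf_probe_read_kernel(comm, sizeof(comm), &task->comm);
    bpf_printk("openat by %s", comm);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";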

> >
> > No matter how much you break down CAP_BPF, you can't enforce that BPF
> > program won't interfere with applications in other containers. Or that
> > it won't "spy" on them. It's just not what BPF can enforce in
> > principle.
>
> The WHOLE POINT of the verifier is to attempt to constrain what BPF programs can and can't do.  There are bugs -- I get that.  There are helper functions that are fundamentally global.  But, in the absence of verifier bugs, BPF has actual boundaries to its functionality.

Looking at your other replies, I think you realized yourself that
there are valid use cases where it's impossible to statically validate
boundaries.

>
> >
> > So that comes back down to a question of trust and then controlled
> > delegation of BPF functionality. You trust workload with BPF usage
> > because you reviewed the BPF code, workload, testing, etc? Grant BPF
> > token and let that container use limited subset of BPF. Employ BPF LSM
> > to further restrict it beyond what BPF token can control.
> >
> > You cannot trust an application to not do something harmful? You
> > shouldn't grant it either CAP_BPF in init namespace, nor BPF token in
> > user namespace. That's it. Pick your poison.
>
> I think what's lost here is hardening vs restricting intended functionality.
>
> We have access control to restrict intended functionality.  We have other (and generally fairly ad-hoc and awkward) ways to flip off functionality because we want to reduce exposure to any bugs in it.
>
> BPF needs hardening -- this is well established.  Right now, this is accomplished by restricting it to global root (effectively).  It should have access controls, too, but it doesn't.
>
> >
> > But all this cannot be mechanically decided or enforced. There has to
> > be some humans involved in making these decisions. Kernel's job is to
> > provide building blocks to grant and control BPF functionality to the
> > extent that it is technically possible.
> >
>
> Exactly.  And it DOES NOT.  bpf maps, etc. do not have sensible access controls.  Things that should not be global are global.  I'm saying the kernel should fix THAT.  Once it's in a state that it's at least credible to allow BPF in a user namespace, then come up with a way to allow it.
>
> > As for "something to isolate the pinned maps/progs by different apps
> > (why not DAC rules?)", there is no such thing, as I've explained
> > already.
> >
> > I can install sched_switch raw_tracepoint BPF program (if I'm allowed
> > to), and that program has system-wide observability. It cannot be
> > bound to an application.
>
> Great, a real example!
>
> Either:
>
> (a) don't run this in a container.  Have a service for the container to request the help of this program.
>
> (b) have a way to have root approve a particular program and expose *that* program to the container, and let the program have its own access controls internally (e.g. only output info that belongs to that container).
>
> > then what do we do when we switch from process A in container
> > X to process B in container Y? Is that event belonging to container X?
> > Or container Y?
>
> I don't know, but you had better answer this question before you run this thing in a container, not just for security but for basic functionality.  If you haven't defined what your program is even supposed to do in a container, don't run it there.

I think you are missing the point I'm making. A specific BPF program
that will use sched_switch is doing the correct and right thing (for
whatever that means in a specific case). We as humans designed,
implemented, validated, reviewed it and are confident enough (as much
as we can be with software) that it does the right thing. It doesn't
try to spy on things, doesn't try to disrupt things.

We know this as humans thanks to our internal development process.

But this is not *provable* in a mechanical sense such that the kernel
can validate and enforce this. And yet it's a practically useful
application which we'd like to be able to launch from inside the
container without rearchitecting and rewriting the entire world and
proxying everything through some external root service.

>
>
> > Hopefully you can see where I'm going with this. And this is just one
> > random tiny example. We can think up tons of other cases to prove BPF
> > is not isolatable to any sort of "container".
>
> No.  You have not come up with an example of why BPF is not isolatable to a container.  You have come up with an example of why binding to a sched_switch raw tracepoint does not make sense in a container without additional mechanisms to give it well defined functionality and appropriate security.
>
> Please stop conflating BPF (programs, maps, etc) with *attachments* of BPF programs to systemwide things.  They're both under the BPF umbrella.  They're not the same thing.

I'm not conflating things. Thinking about BPF maps and BPF programs in
isolation from them being attached somewhere in the kernel and doing
actual and useful work is not useful.

It's the end-to-end functionality, including attaching and running BPF
programs, that matters.

Pedantically drawing the line at the BPF program load step and saying
"this is BPF and everything else is not BPF" isn't really helpful. No
one cares about just loading and validating BPF programs. Developers
care about attaching and running them; that's what it's all about.

>
> Passing a token into a container that allows that container to do things like loading its own programs *and attaching them to raw tracepoints* is IMO a complete nonstarter.  It makes no sense.
Andrii Nakryiko June 26, 2023, 10:08 p.m. UTC | #46
On Thu, Jun 22, 2023 at 8:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Jun 22, 2023, at 12:05 PM, Andrii Nakryiko wrote:
> > On Thu, Jun 22, 2023 at 9:50 AM Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>
> >>
> >> On Thu, Jun 22, 2023, at 1:22 AM, Maryam Tahhan wrote:
> >> > On 22/06/2023 00:48, Andrii Nakryiko wrote:
> >> >>
> >> >>>>> Giving a way to enable BPF in a container is only a small part of the overall task -- making BPF behave sensibly in that container seems like it should also be necessary.
> >> >>>> BPF is still a privileged thing. You can't just say that any
> >> >>>> unprivileged application should be able to use BPF. That's why BPF
> >> >>>> token is about trusting unpriv application in a controlled environment
> >> >>>> (production) to not do something crazy. It can be enforced further
> >> >>>> through LSM usage, but in a lot of cases, when dealing with internal
> >> >>>> production applications it's enough to have a proper application
> >> >>>> design and rely on code review process to avoid any negative effects.
> >> >>> We really shouldn’t be creating new kinds of privileged containers that do uncontained things.
> >> >>>
> >> >>> If you actually want to go this route, I think you would do much better to introduce a way for a container manager to usefully proxy BPF on behalf of the container.
> >> >> Please see Hao's reply ([0]) about his and Google's (not so rosy)
> >> >> experiences with building and using such BPF proxy. We (Meta)
> >> >> internally didn't go this route at all and strongly prefer not to.
> >> >> There are lots of downsides and complications to having a BPF proxy.
> >> >> In the end, this is just shuffling around where the decision about
> >> >> trusting a given application with BPF access is being made. BPF proxy
> >> >> adds lots of unnecessary logistical, operational, and development
> >> >> complexity, but doesn't magically make anything safer.
> >> >>
> >> >>    [0] https://lore.kernel.org/bpf/CA+khW7h95RpurRL8qmKdSJQEXNYuqSWnP16o-uRZ9G0KqCfM4Q@mail.gmail.com/
> >> >>
> >> > Apologies for being blunt, but the token approach to me seems to be a
> >> > workaround for providing the right level/classification for a pod/container
> >> > in order to say you support unprivileged containers using eBPF. I think
> >> > if your container needs to do privileged things, it should have and be
> >> > classified with the right permissions (privileges) to do what it needs
> >> > to do.
> >>
> >> Bluntness is great.
> >>
> >> I think that this whole level/classification thing is utterly wrong.  Replace "BPF" with basically anything else, and you'll see how absurd it is.
> >
> > BPF is not "anything else", it's important to understand that BPF is
> > inherently not compartmentalizable. And it's vast and generic in its
> > capabilities. This changes everything. So your analogies are
> > misleading.
> >
>
> file descriptors are "vast and generic" -- you can open sockets, files, things in /proc, things in /sys, device nodes, etc.  They are infinitely extensible.  They work in containers.
>
> What is so special about BPF?

A socket has a well-defined and constrained interface that defines what
you can do with it (send and receive bytes, in a controlled fashion),
while BPF programs are intentionally allowed to have an almost
arbitrarily complex control flow *controlled by the user*, can combine
dozens if not hundreds of "building blocks" (BPF helpers, kfuncs,
various BPF maps, etc.), and can be activated at various points
deep in the kernel (and run that custom user-provided code in kernel
space). I'd say that yeah, BPF is on another level as far as
genericity goes, compared to other interfaces.

And that's BPF's goal and appeal, nothing wrong with it. But I do
think BPF and sockets, files, things in /proc, etc are pretty
different in terms of how they can be proved and enforced to be
sandboxed.

>
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using files on disk"
> >>
> >> That's very 1990's.  Maybe 1980's.  Of *course* giving access to a filesystem has some inherent security exposure.  So we can give containers access to *different* filesystems.  Or we can use ACLs.  Or MAC policy.  Or whatever.  We have many solutions, none of which are perfect, and we're doing okay.
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using the network"
> >>
> >> The network is a big deal.  For some reason, it's cool these days to treat TCP as highly privileged.  You can get secrets from your favorite (or least favorite) cloud provider with unauthenticated HTTP to a magic IP and port.  You can bypass a whole lot of authenticating/authorizing proxies with unauthenticated HTTP (no TLS!) if you're on the right network.
> >>
> >> This is IMO obnoxious, but we deal with it by having network namespaces and firewalls and rather outdated port <= 1024 rules.
> >>
> >> "the token approach to me seems like a work around providing the right level/classification for a pod/container in order to say you support unprivileged containers using BPF"
> >>
> >> My response is: what's wrong with BPF?  BPF has maps and programs and such, and we could easily apply 1990's style ownership and DAC rules to them.
> >
> > Can you apply DAC rules to which kernel events a BPF program can be run
> > on? Can you apply DAC rules to which in-kernel data structures a BPF
> > program can look at and make sure that it doesn't access a
> > task/socket/etc that "belongs" to some other container/user/etc?
>
> No, of course not.
>
> If you have a BPF program that is granted the ability to read kernel data structures or to run in response to global events like this, it's basically a kernel module.  It may be subject to a verifier that imposes much stronger type safety than a kernel module is subject to, but it's still effectively a kernel module.
>
> We don't give containers special tokens that let them load arbitrary modules.  We should not give them special tokens that let them do things with BPF that are functionally equivalent to loading arbitrary kernel modules.
>
> But we do have ways that kernel modules (which are "vast and generic", too) can expose their functionality safely to containers.  BPF can learn to do this.
>
> >
> > Can we limit XDP or AF_XDP BPF programs from seeing and controlling
> > network traffic that will be eventually routed to a container that XDP
> > program "should not" have access to? Without making everything so slow
> > that it's useless?
>
> Of course you can -- assign an entire NIC or virtual function to a container, and let the XDP program handle that.  Or a vlan or a macvlan or whatever.  (I'm assuming XDP can be scoped like this.  I'm not that familiar with the details.)
>
> >
> >> I even *wrote the code*.
> >
> > Did you submit it upstream for review and wide discussion?
>
> Yes.
>
> > Did you
> > test it and integrate it with production workloads to prove that your
> > solution is actually a viable real-world solution and not a toy?
>
> I did test it.  I did not integrate it with production workloads.
>

Real-world use cases are the ultimate test of APIs and features. No
matter how brilliant and elegant the solution is, if it doesn't work
with real-world applications, it's pretty useless.

It's not that hard to allow only a very limited and very restrictive
subset of BPF to be loaded and attached from containers
without privileged permissions. But the point is to find a solution
that works for complicated (and sometimes very messy) real
applications that were validated by humans (to the best of their
abilities), but can't be proven to be contained within some container.


> > Writing the code doesn't mean solving the problem.
>
> Of course not.  My code was a little step in the right direction.  The BPF community was apparently not interested in it.
>
> >
> >> But for some reason, the BPF community wants to bury its head in the sand, pretend it's 1980, declare that BPF is too privileged to have access control, and instead just have a complicated switch to turn it on and off in different contexts.
> >
> > I won't speak on behalf of the entire BPF community, but I'm trying to
> > explain that BPF cannot be reasonably sandboxed and has to be
> > privileged due to its global nature. And I haven't yet seen any
> > realistic counter-proposal to change that. And it's not about
> > ownership of the BPF map or BPF program, it's way beyond that..
> >
>
> It's really really hard to have a useful discussion about a security model when you have, as what appears to be an axiom, that a security model can't be created.
>
> If you actually feel this way, then I think you should not be advocating for allowing unprivileged containers to do the things that you think can't have a security model.
>
> I'm saying that I think there *can* be a security model.  But until the maintainers start to believe that, there won't be one.

See above, whatever security model you have in mind, it should be
workable with real-world applications. Building some elegant system
that will work for just a (rather small) subset of use cases isn't
appealing.
Andrii Nakryiko June 26, 2023, 10:08 p.m. UTC | #47
On Fri, Jun 23, 2023 at 3:18 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/16/23 12:48 AM, Andrii Nakryiko wrote:
> > On Wed, Jun 14, 2023 at 2:39 AM Christian Brauner <brauner@kernel.org> wrote:
> >> On Wed, Jun 14, 2023 at 02:23:02AM +0200, Djalal Harouni wrote:
> >>> On Tue, Jun 13, 2023 at 12:27 AM Andrii Nakryiko
> >>> <andrii.nakryiko@gmail.com> wrote:
> >>>> On Mon, Jun 12, 2023 at 5:02 AM Djalal Harouni <tixxdz@gmail.com> wrote:
> >>>>> On Sat, Jun 10, 2023 at 12:57 AM Andrii Nakryiko
> >>>>> <andrii.nakryiko@gmail.com> wrote:
> >>>>>> On Fri, Jun 9, 2023 at 3:30 PM Djalal Harouni <tixxdz@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Andrii,
> >>>>>>>
> >>>>>>> On Thu, Jun 8, 2023 at 1:54 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >>>>>>>>
> >>>>>>>> ...
> >>>>>>>> creating new BPF objects like BPF programs, BPF maps, etc.
> >>>>>>>
> >>>>>>> Is there a reason for coupling this only with the userns?
> >>>>>>
> >>>>>> There is no coupling. Without userns it is at least possible to grant
> >>>>>> CAP_BPF and other capabilities from init ns. With user namespace that
> >>>>>> becomes impossible.
> >>>>>
> >>>>> But these are not the same: delegate full cap vs delegate an fd mask?
> >>>>
> >>>> What FD mask are we talking about here? I don't recall us talking
> >>>> about any FD masks, so this one is a bit confusing without more
> >>>> context.
> >>>
> >>> Ah err, sorry yes referring to fd token (which I assumed is a mask of
> >>> allowed operations or something like that).
> >>>
> >>> So I want the possibility to delegate the fd token in the init userns.
> >>>
> >>>>>
> >>>>> One can argue unprivileged in init userns is the same as privileged in
> >>>>> nested userns.
> >>>>> Getting to delegate fd in init userns, then in nested ones seems logical...
> >>>>
> >>>> Again, sorry, I'm not following. Can you please elaborate what you mean?
> >>>
> >>> I mean can we use the fd token in the init user namespace too? not
> >>> only in the nested user namespaces but in the first one? Sorry I
> >>> didn't check the code.
> >>>
> >
> > [...]
> >
> >>>
> >>>>> Having the fd or "token" that gives access rights pinned in two
> >>>>> separate bpffs mounts seems too much, it crosses namespaces (mount,
> >>>>> userns etc), environments setup by privileged...
> >>>>
> >>>> See above, there is nothing namespaceable about BPF itself, and BPF
> >>>> token as well. If some production setup benefits from pinning one BPF
> >>>> token in multiple places, I don't see the problem with that.
> >>>>
> >>>>>
> >>>>> I would just make it per bpffs mount and that's it, nothing more. If a
> >>>>> program wants to bind mount it somewhere else then it's not a bpf
> >>>>> problem.
> >>>>
> >>>> And if some application wants to pin BPF token, why would that be BPF
> >>>> subsystem's problem as well?
> >>>
> >>> The credentials, capabilities, keyring, different namespaces, etc are
> >>> all attached to the owning user namespace, if the BPF subsystem goes
> >>> its own way and creates a token to split up CAP_BPF without following
> >>> that model, then it's definitely a BPF subsystem problem...  I don't
> >>> recommend that.
> >>>
> >>> Feels like it's going more toward a system-wide approach, opening BPF
> >>> functionality, where ultimately it clashes with the argument: delegate
> >>> a subset of BPF functionality to a *trusted* unprivileged application.
> >>> My reading of delegation is within a container/service hierarchy
> >>> nothing more.
> >>
> >> You're making the exact arguments that Lennart, Aleksa, and I have been
> >> making in the LSFMM presentation about this topic. It's even recorded:
> >
> > Alright, so (I think) I get a pretty good feel now for what the main
> > concerns are, and why people are trying to push this to be an FS. And
> > it's not so much that BPF token grants bpf() syscall usage to unpriv
> > (but trusted) workloads or that BPF itself is not namespaceable. The
> > main worry is that BPF token, once issued, could be
> > illegally/uncontrollably passed outside of the container, intentionally or
> > not. And by having this association with mount namespace (through BPF
> > FS) we automatically limit the sharing to only containers that have access
> > to that BPF FS.
>
> +1
>
> > So I agree that it makes sense to have this mount namespace
> > association, but I also would like to keep BPF token to be a separate
> > entity from BPF FS itself, and have the ability to have multiple
> > different BPF tokens exposed in a single BPF FS instance. I think the
> > latter is important.
> >
> > So how about this slight modification: when a BPF token is created
> > using BPF_TOKEN_CREATE command, the user has to provide an FD for
> > "associated" BPF FS instance (superblock). What that does is allows
> > BPF token to be created with BPF FS and/or mount namespace association
> > set in stone. After that BPF token can only be pinned in that BPF FS
> > instance and cannot leave the boundaries of that mount namespace
> > (specific details to be worked out, this is a new area for me, so I'm
> > sorry if I'm missing nuances).
>
> Given bpffs is not a singleton and there can be multiple bpffs instances
> in a container, couldn't we make the token a special bpffs mount/mode?
> Something like a single .token file in that mount (for example) which can
> be opened and the fd then passed along for prog/map creation? And given
> the multiple mounts, this also allows potentially for multiple tokens?
> In other words, this is already set up by the container manager when it
> sets up mounts rather than later, and the regular bpffs instance is something
> separate from all that. Meaning, in your container you get the usual
> bpffs instance and then one or more special bpffs instances as tokens
> at different paths (and in future they could unlock different subsets of
> bpf functionality, for example).

Just from a technical point of view we could do that. But I see a lot
of value in keeping BPF token creation as part of the BPF syscall and its
API. And the main issue, I believe, was not allowing BPF token to
escape the intended container, which should be more than covered by
BPF_TOKEN_CREATE pinning a token into the provided BPF FS instance and not
allowing it to be repinned after that.
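
In (pseudo-)code, the flow I have in mind would be something like this
(rough sketch, field and path names not final):

#include <bpf/bpf.h>
#include <fcntl.h>
#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sketch against the proposal above; token_create/bpffs_fd are
 * approximations, not a settled uapi */
void manager_side(void)
{
    /* privileged side: tie the new token to one bpffs instance */
    int bpffs_fd = open("/containers/foo/bpffs", O_RDONLY);
    union bpf_attr attr = { .token_create = { .bpffs_fd = bpffs_fd } };
    int token_fd = syscall(__NR_bpf, BPF_TOKEN_CREATE, &attr, sizeof(attr));

    /* pinning is only allowed into that same instance, so the token
     * can't usefully leave the container's mount ns */
    bpf_obj_pin(token_fd, "/containers/foo/bpffs/token");
}

int container_side(void)
{
    /* same bpffs instance, seen at the container's own mount point */
    return bpf_obj_get("/sys/fs/bpf/token");
}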

>
> Thanks,
> Daniel
Andrii Nakryiko June 26, 2023, 10:08 p.m. UTC | #48
On Fri, Jun 23, 2023 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> >> applications meets the needs of these PODs that need to do
> >> privileged/bpf things without any tokens. Ultimately you are trusting
> >> these apps in the same way as if you were granting a token.
> >
> > Yes, absolutely. As I mentioned very explicitly, it's the question of
> > trusting the application. Service vs token is an implementation detail, but
> > one that has huge implications in how applications are built,
> > tested, versioned, deployed, etc.
>
> So one thing that I don't really get is why such a "trusted application"
> needs to be run in a user namespace in the first place? If it's trusted,
> why not simply run it as a privileged container (without the user
> namespace) and grant it the right system-level capabilities, instead of
> going to all this trouble just to punch a hole in the user namespace
> isolation?

Because it's still useful to get the isolation that the user namespace
provides in all other aspects besides BPF usage.

The fact that it's a trusted application doesn't mean that bugs don't
happen, or that some action that was not intended won't be attempted
(due to a bug, some deep unintended library "feature", or just because
someone didn't anticipate some interaction).

Trusted here means we believe our BPF usage is not going to spy on
sensitive data, or attempt to disrupt other workloads, because of
design and code reviews, and we intend to maintain that property. But
people are still involved, of course, and bugs do happen. We'd like to
get as much protection as possible, and that's what the user namespace
is offering.

For the BPF side of things, we have to trust the process because there is
no technical solution. Running outside the user namespace we also
don't have any guarantees about BPF. We just have even less protection
in all other aspects outside of BPF. We are trying to improve our
story with user namespace to mitigate what's mitigatable.


>
> -Toke
>
Andrii Nakryiko June 26, 2023, 10:31 p.m. UTC | #49
On Sat, Jun 24, 2023 at 7:00 AM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
> > On 6/23/23 5:10 PM, Andy Lutomirski wrote:
> >> On Thu, Jun 22, 2023, at 6:02 PM, Andy Lutomirski wrote:
> >>> On Thu, Jun 22, 2023, at 11:40 AM, Andrii Nakryiko wrote:
> >>>
> >>>> Hopefully you can see where I'm going with this. And this is just one
> >>>> random tiny example. We can think up tons of other cases to prove BPF
> >>>> is not isolatable to any sort of "container".
> >>>
> >>> No.  You have not come up with an example of why BPF is not isolatable
> >>> to a container.  You have come up with an example of why binding to a
> >>> sched_switch raw tracepoint does not make sense in a container without
> >>> additional mechanisms to give it well defined functionality and
> >>> appropriate security.
> >
> > One big blocker for the case that BPF is not isolatable to a container is
> > CPU hardware bugs. There has been plenty of mitigation effort so that the
> > flexibility cannot be abused as a tool, e.g. as discussed in [0], but ultimately
> > it's a cat and mouse game and vendors are also not really transparent. So
> > actual reasonable discussion can be resumed once CPU vendors get their
> > stuff fixed.
> >
> >    [0]
> > https://popl22.sigplan.org/details/prisc-2022-papers/11/BPF-and-Spectre-Mitigating-transient-execution-attacks
> >
>
> By this standard, shouldn’t we just give up?  Let everyone map /dev/mem readonly and stop pretending we can implement any form of access control.
>
> Of course, we don’t do this. We try pretty hard to squash bugs and keep programs from doing an end run around OS security.
>
> >> Thinking about this some more:
> >>
> >> Suppose the goal is to allow a workload in a container to monitor itself by attaching to a tracepoint (something in the scheduler, for example).  The workload is in the container.  The tracepoint is global.  Kernel memory is global unless something that is trusted and understands the containers is doing the reading.  And proxying BPF is a mess.
> >
> > Agree that proxy is a mess for various reasons stated earlier.
> >
> >> So here are a couple of possible solutions:
> >>
> >> (a) Improve BPF maps a bit so that BPF maps work well in containers.  It should be possible to create a map and share it (the file descriptor!) between the outside and the container without running into various snags.  (IIRC my patch series was a decent step in this direction.)  Now load the BPF program and attach it to the tracepoint outside the container but have it write its gathered data to the map that's in the container.  So you end up with a daemon outside the container that gets a request like "help me monitor such-and-such by running BPF program such-and-such (where the BPF program code presumably comes from a library outside the container)", and the daemon arranges for the requesting container to have access to the map it needs to get the data.
> >
> > I don't think it's very practical, meaning the vast majority of applications
> > out there today are tightly coupled BPF code + user space application, and in
> > a lot of cases programs are dynamically created. This would require somehow
> > splitting up parts of your application to run outside the container in hostns
> > and other parts inside the container.. for the sake of the mentioned example
> > it's something fairly static, but real-world applications look different and
> > are much more complex.
> >
>
> It sounds like you are describing a situation where there is a workload in a container, where the *entire container* is part of the TCB, but the part of the workload that has the explicit right to read all of kernel memory (e.g. bpf_probe_read_kernel) is so tightly coupled to the container that no one outside the container wants to audit it.
>
> And yet someone still wants to run it in a userns.
>

Yes, to get all the other benefits of userns. Yes, BPF isolation
cannot be enforced, and we rely on a human-driven process to decide
whether it's ok to run BPF inside each specific container. But why
can't we also get all the other benefits of userns outside of BPF
usage?

BPF parts are critical for such applications, but they also normally
have a huge user-space part, and use large common libraries, so there
is a lot of benefit to having as much userns-provided isolation as
possible.


> This is IMO a rather bizarre situation.
>
> If I were operating a large fleet, and I had teams developing software to run in a container, I would not want to grant those containers this right without strict controls, and I don’t mean on/off controls. I would want strict auditing of *what exact BPF code* (including source) was run, and why, and who wrote it, and what the intended results are, and what limits access to the results, etc.  After all, we’re talking about the right, BY DESIGN, to access PII, payment card information, medical information, information protected by any jurisdiction’s data control rights, etc. Literally everything.  This ability, as described, isn’t “the right to use BPF.”  It is the right to *read all secrets*, intentionally.  (And modify them, with bpf_probe_write_user, possibly subject to some constraints.)

What makes you think this is not how it's actually done in practice
already (except right now we don't have BPF token, so it's
all-or-nothing, userns or not, root or not, which is overall worse
than what we'll get with BPF token + userns)?

Audit, code review, proper development practices. Then discussions and
reviews between the team running the container manager and the team
with the BPF-based workload to decide whether it's safe to allow BPF
access (and to what degree) and how the teams will maintain privacy
and safety obligations.


>
>
> If this series was about passing a “may load kernel modules” token around, I think it would get an extremely chilly reception, even though we have module signatures.  I don’t see anything about BPF that makes BPF tokens more reasonable unless a real security model is developed first.

If we had dozens of teams developing and loading/unloading their
custom kernel modules all the time, it might not have sounded so
ridiculous?

>
> >> (b) Make a way to pass a pre-approved program into a container.  So a daemon outside loads the program and does some new magic to say "make an fd that can be used to attach this particular program to this particular tracepoint" and pass that into the container.
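
(With today's primitives, (b) is close to passing an already-loaded
program's fd around; a minimal sketch with libbpf, noting that the
attach step itself still requires CAP_BPF/CAP_PERFMON in the
container, which is exactly the gap (b) would need new kernel support
to close:)

#include <bpf/bpf.h>

int attach_preapproved(int prog_fd)
{
	/* prog_fd was loaded, and thus "approved", by the outside
	 * daemon, then received here via SCM_RIGHTS; the tracepoint
	 * name is agreed upon out of band */
	return bpf_raw_tracepoint_open("sched_switch", prog_fd);
}
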
> >
> > Same as above. Programs are in most cases very tightly coupled to the
> > application itself. I'm not sure if the ask is to redesign/implement
> > all the existing user space infra.
> >
> >> I think (a) is better.  In particular, if you have a workload with many containers, and they all want to monitor the same tracepoint as it relates to their container, you will get much better performance if a single BPF program does the monitoring and sends the data out to each container as needed instead of having one copy of the program per container.
> >>
> >> For what it's worth, BPF tokens seem like they'll have the same performance problem -- without coordination, you can end up with N containers generating N hooks all targeting the same global resource, resulting in overhead that scales linearly with the number of containers.
> >
> > Worst case, sure, but it's not the point. These containers which would
> > receive the tokens are part of your trusted computing base.. so it's
> > up to the specific applications and their surrounding infrastructure
> > with regards to what problem they solve where, as approved by
> > operators/platform engs, to deploy in your cluster. I don't
> > particularly see that there's a performance problem. Andrii
> > specifically mentioned /trusted unprivileged applications/.

Yep, performance is not why this is being done.

> >
> >> And, again, I'm not an XDP expert, but if you have one NIC, and you attach N XDP programs to it, and each one is inspecting packets and sending some to one particular container's AF_XDP socket, you are not going to get good performance.  You want *one* XDP program fanning the packets out to the relevant containers.
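
(For reference, the "one program fanning out" shape is a standard
XDP/AF_XDP pattern; a minimal sketch, with the per-packet
classification left as a placeholder:)

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsks SEC(".maps");

SEC("xdp")
int fanout(struct xdp_md *ctx)
{
	/* placeholder: map each packet to a container's slot; here we
	 * just reuse the rx queue index */
	__u32 idx = ctx->rx_queue_index;

	/* redirect into that container's AF_XDP socket, or fall back
	 * to the regular stack if no socket is bound at idx */
	return bpf_redirect_map(&xsks, idx, XDP_PASS);
}

char LICENSE[] SEC("license") = "GPL";
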
> >>
> >> If this is hard right now, perhaps you could add new kernel mechanisms as needed to improve the situation.
Djalal Harouni June 27, 2023, 10:22 a.m. UTC | #50
On Sat, Jun 24, 2023 at 5:28 PM Andy Lutomirski <luto@kernel.org> wrote:
>
>
>
> On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
> > On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
>
> >
> > If this series was about passing a “may load kernel modules” token
> > around, I think it would get an extremely chilly reception, even though
> > we have module signatures.  I don’t see anything about BPF that makes
> > BPF tokens more reasonable unless a real security model is developed
> > first.
> >
>
> To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.
>
> BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.
>
> (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to network namespaces over which one has CAP_NET_ADMIN.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)
>
> If new, more restrictive functions are needed, they could be added.
>

This seems to align with BPF fd/token delegation. I asked in another
thread if more context/policies could be provided from user space when
configuring the fd, and the answer was: it can be added on top as a
follow-up...

The user namespace is just one single use case of many, as also
confirmed in this reply [0]. Getting it to work in the init userns
should be the first logical step anyway; then, once you have an fd,
you can delegate it or pass it around to children that create nested
user namespaces, etc., as is currently done within container managers
when they set up the environments, including the uid mapping... and of
course there should be some sort of mechanism to ensure that the
delegated fd comes, say, from a parent user namespace before using it,
and to deny any cross-namespace usage...


> Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of function it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.
>

Even a *limited* form of BPF proxying seems to go in the opposite
direction of what you are suggesting above?

If I have an fd or a bpffs mount with a token properly set up by the
manager, I can directly use it inside my containers and load small bpf
programs without talking to the external API of another container... I
assume the manager passed me the rights or already pre-approved the
operation...

Of course, there is also the case of approving the attachment of bpf
programs without passing an fd/token (which I assume is your point),
or in other words denying it, which makes perfect sense indeed. Then
yes: an outside daemon could do this; systemd / container managers
etc., with the help of LSMs, could *deny* attachment of BPF programs
without any external API changes (they already support LSMs). IIRC
there is already an LSM hook in the bpf() syscall to restrict some
program types, so future bpf token work should add in-kernel LSM +
bpf-lsm hooks, ensure they are properly called with the full context,
and restrict further...

So for the "limited form of BPF proxying... to approve attachment..."
I think with fd delegation of bpffs mount (that requires privileges to
set it up) then an in kernel LSM hooks on top to tighten this up is
the way to go
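
As a sketch of that last point: the bpf() syscall already calls a
"bpf" LSM hook with full (cmd, attr, size) context, so a bpf-lsm
program can deny specific operations today. The policy below (deny
loading XDP programs) is made up purely for illustration:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <errno.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm/bpf")
int BPF_PROG(deny_xdp_load, int cmd, union bpf_attr *attr,
	     unsigned int size)
{
	/* illustrative policy: refuse any XDP program load */
	if (cmd == BPF_PROG_LOAD && attr->prog_type == BPF_PROG_TYPE_XDP)
		return -EPERM;
	return 0;
}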


[0] https://lore.kernel.org/bpf/CAEf4BzbjGBY2=XGmTBWX3Vrgkc7h0FRQMTbB-SeKEf28h6OhAQ@mail.gmail.com/
Andy Lutomirski July 4, 2023, 8:48 p.m. UTC | #51
On Mon, Jun 26, 2023, at 8:23 AM, Daniel Borkmann wrote:
> On 6/24/23 5:28 PM, Andy Lutomirski wrote:
>> On Sat, Jun 24, 2023, at 6:59 AM, Andy Lutomirski wrote:
>>> On Fri, Jun 23, 2023, at 4:23 PM, Daniel Borkmann wrote:
>>>
>>> If this series was about passing a “may load kernel modules” token
>>> around, I think it would get an extremely chilly reception, even though
>>> we have module signatures.  I don’t see anything about BPF that makes
>>> BPF tokens more reasonable unless a real security model is developed
>>> first.
>> 
>> To be clear, I'm not saying that there should not be a mechanism to use BPF from a user namespace.  I'm saying the mechanism should have explicit access control.  It wouldn't need to solve all problems right away, but it should allow incrementally more features to be enabled as the access control solution gets more powerful over time.
>> 
>> BPF, unlike kernel modules, has a verifier.  While it would be a departure from current practice, permission to use BPF could come with an explicit list of allowed functions and allowed hooks.
>> 
>> (The hooks wouldn't just be a list, presumably -- permission to install an XDP program would be scoped to network namespaces over which one has CAP_NET_ADMIN.  Other hooks would have their own scoping.  Attaching to a cgroup should (and maybe already does?) require some kind of permission on the cgroup.  Etc.)
>> 
>> If new, more restrictive functions are needed, they could be added.
>
> Wasn't this the idea of the BPF tokens proposal, meaning you could
> create them with restricted access as you mentioned - allowing an
> explicit subset of program types to be loaded, subset of
> helpers/kfuncs, map types, etc.. Given you pass in this token context
> upon program load-time (resp. map creation), the verifier is then
> extended for restricted access. For example, see the
> bpf_token_allow_{cmd,map_type,prog_type}() in this series. The user
> namespace relation was part of the use cases, but not strictly part of
> the mechanism itself in this series.
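
(The bitmasks in question boil down to gates of roughly this shape; a
self-contained model with invented field names, not the literal patch
code:)

#include <stdbool.h>
#include <stdint.h>

struct token_model {
	uint64_t allowed_cmds;       /* bitmask over enum bpf_cmd */
	uint64_t allowed_map_types;  /* bitmask over enum bpf_map_type */
	uint64_t allowed_prog_types; /* bitmask over enum bpf_prog_type */
};

static bool allows(uint64_t mask, unsigned int kind)
{
	return kind < 64 && (mask & (1ULL << kind));
}

/* e.g. may this token create a hash map?
 * (BPF_MAP_CREATE == 0, BPF_MAP_TYPE_HASH == 1) */
static bool may_create_hash_map(const struct token_model *t)
{
	return allows(t->allowed_cmds, 0) &&
	       allows(t->allowed_map_types, 1);
}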

Hmm. It's very coarse grained.

Also, the bpf() attach API seems to be largely (completely?) missing what I would expect to be basic access controls on the things being attached to.   For example, the whole cgroup_bpf_prog_attach() path seems to be entirely missing any checks as to whether its caller has any particular permission over the cgroup in question.  It doesn't even check whether the cgroup is being accessed from the current userns (i.e. whether the fd refers to a struct file with f_path.mnt belonging to the current userns).  So the API in this patchset has no way to restrict permission to attach to cgroups to only apply to cgroups belonging to the container.
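
A hypothetical check of the sort that seems to be missing (the
function is invented; nothing like it is in this series or upstream):

/* before cgroup_bpf_prog_attach() proceeds, require write access to
 * the cgroup directory that the caller's fd refers to, as a stand-in
 * for "may manage this cgroup" */
#include <linux/fs.h>

static int cgroup_attach_permission(struct file *cgrp_file)
{
	struct inode *inode = file_inode(cgrp_file);

	return inode_permission(file_mnt_idmap(cgrp_file), inode,
				MAY_WRITE);
}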

>
> With regards to the scoping, are you saying that the current design
> with the bitmasks in the token create uapi is not flexible enough? If
> yes, what concrete alternative do you propose?
>
>> Alternatively, people could try a limited form of BPF proxying.  It wouldn't need to be a full proxy -- an outside daemon really could approve the attachment of a BPF program, and it could parse the program, examine the list of function it uses and what the proposed attachment is to, and make an educated decision.  This would need some API changes (maybe), but it seems eminently doable.
>
> Thinking about this from a k8s environment angle, I think this
> wouldn't really be practical for various reasons.. you now need to
> maintain two implementations for your container images which ship BPF:
> one which loads programs as today, and another one which talks to this
> proxy if available,

This seems fairly trivially solvable. Agree on an API, say using UNIX sockets to /var/run/bpfd/whatever.socket.  (Or maybe /var/lib?  I’m not sure there’s universal agreement on where things like this go.) The exact same API works uncontained (bpfd running, probably socket-activated, from a binary in the system) and contained, with the socket bind-mounted in from outside.

I don’t know k8s well at all, but it looks like hostPath can do exactly this.  Off the top of my head, I don’t know whether systemd’s .socket can be configured the right way so the same configuration would work contained and uncontained.  One could certainly work around *that* by having two different paths tried in succession, but that seems a bit silly.

This actually seems easier than supplying bpf tokens to a container.
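
The client side of such an API is a dozen lines either way; a sketch,
with the path and protocol made up (only the AF_UNIX plumbing is
standard):

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int connect_bpfd(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	strncpy(addr.sun_path, "/var/run/bpfd/whatever.socket",
		sizeof(addr.sun_path) - 1);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	/* same code runs contained or not, as long as the socket is
	 * present (bind-mounted in, or socket-activated on the host) */
	return fd;
}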

> then you also need to standardize and support the various loader
> libraries for this, you need to deal with yet one more component in
> your cluster which could fail (compared to talking to the kernel
> directly), and being dependent on new proxy functionality becomes
> similar to waiting for new kernels to hit mainstream: it could
> potentially take a very long time until production upgrades. What is
> being proposed here in this regard is less complex given no extra
> proxy is involved. I would certainly prefer a kernel-based solution.

A userspace solution makes it easy to apply some kind of flexible approval and audit policy to the BPF program. I can imagine all kinds of ways that a fleet operator might want to control what can run, and trying to stick it in the kernel seems rather complex and awkward to customize.

I suppose a bpf token could be set up to call out to its creator for permission to load a program, which would involve a different set of tradeoffs.
Andy Lutomirski July 4, 2023, 9:05 p.m. UTC | #52
On Mon, Jun 26, 2023, at 3:08 PM, Andrii Nakryiko wrote:
> On Fri, Jun 23, 2023 at 4:07 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> >> applications meets the needs of these PODs that need to do
>> >> privileged/bpf things without any tokens. Ultimately you are trusting
>> >> these apps in the same way as if you were granting a token.
>> >
>> > Yes, absolutely. As I mentioned very explicitly, it's the question of
>> > trusting application. Service vs token is implementation details, but
>> > the one that has huge implications in how applications are built,
>> > tested, versioned, deployed, etc.
>>
>> So one thing that I don't really get is why such a "trusted application"
>> needs to be run in a user namespace in the first place? If it's trusted,
>> why not simply run it as a privileged container (without the user
>> namespace) and grant it the right system-level capabilities, instead of
>> going to all this trouble just to punch a hole in the user namespace
>> isolation?
>
> Because it's still useful to provide isolation that user namespace
> provides in all other aspects besides BPF usage.
>
> The fact that it's a trusted application doesn't mean that bugs don't
> happen, or that some action that was not intended might be attempted
> (due to a bug, some deep unintended library "feature", or just because
> someone didn't anticipate some interaction).
>
> Trusted here means we believe our BPF usage is not going to spy on
> sensitive data, or attempt to disrupt other workloads, because of
> design and code reviews, and we intend to maintain that property. But
> people are still involved, of course, and bugs do happen. We'd like to
> get as much protection as possible, and that's what the user namespace
> is offering.
>

I'm wondering if your approach makes sense for Meta but maybe not outside Meta.  I think Meta is a bit unusual in that it operates a huge fleet, but the developers of the software in that fleet are a fairly tight group.   (I'm speculating here.  I don't know much about what goes on inside Meta, obviously.)

Concretely, you say "we believe our BPF usage is not going to spy on sensitive data".  Who is this "we"?  The kernel developers?  The people developing the BPF programs?  The people setting policy for the fleet?  The people creating container images that want to use BPF and run within the fleet?  Are these all the same "we"?

For a company with actual outside tenants, or a company that needs to comply with various privacy rules for some, but not all, of its applications, there are a lot of "we"s involved.  Some group develops software (or this is outsourced -- the BPF maintainership is essentially within Meta, after all).  Some group administers the fleet.  Some group develops BPF programs (or downloads them from outside and hopefully vets them).  Some group builds container images that want to use those programs.  Some group deploys these images via kubernetes or whatever.  Some group prepares reports that say that certain services offered comply with PCI or HIPAA or FedRAMP or GDPR or whatever.  They're not all the same people.

Obviously bugs exist and mistakes happen.  But, at the end of the day, someone is going to read a BPF program (or a kernel module, or whatever) and take some degree of responsibility for saying "I read this thing, and I approve its use in a certain context".  And then *that permission* should be granted.  With your patchset as it is, the permission granted is not "run this program I approved" but rather "read all kernel memory".  And I don't think that will fly with a lot of potential users.

> For the BPF side of things, we have to trust the process because there
> is no technical solution. Running outside the user namespace we also
> don't have any guarantees about BPF; we just have even less protection
> in all other aspects outside of BPF. We are trying to improve our
> story with user namespaces to mitigate what's mitigatable.

But there *are* technical solutions.  At least two broad types, as I've been trying to say.

1. Stronger and more flexible controls as to which specific programs can be loaded and run.  The people doing the trusting may very well want to trust specific things (and audit which things they've trusted, etc.)

2. Stronger and more flexible controls as to what programs can do.  Right now, bpf() can attach to essentially any cgroup or tracepoint if it can attach to any at all.  Programs can access all kernel memory (because alternatives to bpf_probe_read_kernel() aren't really available, and there is no incentive right now to add them, because there isn't even a way AFAIK to turn off bpf_probe_read_kernel()).

Progress on either one of these could go a long way.
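
(Point 2 could, for instance, hang a helper allowlist off the token; a
hypothetical verifier-side gate, with the allowlist bitmap invented
for illustration -- nothing like it exists in this series:)

#include <linux/bitops.h>

static bool token_allows_helper(const unsigned long *allowed_helpers,
				unsigned int func_id)
{
	/* no allowlist == fully privileged loader, everything allowed */
	if (!allowed_helpers)
		return true;
	/* e.g. a token could permit map helpers while leaving
	 * bpf_probe_read_kernel() out of the bitmap */
	return test_bit(func_id, allowed_helpers);
}
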
Andy Lutomirski July 4, 2023, 9:06 p.m. UTC | #53
On Tue, Jul 4, 2023, at 1:48 PM, Andy Lutomirski wrote:
> On Mon, Jun 26, 2023, at 8:23 AM, Daniel Borkmann wrote:
>> On 6/24/23 5:28 PM, Andy Lutomirski wrote:
>>
>> Wasn't this the idea of the BPF tokens proposal, meaning you could
>> create them with restricted access as you mentioned - allowing an
>> explicit subset of program types to be loaded, subset of
>> helpers/kfuncs, map types, etc.. Given you pass in this token context
>> upon program load-time (resp. map creation), the verifier is then
>> extended for restricted access. For example, see the
>> bpf_token_allow_{cmd,map_type,prog_type}() in this series. The user
>> namespace relation was part of the use cases, but not strictly part of
>> the mechanism itself in this series.
>
> Hmm. It's very coarse grained.
>
> Also, the bpf() attach API seems to be largely (completely?) missing 
> what I would expect to be basic access controls on the things being 
> attached to.   For example, the whole cgroup_bpf_prog_attach() path 
> seems to be entirely missing any checks as to whether its caller has 
> any particular permission over the cgroup in question.  It doesn't even 
> check whether the cgroup is being accessed from the current userns 
> (i.e. whether the fd refers to a struct file with f_path.mnt belonging 
> to the current userns).  So the API in this patchset has no way to 
> restrict permission to attach to cgroups to only apply to cgroups 
> belonging to the container.
>

Forgot to mention: there's also no way to limit the functions that can be called.  While it's currently a bit of a pipe dream to do much useful work without bpf_probe_read_kernel(), it's at least conceptually possible to accomplish quite a bit without it, but there's no way to make that part of the policy.