mbox series

[v5,0/3] implement OA2_CRED_INHERIT flag for openat2()

Message ID 20240426133310.1159976-1-stsp2@yandex.ru (mailing list archive)
Headers show
Series implement OA2_CRED_INHERIT flag for openat2() | expand

Message

stsp April 26, 2024, 1:33 p.m. UTC
This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
It is needed to perform an open operation with the creds that were in
effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
flag. This allows the process to pre-open some dirs and switch eUID
(and other UIDs/GIDs) to the less-privileged user, while still retaining
the possibility to open/create files within the pre-opened directory set.

The sand-boxing is security-oriented: symlinks leading outside of a
sand-box are rejected. /proc magic links are rejected. fds opened with
O_CRED_ALLOW are always closed on exec() and cannot be passed via unix
socket.
The more detailed description (including security considerations)
is available in the log messages of individual patches.

Changes in v5:
- rename OA2_INHERIT_CRED to OA2_CRED_INHERIT
- add an "opt-in" flag O_CRED_ALLOW as was suggested by many reviewers
- stop using 64bit types, as suggested by
  Christian Brauner <brauner@kernel.org>
- add BUILD_BUG_ON() for VALID_OPENAT2_FLAGS, based on Christian Brauner's
  comments
- fixed problems reported by patch-testing bot
- made O_CRED_ALLOW fds not passable via unix sockets and exec(),
  based on Christian Brauner's comments

Changes in v4:
- add optimizations suggested by David Laight <David.Laight@ACULAB.COM>
- move security checks to build_open_flags()
- force RESOLVE_NO_MAGICLINKS as suggested by Andy Lutomirski <luto@kernel.org>

Changes in v3:
- partially revert v2 changes to avoid overriding capabilities.
  Only the bare minimum is overridden: fsuid, fsgid and group_info.
  Document the fact the full cred override is unwanted, as it may
  represent an unneeded security risk.

Changes in v2:
- capture full struct cred instead of just fsuid/fsgid.
  Suggested by Stefan Metzmacher <metze@samba.org>

CC: Stefan Metzmacher <metze@samba.org>
CC: Eric Biederman <ebiederm@xmission.com>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Andy Lutomirski <luto@kernel.org>
CC: Christian Brauner <brauner@kernel.org>
CC: Jan Kara <jack@suse.cz>
CC: Jeff Layton <jlayton@kernel.org>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Alexander Aring <alex.aring@gmail.com>
CC: David Laight <David.Laight@ACULAB.COM>
CC: linux-fsdevel@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Christian Göttsche <cgzones@googlemail.com>

Comments

Andy Lutomirski April 28, 2024, 4:41 p.m. UTC | #1
> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> It is needed to perform an open operation with the creds that were in
> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> flag. This allows the process to pre-open some dirs and switch eUID
> (and other UIDs/GIDs) to the less-privileged user, while still retaining
> the possibility to open/create files within the pre-opened directory set.
>

I’ve been contemplating this, and I want to propose a different solution.

First, the problem Stas is solving is quite narrow and doesn’t
actually need kernel support: if I want to write a user program that
sandboxes itself, I have at least three solutions already.  I can make
a userns and a mountns; I can use landlock; and I can have a separate
process that brokers filesystem access using SCM_RIGHTS.

But what if I want to run a container, where the container can access
a specific host directory, and the contained application is not aware
of the exact technology being used?  I recently started using
containers in anger in a production setting, and “anger” was
definitely the right word: binding part of a filesystem in is
*miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
actually gave up on making one of my use cases work on a Fedora
system.

Here’s what I wanted to do, logically, in production: pick a host
directory, pick a host *principal* (UID, GID, label, etc), and have
the *entire container* access the directory as that principal. This is
what happens automatically if I run the whole container as a userns
with only a single UID mapped, but I don’t really want to do that for
a whole variety and of reasons.

So maybe reimagining Stas’ feature a bit can actually solve this
problem.  Instead of a special dirfd, what if there was a special
subtree (in the sense of open_tree) that captures a set of creds and
does all opens inside the subtree using those creds?

This isn’t a fully formed proposal, but I *think* it should be
generally fairly safe for even an unprivileged user to clone a subtree
with a specific flag set to do this. Maybe a capability would be
needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
this to a daemon if a privilege is needed, and getting the API right
might be a bit tricky.

Then two different things could be done:

1. The subtree could be used unmounted or via /proc magic links. This
would be for programs that are aware of this interface.

2. The subtree could be mounted, and accessed through the mount would
use the captured creds.

(Hmm. What would a new open_tree() pointing at this special subtree do?)


With all this done, if userspace wired it up, a container user could
do something like:

—bind-capture-creds source=dest

And the contained program would access source *as the user who started
the container*, and this would just work without relabeling or
fiddling with owner uids or gids or ACLs, and it would continue to
work even if the container has multiple dynamically allocated subuids
mapped (e.g. one for “root” and one for the actual application).

Bonus points for the ability to revoke the creds in an already opened
subtree. Or even for the creds to automatically revoke themselves when
the opener exits (or maybe when a specific cred-pinning fd goes away).

(This should work for single files as well as for directories.)

New LSM hooks or extensions of existing hooks might be needed to make
LSMs comfortable with this.

What do you all think?
stsp April 28, 2024, 5:39 p.m. UTC | #2
28.04.2024 19:41, Andy Lutomirski пишет:
>> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
>> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
>> It is needed to perform an open operation with the creds that were in
>> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
>> flag. This allows the process to pre-open some dirs and switch eUID
>> (and other UIDs/GIDs) to the less-privileged user, while still retaining
>> the possibility to open/create files within the pre-opened directory set.
>>
> Then two different things could be done:
>
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
>
> 2. The subtree could be mounted, and accessed through the mount would
> use the captured creds.
Doesn't this have the same problem
that was pointed to me? Namely (explaining
my impl first), that if someone puts the cred
fd to an unaware process's fd table, such
process can't fully drop its privs. He may not
want to access these dirs, but once its hacked,
the hacker will access these dirs with the
creds came from an outside.
My solution was to close such fds on
exec and disallowing SCM_RIGHTS passage.
SCM_RIGHTS can be allowed in the future,
but the receiver will need to use some
new flag to indicate that he is willing to
get such an fd. Passage via exec() can
probably never be allowed however.

If I understand your model correctly, you
put a magic sub-tree to the fs scope of some
unaware process. He may not want to access
it, but once hacked, the hacker will access
it with the creds from an outside.
And, unlike in my impl, in yours there is
probably no way to prevent that?

In short: my impl confines the hassle within
the single process. It can be extended, and
then the receiver will need to explicitly allow
adding such fds to his fd table.
But your idea seems to inherently require
2 processes, and there is probably no way
for the second process to say "ok, I allow
such sub-tree in my fs scope". And even if
he could, in my impl he can just close the
cred fd, while in yours it seems to persist.

Sorry if I misunderstood your idea.
stsp April 28, 2024, 7:15 p.m. UTC | #3
28.04.2024 20:39, stsp пишет:
> In short: my impl confines the hassle within
> the single process. It can be extended, and
> then the receiver will need to explicitly allow
> adding such fds to his fd table.
> But your idea seems to inherently require
> 2 processes, and there is probably no way
> for the second process to say "ok, I allow
> such sub-tree in my fs scope".
There is probably 1 more detail I had
to make explicit: if my model is extended
to allow SCM_RIGHTS passage by the use
of a new flag on a receiver's side, the kernel
will be able to check that the receiver
currently has the same creds as the sender.
Please note that my model is needed for the
process that does setuid() to a less-privileged
user. He would need to receive the cred fd
_before_ such setuid, he would need to use
the special flag to receive it, and the kernel
will also need to check that he has the same
creds anyway, so no escalation can ever happen.
Only that sequence of events makes the
SCM_RIGHTS passage safe, AFAICT. But if
you don't allow SCM_RIGHTS, as my current
impl does, then you are sure the process
can raise the creds only to his own ones.

Now talking about your sub-tree idea, would
it be possible to check that the process had
initially enough creds to access it w/o inheriting
anything? That part looks important to me,
i.e. the process must not access anything
that he couldn't access before dropping privs.
But I am not sure if your idea even includes
the priv drop.
Andy Lutomirski April 28, 2024, 8:19 p.m. UTC | #4
> On Apr 28, 2024, at 10:39 AM, stsp <stsp2@yandex.ru> wrote:
>
> 28.04.2024 19:41, Andy Lutomirski пишет:
>>>> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
>>> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
>>> It is needed to perform an open operation with the creds that were in
>>> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
>>> flag. This allows the process to pre-open some dirs and switch eUID
>>> (and other UIDs/GIDs) to the less-privileged user, while still retaining
>>> the possibility to open/create files within the pre-opened directory set.
>>>
>> Then two different things could be done:
>>
>> 1. The subtree could be used unmounted or via /proc magic links. This
>> would be for programs that are aware of this interface.
>>
>> 2. The subtree could be mounted, and accessed through the mount would
>> use the captured creds.
> Doesn't this have the same problem
> that was pointed to me? Namely (explaining
> my impl first), that if someone puts the cred
> fd to an unaware process's fd table, such
> process can't fully drop its privs. He may not
> want to access these dirs, but once its hacked,
> the hacker will access these dirs with the
> creds came from an outside.

This is not a real problem. If I have a writable fd for /etc/shadow or
an fd for /dev/mem, etc, then I need close them to fully drop privs.

The problem is that, in current kernels, directory fds don’t allow
access using their f_cred, and changing that means that existing
software that does not intend to create a capability will start to do
so.

> My solution was to close such fds on
> exec and disallowing SCM_RIGHTS passage.

I don't see what problem this solves.

> SCM_RIGHTS can be allowed in the future,
> but the receiver will need to use some
> new flag to indicate that he is willing to
> get such an fd. Passage via exec() can
> probably never be allowed however.
>
> If I understand your model correctly, you
> put a magic sub-tree to the fs scope of some
> unaware process.

Only if I want that process to have access!

> He may not want to access
> it, but once hacked, the hacker will access
> it with the creds from an outside.
> And, unlike in my impl, in yours there is
> probably no way to prevent that?

This is fundamental to the whole model. If I stick a FAT formatted USB
drive in the system and mount it, then any process that can find its
way to the mountpoint can write to it.  And if I open a dirfd, any
process with that dirfd can write it.  This is old news and isn't a
problem.


>
> In short: my impl confines the hassle within
> the single process. It can be extended, and
> then the receiver will need to explicitly allow
> adding such fds to his fd table.
> But your idea seems to inherently require
> 2 processes, and there is probably no way
> for the second process to say "ok, I allow
> such sub-tree in my fs scope".

A process could use my proposal to open_tree directory, make a whole
new mountns that is otherwise empty, switch to the mountns, mount the
directory, then change its UID and drop all privs, and still have
access to that one directory.

But even right now, a process could unshare userns and mountns, map
its uid as some nonzero number, rearrange its mountns so it only has
access to that one directory, drop all privs, and seccomp itself, and
only have access to that directory, as it still has the same UID.
Take your pick.
stsp April 28, 2024, 9:14 p.m. UTC | #5
28.04.2024 23:19, Andy Lutomirski пишет:
>> Doesn't this have the same problem
>> that was pointed to me? Namely (explaining
>> my impl first), that if someone puts the cred
>> fd to an unaware process's fd table, such
>> process can't fully drop its privs. He may not
>> want to access these dirs, but once its hacked,
>> the hacker will access these dirs with the
>> creds came from an outside.
> This is not a real problem. If I have a writable fd for /etc/shadow or
> an fd for /dev/mem, etc, then I need close them to fully drop privs.

But isn't that becoming a problem once
you are (maliciously) passed such fds via
exec() or SCM_RIGHTS? You may not know
about them (or about their creds), so you
won't close them. Or?

> The problem is that, in current kernels, directory fds don’t allow
> access using their f_cred, and changing that means that existing
> software that does not intend to create a capability will start to do
> so.

Which is what I am trying to do. :)

>> My solution was to close such fds on
>> exec and disallowing SCM_RIGHTS passage.
> I don't see what problem this solves.

That the process that received them,
doesn't know they have O_CRED_ALLOW
within. So it won't deduce to close them
in time.
Again, this is not the only possible solution.
The receiver can indicate its will to receive
them anyway, and the kernel can check if
such transaction is safe. But it was simpler
to just disallow, who needs to pass those?

>> SCM_RIGHTS can be allowed in the future,
>> but the receiver will need to use some
>> new flag to indicate that he is willing to
>> get such an fd. Passage via exec() can
>> probably never be allowed however.
>>
>> If I understand your model correctly, you
>> put a magic sub-tree to the fs scope of some
>> unaware process.
> Only if I want that process to have access!

Who is "I" in that context?
Can it be some malicious entity?

>> He may not want to access
>> it, but once hacked, the hacker will access
>> it with the creds from an outside.
>> And, unlike in my impl, in yours there is
>> probably no way to prevent that?
> This is fundamental to the whole model. If I stick a FAT formatted USB
> drive in the system and mount it, then any process that can find its
> way to the mountpoint can write to it.  And if I open a dirfd, any
> process with that dirfd can write it.  This is old news and isn't a
> problem.

But IIRC O_DIRECTORY only allows O_RDONLY.
I even re-checked now, and O_DIRECTORY|O_RDWR
gives EISDIR. So is it actually true that
whoever has dir_fd, can write to it?

>> In short: my impl confines the hassle within
>> the single process. It can be extended, and
>> then the receiver will need to explicitly allow
>> adding such fds to his fd table.
>> But your idea seems to inherently require
>> 2 processes, and there is probably no way
>> for the second process to say "ok, I allow
>> such sub-tree in my fs scope".
> A process could use my proposal to open_tree directory, make a whole
> new mountns that is otherwise empty, switch to the mountns, mount the
> directory, then change its UID and drop all privs, and still have
> access to that one directory.

Ok, if that requires actions that can't
be done after priv drop, then it can
indeed fully drop privs w/o mounting
anything, if he doesn't want such access.
Then the only security-related difference
with my approach is that mine guarantees
nothing new can be accessed, no matter
who passes what. Currently nothing can
be passed at all, but if it can - my approach
would still guarantee only the same creds
can be passed, not a new ones.
Can the same restriction be applied in
your case?
Andy Lutomirski April 28, 2024, 9:30 p.m. UTC | #6
On Sun, Apr 28, 2024 at 2:15 PM stsp <stsp2@yandex.ru> wrote:
>
> 28.04.2024 23:19, Andy Lutomirski пишет:
> >> Doesn't this have the same problem
> >> that was pointed to me? Namely (explaining
> >> my impl first), that if someone puts the cred
> >> fd to an unaware process's fd table, such
> >> process can't fully drop its privs. He may not
> >> want to access these dirs, but once its hacked,
> >> the hacker will access these dirs with the
> >> creds came from an outside.
> > This is not a real problem. If I have a writable fd for /etc/shadow or
> > an fd for /dev/mem, etc, then I need close them to fully drop privs.
>
> But isn't that becoming a problem once
> you are (maliciously) passed such fds via
> exec() or SCM_RIGHTS? You may not know
> about them (or about their creds), so you
> won't close them. Or?

Wait, who's the malicious party?  Anyone who can open a directory has,
at the time they do so, permission to do so.  If you send that fd to
someone via SCM_RIGHTS, all you accomplish is that they now have the
fd.

In my scenario, the malicious party attacks an *existing* program that
opens an fd for purposes that it doesn't think are dangerous.  And
then it gives the fd *to the malicious program* by whatever means
(could be as simple as dropping privs then doing dlopen).  Then the
malicious program does OA2_INHERIT_CREDS and gets privileges it
shouldn't have.

But if the *whole point* of opening the fd was to capture privileges
and preserve them across a privilege drop, and the program loads
malicious code after dropping privs, then that's a risk that's taken
intentionally.  This is like how, if you do curl
http://whatever.com/foo.sh | bash, you are granting all kinds of
permissions to unknown code.

> >> My solution was to close such fds on
> >> exec and disallowing SCM_RIGHTS passage.
> > I don't see what problem this solves.
>
> That the process that received them,
> doesn't know they have O_CRED_ALLOW
> within. So it won't deduce to close them
> in time.

Hold on -- what exactly are you talking about?  A process does
recvmsg() and doesn't trust the party at the other end.  Then it
doesn't close the received fd.  Then it does setuid(getuid()).  Then
it does dlopen or exec of an untrusted program.

Okay, so the program now has a completely unknown fd.  This is already
part of the thread model.  It could be a cred-capturing fd, it could
be a device node, it could be a socket, it could be a memfd -- it
could be just about anything.  How do any of your proposals or my
proposals cause an actual new problem here?

> > This is fundamental to the whole model. If I stick a FAT formatted USB
> > drive in the system and mount it, then any process that can find its
> > way to the mountpoint can write to it.  And if I open a dirfd, any
> > process with that dirfd can write it.  This is old news and isn't a
> > problem.
>
> But IIRC O_DIRECTORY only allows O_RDONLY.
> I even re-checked now, and O_DIRECTORY|O_RDWR
> gives EISDIR. So is it actually true that
> whoever has dir_fd, can write to it?

If the filesystem grants that UID permission to write, then it can write.
stsp April 28, 2024, 10:12 p.m. UTC | #7
29.04.2024 00:30, Andy Lutomirski пишет:
> On Sun, Apr 28, 2024 at 2:15 PM stsp <stsp2@yandex.ru> wrote:
>> But isn't that becoming a problem once
>> you are (maliciously) passed such fds via
>> exec() or SCM_RIGHTS? You may not know
>> about them (or about their creds), so you
>> won't close them. Or?
> Wait, who's the malicious party?

Someone who opens an fd with O_CRED_ALLOW
and passes it to an unsuspecting process. This
is at least how I understood the Christian Brauner's
point about "unsuspecting userspace".


>    Anyone who can open a directory has,
> at the time they do so, permission to do so.  If you send that fd to
> someone via SCM_RIGHTS, all you accomplish is that they now have the
> fd.

Normally yes.
But fd with O_CRED_ALLOW prevents the
receiver from fully dropping his privs, even
if he doesn't want to deal with it.

> In my scenario, the malicious party attacks an *existing* program that
> opens an fd for purposes that it doesn't think are dangerous.  And
> then it gives the fd *to the malicious program* by whatever means
> (could be as simple as dropping privs then doing dlopen).  Then the
> malicious program does OA2_INHERIT_CREDS and gets privileges it
> shouldn't have.

But what about an inverse scenario?
Malicious program passes an fd to the
"unaware" program, putting it under a
risk. That program probably never cared
about security, since it doesn't play with
privs. But suddenly it has privs, passed
out of nowhere (via exec() for example),
and someone who hacks it, takes them.

>>>> My solution was to close such fds on
>>>> exec and disallowing SCM_RIGHTS passage.
>>> I don't see what problem this solves.
>> That the process that received them,
>> doesn't know they have O_CRED_ALLOW
>> within. So it won't deduce to close them
>> in time.
> Hold on -- what exactly are you talking about?  A process does
> recvmsg() and doesn't trust the party at the other end.  Then it
> doesn't close the received fd.  Then it does setuid(getuid()).  Then
> it does dlopen or exec of an untrusted program.
>
> Okay, so the program now has a completely unknown fd.  This is already
> part of the thread model.  It could be a cred-capturing fd, it could
> be a device node, it could be a socket, it could be a memfd -- it
> could be just about anything.  How do any of your proposals or my
> proposals cause an actual new problem here?

I am not actually sure how widely
does this spread. I.e. /dev/mem is
restricted these days, but if you can
freely pass device nodes around, then
perhaps the ability to pass an r/o dir fd
that can suddenly give creds, is probably
not something new...
But I really don't like to add to this
particular set of cases. I don't think
its safe, I just think its legacy, so while
it is done that way currently, doesn't
mean I can do the same thing and
call it "secure" just because something
like this was already possible.
Or is this actually completely safe?
Does it hurt to have O_CRED_ALLOW
non-passable?

>>> This is fundamental to the whole model. If I stick a FAT formatted USB
>>> drive in the system and mount it, then any process that can find its
>>> way to the mountpoint can write to it.  And if I open a dirfd, any
>>> process with that dirfd can write it.  This is old news and isn't a
>>> problem.
>> But IIRC O_DIRECTORY only allows O_RDONLY.
>> I even re-checked now, and O_DIRECTORY|O_RDWR
>> gives EISDIR. So is it actually true that
>> whoever has dir_fd, can write to it?
> If the filesystem grants that UID permission to write, then it can write.

Which to me sounds like owning an
O_DIRECTORY fd only gives you the
ability to skip the permission checks
of the outer path components, but not
the inner ones. So passing it w/o O_CRED_ALLOW
was quite safe and didn't give you any
new abilities.
stsp April 29, 2024, 1:12 a.m. UTC | #8
29.04.2024 01:12, stsp пишет:
> freely pass device nodes around, then
> perhaps the ability to pass an r/o dir fd
> that can suddenly give creds, is probably
> not something new...
Actually there is probably something
new anyway. The process knows to close
all sensitive files before dropping privs.
But this may not be the case with dirs,
because dir_fd pretty much invalidates
itself when you drop privs: you'll get
EPERM from openat() if you no longer
have rights to open the file, and dir_fd
doesn't help.
Which is why someone may neglect
to close dir_fd before dropping privs,
but with O_CRED_ALLOW that would
be a mistake.

Or what about this scenario: receiver
gets this fd thinking its some useful and
harmless file fd that would be needed
even after priv drop. So he makes sure
with F_GETFL that this fd is O_RDONLY
and doesn't close it. But it appears to be
malicious O_CRED_ALLOW dir_fd, which
basically exposes many files for a write.

So I am concerned about the cases where
an fd was not closed before a priv drop,
because the process didn't realize the threat.

> But if the*whole point*  of opening the fd was to capture privileges
> and preserve them across a privilege drop, and the program loads
> malicious code after dropping privs, then that's a risk that's taken
> intentionally.
If you opened an fd yourself, then yes.
You know what have you opened, and
you care to close sensitive fds before
dropping privs, or you take the risk.
But if you requested something from
another process and got some fd as
the result, the kernel doesn't guarantee
you got an fd to what you have requested.
You can get a malicious fd, which is not
"so" possible when you open an fd yourself.
So if you want to keep such an fd open,
you will probably at least make sure its
read-only, its not a device node (with fstat)
and so on.
All checks pass, and oops, O_CRED_ALLOW...

So why my patch makes O_CRED_ALLOW
non-passable? Because the receiver has
no way to see the potential harm within
such fd. So if he intended to keep an fd
open after some basic safety checks, he
will be tricked.
So while I think its a rather bad idea to
leave the received fds open after priv drop,
I don't completely exclude the possibility
that someone did that, together with a few
safety checks which would be tricked by
O_CRED_ALLOW.
Christian Brauner April 29, 2024, 9:12 a.m. UTC | #9
On Sun, Apr 28, 2024 at 09:41:20AM -0700, Andy Lutomirski wrote:
> > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > It is needed to perform an open operation with the creds that were in
> > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > flag. This allows the process to pre-open some dirs and switch eUID
> > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > the possibility to open/create files within the pre-opened directory set.
> >
> 
> I’ve been contemplating this, and I want to propose a different solution.
> 
> First, the problem Stas is solving is quite narrow and doesn’t
> actually need kernel support: if I want to write a user program that
> sandboxes itself, I have at least three solutions already.  I can make
> a userns and a mountns; I can use landlock; and I can have a separate
> process that brokers filesystem access using SCM_RIGHTS.
> 
> But what if I want to run a container, where the container can access
> a specific host directory, and the contained application is not aware
> of the exact technology being used?  I recently started using
> containers in anger in a production setting, and “anger” was
> definitely the right word: binding part of a filesystem in is
> *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.

Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
move_mount(). I rewrote the bind-mount logic in systemd based on that
and util-linux uses that as well now.
https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html

> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> actually gave up on making one of my use cases work on a Fedora
> system.
> 
> Here’s what I wanted to do, logically, in production: pick a host
> directory, pick a host *principal* (UID, GID, label, etc), and have
> the *entire container* access the directory as that principal. This is
> what happens automatically if I run the whole container as a userns
> with only a single UID mapped, but I don’t really want to do that for
> a whole variety and of reasons.

You're describing idmapped mounts for the most part which are upstream
and are used in exactly that way by a lot of userspace.

> 
> So maybe reimagining Stas’ feature a bit can actually solve this
> problem.  Instead of a special dirfd, what if there was a special
> subtree (in the sense of open_tree) that captures a set of creds and
> does all opens inside the subtree using those creds?

That would mean override creds in the VFS layer when accessing a
specific subtree which is a terrible idea imho. Not just because it will
quickly become a potential dos when you do that with a lot of subtrees
it will also have complex interactions with overlayfs.

> 
> This isn’t a fully formed proposal, but I *think* it should be
> generally fairly safe for even an unprivileged user to clone a subtree
> with a specific flag set to do this. Maybe a capability would be
> needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> this to a daemon if a privilege is needed, and getting the API right
> might be a bit tricky.
> 
> Then two different things could be done:
> 
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
> 
> 2. The subtree could be mounted, and accessed through the mount would
> use the captured creds.
> 
> (Hmm. What would a new open_tree() pointing at this special subtree do?)
> 
> 
> With all this done, if userspace wired it up, a container user could
> do something like:
> 
> —bind-capture-creds source=dest
> 
> And the contained program would access source *as the user who started
> the container*, and this would just work without relabeling or
> fiddling with owner uids or gids or ACLs, and it would continue to
> work even if the container has multiple dynamically allocated subuids
> mapped (e.g. one for “root” and one for the actual application).
> 
> Bonus points for the ability to revoke the creds in an already opened
> subtree. Or even for the creds to automatically revoke themselves when
> the opener exits (or maybe when a specific cred-pinning fd goes away).
> 
> (This should work for single files as well as for directories.)
> 
> New LSM hooks or extensions of existing hooks might be needed to make
> LSMs comfortable with this.
> 
> What do you all think?

I think the problem you're describing is already mostly solved.
Aleksa Sarai May 6, 2024, 7:13 a.m. UTC | #10
On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote:
> > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > It is needed to perform an open operation with the creds that were in
> > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > flag. This allows the process to pre-open some dirs and switch eUID
> > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > the possibility to open/create files within the pre-opened directory set.
> >
> 
> I’ve been contemplating this, and I want to propose a different solution.
> 
> First, the problem Stas is solving is quite narrow and doesn’t
> actually need kernel support: if I want to write a user program that
> sandboxes itself, I have at least three solutions already.  I can make
> a userns and a mountns; I can use landlock; and I can have a separate
> process that brokers filesystem access using SCM_RIGHTS.
> 
> But what if I want to run a container, where the container can access
> a specific host directory, and the contained application is not aware
> of the exact technology being used?  I recently started using
> containers in anger in a production setting, and “anger” was
> definitely the right word: binding part of a filesystem in is
> *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> actually gave up on making one of my use cases work on a Fedora
> system.
> 
> Here’s what I wanted to do, logically, in production: pick a host
> directory, pick a host *principal* (UID, GID, label, etc), and have
> the *entire container* access the directory as that principal. This is
> what happens automatically if I run the whole container as a userns
> with only a single UID mapped, but I don’t really want to do that for
> a whole variety and of reasons.
> 
> So maybe reimagining Stas’ feature a bit can actually solve this
> problem.  Instead of a special dirfd, what if there was a special
> subtree (in the sense of open_tree) that captures a set of creds and
> does all opens inside the subtree using those creds?
> 
> This isn’t a fully formed proposal, but I *think* it should be
> generally fairly safe for even an unprivileged user to clone a subtree
> with a specific flag set to do this. Maybe a capability would be
> needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> this to a daemon if a privilege is needed, and getting the API right
> might be a bit tricky.

Tying this to an actual mount rather than a file handle sounds like a
more plausible proposal than OA2_CRED_INHERIT, but it just seems that
this is going to re-create all of the work that went into id-mapped
mounts but with the extra-special step of making the generic VFS
permissions no longer work normally (unless the idea is that everything
would pretend to be owned by current_fsuid()?).

IMHO it also isn't enough to just make open work, you need to make all
operations work (which leads to a non-trivial amount of
filesystem-specific handling), which is just idmapped mounts. A lot of
work was put into making sure that is safe, and collapsing owners seems
like it will cause a lot of headaches.

I also find it somewhat amusing that this proposal is to basically give
up on multi-user permissions for this one directory tree because it's
too annoying to deal with. In that case, isn't chmod 777 a simpler
solution? (I'm being a bit flippant, of course there is a difference,
but the net result is that all users in the container would have the
same permissions with all of the fun issues that implies.)

In short, AFAICS idmapped mounts pretty much solve this problem (minus
the ability to collapse users, which I suspect is not a good idea in
general)?

> Then two different things could be done:
> 
> 1. The subtree could be used unmounted or via /proc magic links. This
> would be for programs that are aware of this interface.
> 
> 2. The subtree could be mounted, and accessed through the mount would
> use the captured creds.
> 
> (Hmm. What would a new open_tree() pointing at this special subtree do?)
> 
> 
> With all this done, if userspace wired it up, a container user could
> do something like:
> 
> —bind-capture-creds source=dest
> 
> And the contained program would access source *as the user who started
> the container*, and this would just work without relabeling or
> fiddling with owner uids or gids or ACLs, and it would continue to
> work even if the container has multiple dynamically allocated subuids
> mapped (e.g. one for “root” and one for the actual application).
> 
> Bonus points for the ability to revoke the creds in an already opened
> subtree. Or even for the creds to automatically revoke themselves when
> the opener exits (or maybe when a specific cred-pinning fd goes away).
> 
> (This should work for single files as well as for directories.)
> 
> New LSM hooks or extensions of existing hooks might be needed to make
> LSMs comfortable with this.
> 
> What do you all think?
Andy Lutomirski May 6, 2024, 5:29 p.m. UTC | #11
Replying to a couple emails at once...

On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote:
> > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote:
> > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall.
> > > It is needed to perform an open operation with the creds that were in
> > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW
> > > flag. This allows the process to pre-open some dirs and switch eUID
> > > (and other UIDs/GIDs) to the less-privileged user, while still retaining
> > > the possibility to open/create files within the pre-opened directory set.
> > >
> >
> > I’ve been contemplating this, and I want to propose a different solution.
> >
> > First, the problem Stas is solving is quite narrow and doesn’t
> > actually need kernel support: if I want to write a user program that
> > sandboxes itself, I have at least three solutions already.  I can make
> > a userns and a mountns; I can use landlock; and I can have a separate
> > process that brokers filesystem access using SCM_RIGHTS.
> >
> > But what if I want to run a container, where the container can access
> > a specific host directory, and the contained application is not aware
> > of the exact technology being used?  I recently started using
> > containers in anger in a production setting, and “anger” was
> > definitely the right word: binding part of a filesystem in is
> > *miserable*.  Getting the DAC rules right is nasty.  LSMs are worse.
> > Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
> > actually gave up on making one of my use cases work on a Fedora
> > system.
> >
> > Here’s what I wanted to do, logically, in production: pick a host
> > directory, pick a host *principal* (UID, GID, label, etc), and have
> > the *entire container* access the directory as that principal. This is
> > what happens automatically if I run the whole container as a userns
> > with only a single UID mapped, but I don’t really want to do that for
> > a whole variety and of reasons.
> >
> > So maybe reimagining Stas’ feature a bit can actually solve this
> > problem.  Instead of a special dirfd, what if there was a special
> > subtree (in the sense of open_tree) that captures a set of creds and
> > does all opens inside the subtree using those creds?
> >
> > This isn’t a fully formed proposal, but I *think* it should be
> > generally fairly safe for even an unprivileged user to clone a subtree
> > with a specific flag set to do this. Maybe a capability would be
> > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating
> > this to a daemon if a privilege is needed, and getting the API right
> > might be a bit tricky.
>
> Tying this to an actual mount rather than a file handle sounds like a
> more plausible proposal than OA2_CRED_INHERIT, but it just seems that
> this is going to re-create all of the work that went into id-mapped
> mounts but with the extra-special step of making the generic VFS
> permissions no longer work normally (unless the idea is that everything
> would pretend to be owned by current_fsuid()?).

I was assuming that the owner uid and gid would be show to stat, etc
as usual.  But the permission checks would be done against the
captured creds.

>
> IMHO it also isn't enough to just make open work, you need to make all
> operations work (which leads to a non-trivial amount of
> filesystem-specific handling), which is just idmapped mounts. A lot of
> work was put into making sure that is safe, and collapsing owners seems
> like it will cause a lot of headaches.
>
> I also find it somewhat amusing that this proposal is to basically give
> up on multi-user permissions for this one directory tree because it's
> too annoying to deal with. In that case, isn't chmod 777 a simpler
> solution? (I'm being a bit flippant, of course there is a difference,
> but the net result is that all users in the container would have the
> same permissions with all of the fun issues that implies.)
>
> In short, AFAICS idmapped mounts pretty much solve this problem (minus
> the ability to collapse users, which I suspect is not a good idea in
> general)?
>

With my kernel hat on, maybe I agree.  But with my *user* hat on, I
think I pretty strongly disagree.  Look, idmapis lousy for
unprivileged use:

$ install -m 0700 -d test_directory
$ echo 'hi there' >test_directory/file
$ podman run -it --rm
--mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
# cat /tmp/file
hi there

<-- Hey, look, this kind of works!

# setpriv --reuid=1 ls /tmp
ls: cannot open directory '/tmp': Permission denied

<-- Gee, thanks, Linux!


Obviously this is a made up example.  But it's quite analogous to a
real example.  Suppose I want to make a directory that will contain
some MySQL data.  I don't want to share this directory with anyone
else, so I set its mode to 0700.  Then I want to fire up an
unprivileged MySQL container, so I build or download it, and then I
run it and bind my directory to /var/lib/mysql and I run it.  I don't
need to think about UIDs or anything because it's 2024 and containers
just work.  Okay, I need to setenforce 0 because I'm on Fedora and
SELinux makes absolutely no sense in a container world, but I can live
with that.

Except that it doesn't work!  Because unless I want to manually futz
with the idmaps to get mysql to have access to the directory inside
the container, only *root* gets to get in.  But I bet that even
futzing with the idmap doesn't work, because software like mysql often
expects that root *and* a user can access data.  And some software
even does privilege separation and uses more than one UID.

So I want a way to give *an entire container* access to a directory.
Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
could learn a way to squash multiple ids down to one.  Or maybe
something like my silly credential-capturing mount proposal could
work.  But the status quo is not actually amazing IMO.

I haven't looked at the idmap implementation nearly enough to have any
opinion as to whether squashing UID is practical or whether there's
any sensible way to specify it in the configuration.

> On Apr 29, 2024, at 2:12 AM, Christian Brauner <brauner@kernel.org> wrote:
>
> Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and
> move_mount(). I rewrote the bind-mount logic in systemd based on that
> and util-linux uses that as well now.
> https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html
>

Yep, I remember that.

>> Podman’s “bind,relabel” feature is IMO utterly disgusting.  I think I
>> actually gave up on making one of my use cases work on a Fedora
>> system.
>>
>> Here’s what I wanted to do, logically, in production: pick a host
>> directory, pick a host *principal* (UID, GID, label, etc), and have
>> the *entire container* access the directory as that principal. This is
>> what happens automatically if I run the whole container as a userns
>> with only a single UID mapped, but I don’t really want to do that for
>> a whole variety and of reasons.
>
> You're describing idmapped mounts for the most part which are upstream
> and are used in exactly that way by a lot of userspace.
>

See above...

>>
>> So maybe reimagining Stas’ feature a bit can actually solve this
>> problem.  Instead of a special dirfd, what if there was a special
>> subtree (in the sense of open_tree) that captures a set of creds and
>> does all opens inside the subtree using those creds?
>
> That would mean override creds in the VFS layer when accessing a
> specific subtree which is a terrible idea imho. Not just because it will
> quickly become a potential dos when you do that with a lot of subtrees
> it will also have complex interactions with overlayfs.

I was deliberately talking about semantics, not implementation. This
may well be impossible to implement straightforwardly.
Andy Lutomirski May 6, 2024, 5:34 p.m. UTC | #12
On Mon, May 6, 2024 at 10:29 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> Replying to a couple emails at once...
>
> On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote:

> > I also find it somewhat amusing that this proposal is to basically give
> > up on multi-user permissions for this one directory tree because it's
> > too annoying to deal with. In that case, isn't chmod 777 a simpler
> > solution? (I'm being a bit flippant, of course there is a difference,
> > but the net result is that all users in the container would have the
> > same permissions with all of the fun issues that implies.)
> >
> > In short, AFAICS idmapped mounts pretty much solve this problem (minus
> > the ability to collapse users, which I suspect is not a good idea in
> > general)?
> >
>
> With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> think I pretty strongly disagree.  Look, idmapis lousy for
> unprivileged use:
>
> $ install -m 0700 -d test_directory
> $ echo 'hi there' >test_directory/file
> $ podman run -it --rm
> --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> # cat /tmp/file
> hi there
>
> <-- Hey, look, this kind of works!
>
> # setpriv --reuid=1 ls /tmp
> ls: cannot open directory '/tmp': Permission denied
>
> <-- Gee, thanks, Linux!

I should add: this is lousy even for privileged use.  On a normal
non-containerized system:

$ ls -ld /var/lib/mysql
drwxr-xr-x. 3 mysql mysql 4096 Sep 20  2023 /var/lib/mysql

This makes perfect sense.

But if I want to run mysql in a container in a sane way, my only real
choice is either to trust the container manager quite strongly (so it
only maps this directory into the correct container) or, if I want to
separate out management of this container into its own UID (which is a
good practice), then I'm forced to do some kind of fragile hack like
making a directory only accessible to the correct UID and then
creating a 0777 directory inside it and bind-mounting *that*.
David Laight May 6, 2024, 7:34 p.m. UTC | #13
...
> So I want a way to give *an entire container* access to a directory.
> Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
> could learn a way to squash multiple ids down to one.  Or maybe
> something like my silly credential-capturing mount proposal could
> work.  But the status quo is not actually amazing IMO.

Isn't that what gids are for :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Andy Lutomirski May 6, 2024, 9:53 p.m. UTC | #14
On Mon, May 6, 2024 at 12:35 PM David Laight <David.Laight@aculab.com> wrote:
>
> ...
> > So I want a way to give *an entire container* access to a directory.
> > Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
> > could learn a way to squash multiple ids down to one.  Or maybe
> > something like my silly credential-capturing mount proposal could
> > work.  But the status quo is not actually amazing IMO.
>
> Isn't that what gids are for :-)

I dunno.  How, exactly, is a regular non-root user of a Linux computer
supposed to configure gids in their home directory so that a container
(which uses subgids, possibly dynamically allocated) gets access to
the correct thing?  And why should that poor user need to think about
this at all?

--Andy
Christian Brauner May 7, 2024, 7:42 a.m. UTC | #15
> With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> think I pretty strongly disagree.  Look, idmapis lousy for
> unprivileged use:
> 
> $ install -m 0700 -d test_directory
> $ echo 'hi there' >test_directory/file
> $ podman run -it --rm
> --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]

$ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]

as an unprivileged user doesn't use idmapped mounts at all. So I'm not
sure what this is showing. I suppose you're talking about idmaps in
general.

> # cat /tmp/file
> hi there
> 
> <-- Hey, look, this kind of works!
> 
> # setpriv --reuid=1 ls /tmp
> ls: cannot open directory '/tmp': Permission denied
> 
> <-- Gee, thanks, Linux!
> 
> 
> Obviously this is a made up example.  But it's quite analogous to a
> real example.  Suppose I want to make a directory that will contain
> some MySQL data.  I don't want to share this directory with anyone
> else, so I set its mode to 0700.  Then I want to fire up an
> unprivileged MySQL container, so I build or download it, and then I
> run it and bind my directory to /var/lib/mysql and I run it.  I don't
> need to think about UIDs or anything because it's 2024 and containers
> just work.  Okay, I need to setenforce 0 because I'm on Fedora and
> SELinux makes absolutely no sense in a container world, but I can live
> with that.
> 
> Except that it doesn't work!  Because unless I want to manually futz
> with the idmaps to get mysql to have access to the directory inside
> the container, only *root* gets to get in.  But I bet that even
> futzing with the idmap doesn't work, because software like mysql often
> expects that root *and* a user can access data.  And some software
> even does privilege separation and uses more than one UID.

If the directory is 700 and it's owned by say root:root on the host and
you want to share that with arbitrary container users then this isn't
something you can do today (ignoring group permissions and ACLs for the
sake of your argument) even on the host so that's not a limitation of
userns or idmapped mounts. That means many to one mappings of uids/gids.

> So I want a way to give *an entire container* access to a directory.
> Classic UNIX DAC is just *wrong* for this use case.  Maybe idmaps
> could learn a way to squash multiple ids down to one.  Or maybe

Many idmappings to one is in principle possible and I've noted that idea
down as a possible extension at
https://github.com/uapi-group/kernel-features quite a while (2 years?) ago.

> I haven't looked at the idmap implementation nearly enough to have any
> opinion as to whether squashing UID is practical or whether there's

It's doable. The interesting bit to me was that if we want to allow
writes we'd need a way to determine what the uid/gid would be to write
down. Imho, that's not super difficult to solve though. The most obvious
one is that userspace can just determine it when creating the idmapped
mount.
Andy Lutomirski May 7, 2024, 8:38 p.m. UTC | #16
On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> > think I pretty strongly disagree.  Look, idmapis lousy for
> > unprivileged use:
> >
> > $ install -m 0700 -d test_directory
> > $ echo 'hi there' >test_directory/file
> > $ podman run -it --rm
> > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
>
> $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
>
> as an unprivileged user doesn't use idmapped mounts at all. So I'm not
> sure what this is showing. I suppose you're talking about idmaps in
> general.

Meh, fair enough.  But I don't think this would have worked any better
with privilege.

Can idmaps be programmed by an otherwise unprivileged owner of a
userns and a mountns inside?

> Many idmappings to one is in principle possible and I've noted that idea
> down as a possible extension at
> https://github.com/uapi-group/kernel-features quite a while (2 years?) ago.
>
> > I haven't looked at the idmap implementation nearly enough to have any
> > opinion as to whether squashing UID is practical or whether there's
>
> It's doable. The interesting bit to me was that if we want to allow
> writes we'd need a way to determine what the uid/gid would be to write
> down. Imho, that's not super difficult to solve though. The most obvious
> one is that userspace can just determine it when creating the idmapped
> mount.

Seems reasonable to me.  If this is set up by someone unprivileged, it
would need to be that uid/gid.
Christian Brauner May 8, 2024, 7:32 a.m. UTC | #17
On Tue, May 07, 2024 at 01:38:42PM -0700, Andy Lutomirski wrote:
> On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > > With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> > > think I pretty strongly disagree.  Look, idmapis lousy for
> > > unprivileged use:
> > >
> > > $ install -m 0700 -d test_directory
> > > $ echo 'hi there' >test_directory/file
> > > $ podman run -it --rm
> > > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> >
> > $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> >
> > as an unprivileged user doesn't use idmapped mounts at all. So I'm not
> > sure what this is showing. I suppose you're talking about idmaps in
> > general.
> 
> Meh, fair enough.  But I don't think this would have worked any better
> with privilege.
> 
> Can idmaps be programmed by an otherwise unprivileged owner of a
> userns and a mountns inside?

Yes, but only for userns mountable filesystems that support idmapped
mounts. IOW, you need privilege over the superblock and the idmapping
you're trying to use.

> 
> > Many idmappings to one is in principle possible and I've noted that idea
> > down as a possible extension at
> > https://github.com/uapi-group/kernel-features quite a while (2 years?) ago.
> >
> > > I haven't looked at the idmap implementation nearly enough to have any
> > > opinion as to whether squashing UID is practical or whether there's
> >
> > It's doable. The interesting bit to me was that if we want to allow
> > writes we'd need a way to determine what the uid/gid would be to write
> > down. Imho, that's not super difficult to solve though. The most obvious
> > one is that userspace can just determine it when creating the idmapped
> > mount.
> 
> Seems reasonable to me.  If this is set up by someone unprivileged, it
> would need to be that uid/gid.
Andy Lutomirski May 8, 2024, 5:30 p.m. UTC | #18
On Wed, May 8, 2024 at 12:32 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Tue, May 07, 2024 at 01:38:42PM -0700, Andy Lutomirski wrote:
> > On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > > With my kernel hat on, maybe I agree.  But with my *user* hat on, I
> > > > think I pretty strongly disagree.  Look, idmapis lousy for
> > > > unprivileged use:
> > > >
> > > > $ install -m 0700 -d test_directory
> > > > $ echo 'hi there' >test_directory/file
> > > > $ podman run -it --rm
> > > > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> > >
> > > $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim]
> > >
> > > as an unprivileged user doesn't use idmapped mounts at all. So I'm not
> > > sure what this is showing. I suppose you're talking about idmaps in
> > > general.
> >
> > Meh, fair enough.  But I don't think this would have worked any better
> > with privilege.
> >
> > Can idmaps be programmed by an otherwise unprivileged owner of a
> > userns and a mountns inside?
>
> Yes, but only for userns mountable filesystems that support idmapped
> mounts. IOW, you need privilege over the superblock and the idmapping
> you're trying to use.

Hmm.  Is there a good reason to require privilege over the superblock?
 Obviously creating an idmap that allows one to impersonate someone
else seems like a problem, but if an unprivileged task already "owns"
(see below) a UID or GID, then effectively delegating that UID or GID
is would need caution but is not fundamentally terrible.

So, if I'm 1000:1000, then creating an idmap that makes some other
task (that isn't 1000:1000) get to act as 1000:1000 doesn't grant new
powers.  But maybe something even more general could be done (although
I'm not sure this is worthwhile): if I own a userns and that userns
has an outside UID 1001 mapped (via newuidmap, for example), then
perhaps letting me configure an idmap that grants UID 1001 seems not
especially dangerous.  But maybe that particular job should also be
delegated to newuidmap.

Out of an abundance of caution, maybe this whole thing should be
opt-in.  For example, there could be a new CAP_DELEGATE that allows
delegation of one's own uid and gid.  The idea is that it should be
safe to grant regular users CAP_DELEGATE as an ambient capability.

--Andy