Message ID | 20240426133310.1159976-1-stsp2@yandex.ru (mailing list archive) |
---|---|
Headers | show |
Series | implement OA2_CRED_INHERIT flag for openat2() | expand |
> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote: > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall. > It is needed to perform an open operation with the creds that were in > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW > flag. This allows the process to pre-open some dirs and switch eUID > (and other UIDs/GIDs) to the less-privileged user, while still retaining > the possibility to open/create files within the pre-opened directory set. > I’ve been contemplating this, and I want to propose a different solution. First, the problem Stas is solving is quite narrow and doesn’t actually need kernel support: if I want to write a user program that sandboxes itself, I have at least three solutions already. I can make a userns and a mountns; I can use landlock; and I can have a separate process that brokers filesystem access using SCM_RIGHTS. But what if I want to run a container, where the container can access a specific host directory, and the contained application is not aware of the exact technology being used? I recently started using containers in anger in a production setting, and “anger” was definitely the right word: binding part of a filesystem in is *miserable*. Getting the DAC rules right is nasty. LSMs are worse. Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I actually gave up on making one of my use cases work on a Fedora system. Here’s what I wanted to do, logically, in production: pick a host directory, pick a host *principal* (UID, GID, label, etc), and have the *entire container* access the directory as that principal. This is what happens automatically if I run the whole container as a userns with only a single UID mapped, but I don’t really want to do that for a whole variety and of reasons. So maybe reimagining Stas’ feature a bit can actually solve this problem. Instead of a special dirfd, what if there was a special subtree (in the sense of open_tree) that captures a set of creds and does all opens inside the subtree using those creds? This isn’t a fully formed proposal, but I *think* it should be generally fairly safe for even an unprivileged user to clone a subtree with a specific flag set to do this. Maybe a capability would be needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating this to a daemon if a privilege is needed, and getting the API right might be a bit tricky. Then two different things could be done: 1. The subtree could be used unmounted or via /proc magic links. This would be for programs that are aware of this interface. 2. The subtree could be mounted, and accessed through the mount would use the captured creds. (Hmm. What would a new open_tree() pointing at this special subtree do?) With all this done, if userspace wired it up, a container user could do something like: —bind-capture-creds source=dest And the contained program would access source *as the user who started the container*, and this would just work without relabeling or fiddling with owner uids or gids or ACLs, and it would continue to work even if the container has multiple dynamically allocated subuids mapped (e.g. one for “root” and one for the actual application). Bonus points for the ability to revoke the creds in an already opened subtree. Or even for the creds to automatically revoke themselves when the opener exits (or maybe when a specific cred-pinning fd goes away). (This should work for single files as well as for directories.) New LSM hooks or extensions of existing hooks might be needed to make LSMs comfortable with this. What do you all think?
28.04.2024 19:41, Andy Lutomirski пишет: >> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote: >> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall. >> It is needed to perform an open operation with the creds that were in >> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW >> flag. This allows the process to pre-open some dirs and switch eUID >> (and other UIDs/GIDs) to the less-privileged user, while still retaining >> the possibility to open/create files within the pre-opened directory set. >> > Then two different things could be done: > > 1. The subtree could be used unmounted or via /proc magic links. This > would be for programs that are aware of this interface. > > 2. The subtree could be mounted, and accessed through the mount would > use the captured creds. Doesn't this have the same problem that was pointed to me? Namely (explaining my impl first), that if someone puts the cred fd to an unaware process's fd table, such process can't fully drop its privs. He may not want to access these dirs, but once its hacked, the hacker will access these dirs with the creds came from an outside. My solution was to close such fds on exec and disallowing SCM_RIGHTS passage. SCM_RIGHTS can be allowed in the future, but the receiver will need to use some new flag to indicate that he is willing to get such an fd. Passage via exec() can probably never be allowed however. If I understand your model correctly, you put a magic sub-tree to the fs scope of some unaware process. He may not want to access it, but once hacked, the hacker will access it with the creds from an outside. And, unlike in my impl, in yours there is probably no way to prevent that? In short: my impl confines the hassle within the single process. It can be extended, and then the receiver will need to explicitly allow adding such fds to his fd table. But your idea seems to inherently require 2 processes, and there is probably no way for the second process to say "ok, I allow such sub-tree in my fs scope". And even if he could, in my impl he can just close the cred fd, while in yours it seems to persist. Sorry if I misunderstood your idea.
28.04.2024 20:39, stsp пишет: > In short: my impl confines the hassle within > the single process. It can be extended, and > then the receiver will need to explicitly allow > adding such fds to his fd table. > But your idea seems to inherently require > 2 processes, and there is probably no way > for the second process to say "ok, I allow > such sub-tree in my fs scope". There is probably 1 more detail I had to make explicit: if my model is extended to allow SCM_RIGHTS passage by the use of a new flag on a receiver's side, the kernel will be able to check that the receiver currently has the same creds as the sender. Please note that my model is needed for the process that does setuid() to a less-privileged user. He would need to receive the cred fd _before_ such setuid, he would need to use the special flag to receive it, and the kernel will also need to check that he has the same creds anyway, so no escalation can ever happen. Only that sequence of events makes the SCM_RIGHTS passage safe, AFAICT. But if you don't allow SCM_RIGHTS, as my current impl does, then you are sure the process can raise the creds only to his own ones. Now talking about your sub-tree idea, would it be possible to check that the process had initially enough creds to access it w/o inheriting anything? That part looks important to me, i.e. the process must not access anything that he couldn't access before dropping privs. But I am not sure if your idea even includes the priv drop.
> On Apr 28, 2024, at 10:39 AM, stsp <stsp2@yandex.ru> wrote: > > 28.04.2024 19:41, Andy Lutomirski пишет: >>>> On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote: >>> This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall. >>> It is needed to perform an open operation with the creds that were in >>> effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW >>> flag. This allows the process to pre-open some dirs and switch eUID >>> (and other UIDs/GIDs) to the less-privileged user, while still retaining >>> the possibility to open/create files within the pre-opened directory set. >>> >> Then two different things could be done: >> >> 1. The subtree could be used unmounted or via /proc magic links. This >> would be for programs that are aware of this interface. >> >> 2. The subtree could be mounted, and accessed through the mount would >> use the captured creds. > Doesn't this have the same problem > that was pointed to me? Namely (explaining > my impl first), that if someone puts the cred > fd to an unaware process's fd table, such > process can't fully drop its privs. He may not > want to access these dirs, but once its hacked, > the hacker will access these dirs with the > creds came from an outside. This is not a real problem. If I have a writable fd for /etc/shadow or an fd for /dev/mem, etc, then I need close them to fully drop privs. The problem is that, in current kernels, directory fds don’t allow access using their f_cred, and changing that means that existing software that does not intend to create a capability will start to do so. > My solution was to close such fds on > exec and disallowing SCM_RIGHTS passage. I don't see what problem this solves. > SCM_RIGHTS can be allowed in the future, > but the receiver will need to use some > new flag to indicate that he is willing to > get such an fd. Passage via exec() can > probably never be allowed however. > > If I understand your model correctly, you > put a magic sub-tree to the fs scope of some > unaware process. Only if I want that process to have access! > He may not want to access > it, but once hacked, the hacker will access > it with the creds from an outside. > And, unlike in my impl, in yours there is > probably no way to prevent that? This is fundamental to the whole model. If I stick a FAT formatted USB drive in the system and mount it, then any process that can find its way to the mountpoint can write to it. And if I open a dirfd, any process with that dirfd can write it. This is old news and isn't a problem. > > In short: my impl confines the hassle within > the single process. It can be extended, and > then the receiver will need to explicitly allow > adding such fds to his fd table. > But your idea seems to inherently require > 2 processes, and there is probably no way > for the second process to say "ok, I allow > such sub-tree in my fs scope". A process could use my proposal to open_tree directory, make a whole new mountns that is otherwise empty, switch to the mountns, mount the directory, then change its UID and drop all privs, and still have access to that one directory. But even right now, a process could unshare userns and mountns, map its uid as some nonzero number, rearrange its mountns so it only has access to that one directory, drop all privs, and seccomp itself, and only have access to that directory, as it still has the same UID. Take your pick.
28.04.2024 23:19, Andy Lutomirski пишет: >> Doesn't this have the same problem >> that was pointed to me? Namely (explaining >> my impl first), that if someone puts the cred >> fd to an unaware process's fd table, such >> process can't fully drop its privs. He may not >> want to access these dirs, but once its hacked, >> the hacker will access these dirs with the >> creds came from an outside. > This is not a real problem. If I have a writable fd for /etc/shadow or > an fd for /dev/mem, etc, then I need close them to fully drop privs. But isn't that becoming a problem once you are (maliciously) passed such fds via exec() or SCM_RIGHTS? You may not know about them (or about their creds), so you won't close them. Or? > The problem is that, in current kernels, directory fds don’t allow > access using their f_cred, and changing that means that existing > software that does not intend to create a capability will start to do > so. Which is what I am trying to do. :) >> My solution was to close such fds on >> exec and disallowing SCM_RIGHTS passage. > I don't see what problem this solves. That the process that received them, doesn't know they have O_CRED_ALLOW within. So it won't deduce to close them in time. Again, this is not the only possible solution. The receiver can indicate its will to receive them anyway, and the kernel can check if such transaction is safe. But it was simpler to just disallow, who needs to pass those? >> SCM_RIGHTS can be allowed in the future, >> but the receiver will need to use some >> new flag to indicate that he is willing to >> get such an fd. Passage via exec() can >> probably never be allowed however. >> >> If I understand your model correctly, you >> put a magic sub-tree to the fs scope of some >> unaware process. > Only if I want that process to have access! Who is "I" in that context? Can it be some malicious entity? >> He may not want to access >> it, but once hacked, the hacker will access >> it with the creds from an outside. >> And, unlike in my impl, in yours there is >> probably no way to prevent that? > This is fundamental to the whole model. If I stick a FAT formatted USB > drive in the system and mount it, then any process that can find its > way to the mountpoint can write to it. And if I open a dirfd, any > process with that dirfd can write it. This is old news and isn't a > problem. But IIRC O_DIRECTORY only allows O_RDONLY. I even re-checked now, and O_DIRECTORY|O_RDWR gives EISDIR. So is it actually true that whoever has dir_fd, can write to it? >> In short: my impl confines the hassle within >> the single process. It can be extended, and >> then the receiver will need to explicitly allow >> adding such fds to his fd table. >> But your idea seems to inherently require >> 2 processes, and there is probably no way >> for the second process to say "ok, I allow >> such sub-tree in my fs scope". > A process could use my proposal to open_tree directory, make a whole > new mountns that is otherwise empty, switch to the mountns, mount the > directory, then change its UID and drop all privs, and still have > access to that one directory. Ok, if that requires actions that can't be done after priv drop, then it can indeed fully drop privs w/o mounting anything, if he doesn't want such access. Then the only security-related difference with my approach is that mine guarantees nothing new can be accessed, no matter who passes what. Currently nothing can be passed at all, but if it can - my approach would still guarantee only the same creds can be passed, not a new ones. Can the same restriction be applied in your case?
On Sun, Apr 28, 2024 at 2:15 PM stsp <stsp2@yandex.ru> wrote: > > 28.04.2024 23:19, Andy Lutomirski пишет: > >> Doesn't this have the same problem > >> that was pointed to me? Namely (explaining > >> my impl first), that if someone puts the cred > >> fd to an unaware process's fd table, such > >> process can't fully drop its privs. He may not > >> want to access these dirs, but once its hacked, > >> the hacker will access these dirs with the > >> creds came from an outside. > > This is not a real problem. If I have a writable fd for /etc/shadow or > > an fd for /dev/mem, etc, then I need close them to fully drop privs. > > But isn't that becoming a problem once > you are (maliciously) passed such fds via > exec() or SCM_RIGHTS? You may not know > about them (or about their creds), so you > won't close them. Or? Wait, who's the malicious party? Anyone who can open a directory has, at the time they do so, permission to do so. If you send that fd to someone via SCM_RIGHTS, all you accomplish is that they now have the fd. In my scenario, the malicious party attacks an *existing* program that opens an fd for purposes that it doesn't think are dangerous. And then it gives the fd *to the malicious program* by whatever means (could be as simple as dropping privs then doing dlopen). Then the malicious program does OA2_INHERIT_CREDS and gets privileges it shouldn't have. But if the *whole point* of opening the fd was to capture privileges and preserve them across a privilege drop, and the program loads malicious code after dropping privs, then that's a risk that's taken intentionally. This is like how, if you do curl http://whatever.com/foo.sh | bash, you are granting all kinds of permissions to unknown code. > >> My solution was to close such fds on > >> exec and disallowing SCM_RIGHTS passage. > > I don't see what problem this solves. > > That the process that received them, > doesn't know they have O_CRED_ALLOW > within. So it won't deduce to close them > in time. Hold on -- what exactly are you talking about? A process does recvmsg() and doesn't trust the party at the other end. Then it doesn't close the received fd. Then it does setuid(getuid()). Then it does dlopen or exec of an untrusted program. Okay, so the program now has a completely unknown fd. This is already part of the thread model. It could be a cred-capturing fd, it could be a device node, it could be a socket, it could be a memfd -- it could be just about anything. How do any of your proposals or my proposals cause an actual new problem here? > > This is fundamental to the whole model. If I stick a FAT formatted USB > > drive in the system and mount it, then any process that can find its > > way to the mountpoint can write to it. And if I open a dirfd, any > > process with that dirfd can write it. This is old news and isn't a > > problem. > > But IIRC O_DIRECTORY only allows O_RDONLY. > I even re-checked now, and O_DIRECTORY|O_RDWR > gives EISDIR. So is it actually true that > whoever has dir_fd, can write to it? If the filesystem grants that UID permission to write, then it can write.
29.04.2024 00:30, Andy Lutomirski пишет: > On Sun, Apr 28, 2024 at 2:15 PM stsp <stsp2@yandex.ru> wrote: >> But isn't that becoming a problem once >> you are (maliciously) passed such fds via >> exec() or SCM_RIGHTS? You may not know >> about them (or about their creds), so you >> won't close them. Or? > Wait, who's the malicious party? Someone who opens an fd with O_CRED_ALLOW and passes it to an unsuspecting process. This is at least how I understood the Christian Brauner's point about "unsuspecting userspace". > Anyone who can open a directory has, > at the time they do so, permission to do so. If you send that fd to > someone via SCM_RIGHTS, all you accomplish is that they now have the > fd. Normally yes. But fd with O_CRED_ALLOW prevents the receiver from fully dropping his privs, even if he doesn't want to deal with it. > In my scenario, the malicious party attacks an *existing* program that > opens an fd for purposes that it doesn't think are dangerous. And > then it gives the fd *to the malicious program* by whatever means > (could be as simple as dropping privs then doing dlopen). Then the > malicious program does OA2_INHERIT_CREDS and gets privileges it > shouldn't have. But what about an inverse scenario? Malicious program passes an fd to the "unaware" program, putting it under a risk. That program probably never cared about security, since it doesn't play with privs. But suddenly it has privs, passed out of nowhere (via exec() for example), and someone who hacks it, takes them. >>>> My solution was to close such fds on >>>> exec and disallowing SCM_RIGHTS passage. >>> I don't see what problem this solves. >> That the process that received them, >> doesn't know they have O_CRED_ALLOW >> within. So it won't deduce to close them >> in time. > Hold on -- what exactly are you talking about? A process does > recvmsg() and doesn't trust the party at the other end. Then it > doesn't close the received fd. Then it does setuid(getuid()). Then > it does dlopen or exec of an untrusted program. > > Okay, so the program now has a completely unknown fd. This is already > part of the thread model. It could be a cred-capturing fd, it could > be a device node, it could be a socket, it could be a memfd -- it > could be just about anything. How do any of your proposals or my > proposals cause an actual new problem here? I am not actually sure how widely does this spread. I.e. /dev/mem is restricted these days, but if you can freely pass device nodes around, then perhaps the ability to pass an r/o dir fd that can suddenly give creds, is probably not something new... But I really don't like to add to this particular set of cases. I don't think its safe, I just think its legacy, so while it is done that way currently, doesn't mean I can do the same thing and call it "secure" just because something like this was already possible. Or is this actually completely safe? Does it hurt to have O_CRED_ALLOW non-passable? >>> This is fundamental to the whole model. If I stick a FAT formatted USB >>> drive in the system and mount it, then any process that can find its >>> way to the mountpoint can write to it. And if I open a dirfd, any >>> process with that dirfd can write it. This is old news and isn't a >>> problem. >> But IIRC O_DIRECTORY only allows O_RDONLY. >> I even re-checked now, and O_DIRECTORY|O_RDWR >> gives EISDIR. So is it actually true that >> whoever has dir_fd, can write to it? > If the filesystem grants that UID permission to write, then it can write. Which to me sounds like owning an O_DIRECTORY fd only gives you the ability to skip the permission checks of the outer path components, but not the inner ones. So passing it w/o O_CRED_ALLOW was quite safe and didn't give you any new abilities.
29.04.2024 01:12, stsp пишет: > freely pass device nodes around, then > perhaps the ability to pass an r/o dir fd > that can suddenly give creds, is probably > not something new... Actually there is probably something new anyway. The process knows to close all sensitive files before dropping privs. But this may not be the case with dirs, because dir_fd pretty much invalidates itself when you drop privs: you'll get EPERM from openat() if you no longer have rights to open the file, and dir_fd doesn't help. Which is why someone may neglect to close dir_fd before dropping privs, but with O_CRED_ALLOW that would be a mistake. Or what about this scenario: receiver gets this fd thinking its some useful and harmless file fd that would be needed even after priv drop. So he makes sure with F_GETFL that this fd is O_RDONLY and doesn't close it. But it appears to be malicious O_CRED_ALLOW dir_fd, which basically exposes many files for a write. So I am concerned about the cases where an fd was not closed before a priv drop, because the process didn't realize the threat. > But if the*whole point* of opening the fd was to capture privileges > and preserve them across a privilege drop, and the program loads > malicious code after dropping privs, then that's a risk that's taken > intentionally. If you opened an fd yourself, then yes. You know what have you opened, and you care to close sensitive fds before dropping privs, or you take the risk. But if you requested something from another process and got some fd as the result, the kernel doesn't guarantee you got an fd to what you have requested. You can get a malicious fd, which is not "so" possible when you open an fd yourself. So if you want to keep such an fd open, you will probably at least make sure its read-only, its not a device node (with fstat) and so on. All checks pass, and oops, O_CRED_ALLOW... So why my patch makes O_CRED_ALLOW non-passable? Because the receiver has no way to see the potential harm within such fd. So if he intended to keep an fd open after some basic safety checks, he will be tricked. So while I think its a rather bad idea to leave the received fds open after priv drop, I don't completely exclude the possibility that someone did that, together with a few safety checks which would be tricked by O_CRED_ALLOW.
On Sun, Apr 28, 2024 at 09:41:20AM -0700, Andy Lutomirski wrote: > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote: > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall. > > It is needed to perform an open operation with the creds that were in > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW > > flag. This allows the process to pre-open some dirs and switch eUID > > (and other UIDs/GIDs) to the less-privileged user, while still retaining > > the possibility to open/create files within the pre-opened directory set. > > > > I’ve been contemplating this, and I want to propose a different solution. > > First, the problem Stas is solving is quite narrow and doesn’t > actually need kernel support: if I want to write a user program that > sandboxes itself, I have at least three solutions already. I can make > a userns and a mountns; I can use landlock; and I can have a separate > process that brokers filesystem access using SCM_RIGHTS. > > But what if I want to run a container, where the container can access > a specific host directory, and the contained application is not aware > of the exact technology being used? I recently started using > containers in anger in a production setting, and “anger” was > definitely the right word: binding part of a filesystem in is > *miserable*. Getting the DAC rules right is nasty. LSMs are worse. Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and move_mount(). I rewrote the bind-mount logic in systemd based on that and util-linux uses that as well now. https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html > Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I > actually gave up on making one of my use cases work on a Fedora > system. > > Here’s what I wanted to do, logically, in production: pick a host > directory, pick a host *principal* (UID, GID, label, etc), and have > the *entire container* access the directory as that principal. This is > what happens automatically if I run the whole container as a userns > with only a single UID mapped, but I don’t really want to do that for > a whole variety and of reasons. You're describing idmapped mounts for the most part which are upstream and are used in exactly that way by a lot of userspace. > > So maybe reimagining Stas’ feature a bit can actually solve this > problem. Instead of a special dirfd, what if there was a special > subtree (in the sense of open_tree) that captures a set of creds and > does all opens inside the subtree using those creds? That would mean override creds in the VFS layer when accessing a specific subtree which is a terrible idea imho. Not just because it will quickly become a potential dos when you do that with a lot of subtrees it will also have complex interactions with overlayfs. > > This isn’t a fully formed proposal, but I *think* it should be > generally fairly safe for even an unprivileged user to clone a subtree > with a specific flag set to do this. Maybe a capability would be > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating > this to a daemon if a privilege is needed, and getting the API right > might be a bit tricky. > > Then two different things could be done: > > 1. The subtree could be used unmounted or via /proc magic links. This > would be for programs that are aware of this interface. > > 2. The subtree could be mounted, and accessed through the mount would > use the captured creds. > > (Hmm. What would a new open_tree() pointing at this special subtree do?) > > > With all this done, if userspace wired it up, a container user could > do something like: > > —bind-capture-creds source=dest > > And the contained program would access source *as the user who started > the container*, and this would just work without relabeling or > fiddling with owner uids or gids or ACLs, and it would continue to > work even if the container has multiple dynamically allocated subuids > mapped (e.g. one for “root” and one for the actual application). > > Bonus points for the ability to revoke the creds in an already opened > subtree. Or even for the creds to automatically revoke themselves when > the opener exits (or maybe when a specific cred-pinning fd goes away). > > (This should work for single files as well as for directories.) > > New LSM hooks or extensions of existing hooks might be needed to make > LSMs comfortable with this. > > What do you all think? I think the problem you're describing is already mostly solved.
On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote: > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote: > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall. > > It is needed to perform an open operation with the creds that were in > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW > > flag. This allows the process to pre-open some dirs and switch eUID > > (and other UIDs/GIDs) to the less-privileged user, while still retaining > > the possibility to open/create files within the pre-opened directory set. > > > > I’ve been contemplating this, and I want to propose a different solution. > > First, the problem Stas is solving is quite narrow and doesn’t > actually need kernel support: if I want to write a user program that > sandboxes itself, I have at least three solutions already. I can make > a userns and a mountns; I can use landlock; and I can have a separate > process that brokers filesystem access using SCM_RIGHTS. > > But what if I want to run a container, where the container can access > a specific host directory, and the contained application is not aware > of the exact technology being used? I recently started using > containers in anger in a production setting, and “anger” was > definitely the right word: binding part of a filesystem in is > *miserable*. Getting the DAC rules right is nasty. LSMs are worse. > Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I > actually gave up on making one of my use cases work on a Fedora > system. > > Here’s what I wanted to do, logically, in production: pick a host > directory, pick a host *principal* (UID, GID, label, etc), and have > the *entire container* access the directory as that principal. This is > what happens automatically if I run the whole container as a userns > with only a single UID mapped, but I don’t really want to do that for > a whole variety and of reasons. > > So maybe reimagining Stas’ feature a bit can actually solve this > problem. Instead of a special dirfd, what if there was a special > subtree (in the sense of open_tree) that captures a set of creds and > does all opens inside the subtree using those creds? > > This isn’t a fully formed proposal, but I *think* it should be > generally fairly safe for even an unprivileged user to clone a subtree > with a specific flag set to do this. Maybe a capability would be > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating > this to a daemon if a privilege is needed, and getting the API right > might be a bit tricky. Tying this to an actual mount rather than a file handle sounds like a more plausible proposal than OA2_CRED_INHERIT, but it just seems that this is going to re-create all of the work that went into id-mapped mounts but with the extra-special step of making the generic VFS permissions no longer work normally (unless the idea is that everything would pretend to be owned by current_fsuid()?). IMHO it also isn't enough to just make open work, you need to make all operations work (which leads to a non-trivial amount of filesystem-specific handling), which is just idmapped mounts. A lot of work was put into making sure that is safe, and collapsing owners seems like it will cause a lot of headaches. I also find it somewhat amusing that this proposal is to basically give up on multi-user permissions for this one directory tree because it's too annoying to deal with. In that case, isn't chmod 777 a simpler solution? (I'm being a bit flippant, of course there is a difference, but the net result is that all users in the container would have the same permissions with all of the fun issues that implies.) In short, AFAICS idmapped mounts pretty much solve this problem (minus the ability to collapse users, which I suspect is not a good idea in general)? > Then two different things could be done: > > 1. The subtree could be used unmounted or via /proc magic links. This > would be for programs that are aware of this interface. > > 2. The subtree could be mounted, and accessed through the mount would > use the captured creds. > > (Hmm. What would a new open_tree() pointing at this special subtree do?) > > > With all this done, if userspace wired it up, a container user could > do something like: > > —bind-capture-creds source=dest > > And the contained program would access source *as the user who started > the container*, and this would just work without relabeling or > fiddling with owner uids or gids or ACLs, and it would continue to > work even if the container has multiple dynamically allocated subuids > mapped (e.g. one for “root” and one for the actual application). > > Bonus points for the ability to revoke the creds in an already opened > subtree. Or even for the creds to automatically revoke themselves when > the opener exits (or maybe when a specific cred-pinning fd goes away). > > (This should work for single files as well as for directories.) > > New LSM hooks or extensions of existing hooks might be needed to make > LSMs comfortable with this. > > What do you all think?
Replying to a couple emails at once... On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > On 2024-04-28, Andy Lutomirski <luto@amacapital.net> wrote: > > > On Apr 26, 2024, at 6:39 AM, Stas Sergeev <stsp2@yandex.ru> wrote: > > > This patch-set implements the OA2_CRED_INHERIT flag for openat2() syscall. > > > It is needed to perform an open operation with the creds that were in > > > effect when the dir_fd was opened, if the dir was opened with O_CRED_ALLOW > > > flag. This allows the process to pre-open some dirs and switch eUID > > > (and other UIDs/GIDs) to the less-privileged user, while still retaining > > > the possibility to open/create files within the pre-opened directory set. > > > > > > > I’ve been contemplating this, and I want to propose a different solution. > > > > First, the problem Stas is solving is quite narrow and doesn’t > > actually need kernel support: if I want to write a user program that > > sandboxes itself, I have at least three solutions already. I can make > > a userns and a mountns; I can use landlock; and I can have a separate > > process that brokers filesystem access using SCM_RIGHTS. > > > > But what if I want to run a container, where the container can access > > a specific host directory, and the contained application is not aware > > of the exact technology being used? I recently started using > > containers in anger in a production setting, and “anger” was > > definitely the right word: binding part of a filesystem in is > > *miserable*. Getting the DAC rules right is nasty. LSMs are worse. > > Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I > > actually gave up on making one of my use cases work on a Fedora > > system. > > > > Here’s what I wanted to do, logically, in production: pick a host > > directory, pick a host *principal* (UID, GID, label, etc), and have > > the *entire container* access the directory as that principal. This is > > what happens automatically if I run the whole container as a userns > > with only a single UID mapped, but I don’t really want to do that for > > a whole variety and of reasons. > > > > So maybe reimagining Stas’ feature a bit can actually solve this > > problem. Instead of a special dirfd, what if there was a special > > subtree (in the sense of open_tree) that captures a set of creds and > > does all opens inside the subtree using those creds? > > > > This isn’t a fully formed proposal, but I *think* it should be > > generally fairly safe for even an unprivileged user to clone a subtree > > with a specific flag set to do this. Maybe a capability would be > > needed (CAP_CAPTURE_CREDS?), but it would be nice to allow delegating > > this to a daemon if a privilege is needed, and getting the API right > > might be a bit tricky. > > Tying this to an actual mount rather than a file handle sounds like a > more plausible proposal than OA2_CRED_INHERIT, but it just seems that > this is going to re-create all of the work that went into id-mapped > mounts but with the extra-special step of making the generic VFS > permissions no longer work normally (unless the idea is that everything > would pretend to be owned by current_fsuid()?). I was assuming that the owner uid and gid would be show to stat, etc as usual. But the permission checks would be done against the captured creds. > > IMHO it also isn't enough to just make open work, you need to make all > operations work (which leads to a non-trivial amount of > filesystem-specific handling), which is just idmapped mounts. A lot of > work was put into making sure that is safe, and collapsing owners seems > like it will cause a lot of headaches. > > I also find it somewhat amusing that this proposal is to basically give > up on multi-user permissions for this one directory tree because it's > too annoying to deal with. In that case, isn't chmod 777 a simpler > solution? (I'm being a bit flippant, of course there is a difference, > but the net result is that all users in the container would have the > same permissions with all of the fun issues that implies.) > > In short, AFAICS idmapped mounts pretty much solve this problem (minus > the ability to collapse users, which I suspect is not a good idea in > general)? > With my kernel hat on, maybe I agree. But with my *user* hat on, I think I pretty strongly disagree. Look, idmapis lousy for unprivileged use: $ install -m 0700 -d test_directory $ echo 'hi there' >test_directory/file $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] # cat /tmp/file hi there <-- Hey, look, this kind of works! # setpriv --reuid=1 ls /tmp ls: cannot open directory '/tmp': Permission denied <-- Gee, thanks, Linux! Obviously this is a made up example. But it's quite analogous to a real example. Suppose I want to make a directory that will contain some MySQL data. I don't want to share this directory with anyone else, so I set its mode to 0700. Then I want to fire up an unprivileged MySQL container, so I build or download it, and then I run it and bind my directory to /var/lib/mysql and I run it. I don't need to think about UIDs or anything because it's 2024 and containers just work. Okay, I need to setenforce 0 because I'm on Fedora and SELinux makes absolutely no sense in a container world, but I can live with that. Except that it doesn't work! Because unless I want to manually futz with the idmaps to get mysql to have access to the directory inside the container, only *root* gets to get in. But I bet that even futzing with the idmap doesn't work, because software like mysql often expects that root *and* a user can access data. And some software even does privilege separation and uses more than one UID. So I want a way to give *an entire container* access to a directory. Classic UNIX DAC is just *wrong* for this use case. Maybe idmaps could learn a way to squash multiple ids down to one. Or maybe something like my silly credential-capturing mount proposal could work. But the status quo is not actually amazing IMO. I haven't looked at the idmap implementation nearly enough to have any opinion as to whether squashing UID is practical or whether there's any sensible way to specify it in the configuration. > On Apr 29, 2024, at 2:12 AM, Christian Brauner <brauner@kernel.org> wrote: > > Nowadays it's extremely simple due tue open_tree(OPEN_TREE_CLONE) and > move_mount(). I rewrote the bind-mount logic in systemd based on that > and util-linux uses that as well now. > https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html > Yep, I remember that. >> Podman’s “bind,relabel” feature is IMO utterly disgusting. I think I >> actually gave up on making one of my use cases work on a Fedora >> system. >> >> Here’s what I wanted to do, logically, in production: pick a host >> directory, pick a host *principal* (UID, GID, label, etc), and have >> the *entire container* access the directory as that principal. This is >> what happens automatically if I run the whole container as a userns >> with only a single UID mapped, but I don’t really want to do that for >> a whole variety and of reasons. > > You're describing idmapped mounts for the most part which are upstream > and are used in exactly that way by a lot of userspace. > See above... >> >> So maybe reimagining Stas’ feature a bit can actually solve this >> problem. Instead of a special dirfd, what if there was a special >> subtree (in the sense of open_tree) that captures a set of creds and >> does all opens inside the subtree using those creds? > > That would mean override creds in the VFS layer when accessing a > specific subtree which is a terrible idea imho. Not just because it will > quickly become a potential dos when you do that with a lot of subtrees > it will also have complex interactions with overlayfs. I was deliberately talking about semantics, not implementation. This may well be impossible to implement straightforwardly.
On Mon, May 6, 2024 at 10:29 AM Andy Lutomirski <luto@amacapital.net> wrote: > > Replying to a couple emails at once... > > On Mon, May 6, 2024 at 12:14 AM Aleksa Sarai <cyphar@cyphar.com> wrote: > > I also find it somewhat amusing that this proposal is to basically give > > up on multi-user permissions for this one directory tree because it's > > too annoying to deal with. In that case, isn't chmod 777 a simpler > > solution? (I'm being a bit flippant, of course there is a difference, > > but the net result is that all users in the container would have the > > same permissions with all of the fun issues that implies.) > > > > In short, AFAICS idmapped mounts pretty much solve this problem (minus > > the ability to collapse users, which I suspect is not a good idea in > > general)? > > > > With my kernel hat on, maybe I agree. But with my *user* hat on, I > think I pretty strongly disagree. Look, idmapis lousy for > unprivileged use: > > $ install -m 0700 -d test_directory > $ echo 'hi there' >test_directory/file > $ podman run -it --rm > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] > # cat /tmp/file > hi there > > <-- Hey, look, this kind of works! > > # setpriv --reuid=1 ls /tmp > ls: cannot open directory '/tmp': Permission denied > > <-- Gee, thanks, Linux! I should add: this is lousy even for privileged use. On a normal non-containerized system: $ ls -ld /var/lib/mysql drwxr-xr-x. 3 mysql mysql 4096 Sep 20 2023 /var/lib/mysql This makes perfect sense. But if I want to run mysql in a container in a sane way, my only real choice is either to trust the container manager quite strongly (so it only maps this directory into the correct container) or, if I want to separate out management of this container into its own UID (which is a good practice), then I'm forced to do some kind of fragile hack like making a directory only accessible to the correct UID and then creating a 0777 directory inside it and bind-mounting *that*.
... > So I want a way to give *an entire container* access to a directory. > Classic UNIX DAC is just *wrong* for this use case. Maybe idmaps > could learn a way to squash multiple ids down to one. Or maybe > something like my silly credential-capturing mount proposal could > work. But the status quo is not actually amazing IMO. Isn't that what gids are for :-) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Mon, May 6, 2024 at 12:35 PM David Laight <David.Laight@aculab.com> wrote: > > ... > > So I want a way to give *an entire container* access to a directory. > > Classic UNIX DAC is just *wrong* for this use case. Maybe idmaps > > could learn a way to squash multiple ids down to one. Or maybe > > something like my silly credential-capturing mount proposal could > > work. But the status quo is not actually amazing IMO. > > Isn't that what gids are for :-) I dunno. How, exactly, is a regular non-root user of a Linux computer supposed to configure gids in their home directory so that a container (which uses subgids, possibly dynamically allocated) gets access to the correct thing? And why should that poor user need to think about this at all? --Andy
> With my kernel hat on, maybe I agree. But with my *user* hat on, I > think I pretty strongly disagree. Look, idmapis lousy for > unprivileged use: > > $ install -m 0700 -d test_directory > $ echo 'hi there' >test_directory/file > $ podman run -it --rm > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] as an unprivileged user doesn't use idmapped mounts at all. So I'm not sure what this is showing. I suppose you're talking about idmaps in general. > # cat /tmp/file > hi there > > <-- Hey, look, this kind of works! > > # setpriv --reuid=1 ls /tmp > ls: cannot open directory '/tmp': Permission denied > > <-- Gee, thanks, Linux! > > > Obviously this is a made up example. But it's quite analogous to a > real example. Suppose I want to make a directory that will contain > some MySQL data. I don't want to share this directory with anyone > else, so I set its mode to 0700. Then I want to fire up an > unprivileged MySQL container, so I build or download it, and then I > run it and bind my directory to /var/lib/mysql and I run it. I don't > need to think about UIDs or anything because it's 2024 and containers > just work. Okay, I need to setenforce 0 because I'm on Fedora and > SELinux makes absolutely no sense in a container world, but I can live > with that. > > Except that it doesn't work! Because unless I want to manually futz > with the idmaps to get mysql to have access to the directory inside > the container, only *root* gets to get in. But I bet that even > futzing with the idmap doesn't work, because software like mysql often > expects that root *and* a user can access data. And some software > even does privilege separation and uses more than one UID. If the directory is 700 and it's owned by say root:root on the host and you want to share that with arbitrary container users then this isn't something you can do today (ignoring group permissions and ACLs for the sake of your argument) even on the host so that's not a limitation of userns or idmapped mounts. That means many to one mappings of uids/gids. > So I want a way to give *an entire container* access to a directory. > Classic UNIX DAC is just *wrong* for this use case. Maybe idmaps > could learn a way to squash multiple ids down to one. Or maybe Many idmappings to one is in principle possible and I've noted that idea down as a possible extension at https://github.com/uapi-group/kernel-features quite a while (2 years?) ago. > I haven't looked at the idmap implementation nearly enough to have any > opinion as to whether squashing UID is practical or whether there's It's doable. The interesting bit to me was that if we want to allow writes we'd need a way to determine what the uid/gid would be to write down. Imho, that's not super difficult to solve though. The most obvious one is that userspace can just determine it when creating the idmapped mount.
On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote: > > > With my kernel hat on, maybe I agree. But with my *user* hat on, I > > think I pretty strongly disagree. Look, idmapis lousy for > > unprivileged use: > > > > $ install -m 0700 -d test_directory > > $ echo 'hi there' >test_directory/file > > $ podman run -it --rm > > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] > > $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] > > as an unprivileged user doesn't use idmapped mounts at all. So I'm not > sure what this is showing. I suppose you're talking about idmaps in > general. Meh, fair enough. But I don't think this would have worked any better with privilege. Can idmaps be programmed by an otherwise unprivileged owner of a userns and a mountns inside? > Many idmappings to one is in principle possible and I've noted that idea > down as a possible extension at > https://github.com/uapi-group/kernel-features quite a while (2 years?) ago. > > > I haven't looked at the idmap implementation nearly enough to have any > > opinion as to whether squashing UID is practical or whether there's > > It's doable. The interesting bit to me was that if we want to allow > writes we'd need a way to determine what the uid/gid would be to write > down. Imho, that's not super difficult to solve though. The most obvious > one is that userspace can just determine it when creating the idmapped > mount. Seems reasonable to me. If this is set up by someone unprivileged, it would need to be that uid/gid.
On Tue, May 07, 2024 at 01:38:42PM -0700, Andy Lutomirski wrote: > On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote: > > > > > With my kernel hat on, maybe I agree. But with my *user* hat on, I > > > think I pretty strongly disagree. Look, idmapis lousy for > > > unprivileged use: > > > > > > $ install -m 0700 -d test_directory > > > $ echo 'hi there' >test_directory/file > > > $ podman run -it --rm > > > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] > > > > $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] > > > > as an unprivileged user doesn't use idmapped mounts at all. So I'm not > > sure what this is showing. I suppose you're talking about idmaps in > > general. > > Meh, fair enough. But I don't think this would have worked any better > with privilege. > > Can idmaps be programmed by an otherwise unprivileged owner of a > userns and a mountns inside? Yes, but only for userns mountable filesystems that support idmapped mounts. IOW, you need privilege over the superblock and the idmapping you're trying to use. > > > Many idmappings to one is in principle possible and I've noted that idea > > down as a possible extension at > > https://github.com/uapi-group/kernel-features quite a while (2 years?) ago. > > > > > I haven't looked at the idmap implementation nearly enough to have any > > > opinion as to whether squashing UID is practical or whether there's > > > > It's doable. The interesting bit to me was that if we want to allow > > writes we'd need a way to determine what the uid/gid would be to write > > down. Imho, that's not super difficult to solve though. The most obvious > > one is that userspace can just determine it when creating the idmapped > > mount. > > Seems reasonable to me. If this is set up by someone unprivileged, it > would need to be that uid/gid.
On Wed, May 8, 2024 at 12:32 AM Christian Brauner <brauner@kernel.org> wrote: > > On Tue, May 07, 2024 at 01:38:42PM -0700, Andy Lutomirski wrote: > > On Tue, May 7, 2024 at 12:42 AM Christian Brauner <brauner@kernel.org> wrote: > > > > > > > With my kernel hat on, maybe I agree. But with my *user* hat on, I > > > > think I pretty strongly disagree. Look, idmapis lousy for > > > > unprivileged use: > > > > > > > > $ install -m 0700 -d test_directory > > > > $ echo 'hi there' >test_directory/file > > > > $ podman run -it --rm > > > > --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] > > > > > > $ podman run -it --rm --mount=type=bind,src=test_directory,dst=/tmp,idmap [debian-slim] > > > > > > as an unprivileged user doesn't use idmapped mounts at all. So I'm not > > > sure what this is showing. I suppose you're talking about idmaps in > > > general. > > > > Meh, fair enough. But I don't think this would have worked any better > > with privilege. > > > > Can idmaps be programmed by an otherwise unprivileged owner of a > > userns and a mountns inside? > > Yes, but only for userns mountable filesystems that support idmapped > mounts. IOW, you need privilege over the superblock and the idmapping > you're trying to use. Hmm. Is there a good reason to require privilege over the superblock? Obviously creating an idmap that allows one to impersonate someone else seems like a problem, but if an unprivileged task already "owns" (see below) a UID or GID, then effectively delegating that UID or GID is would need caution but is not fundamentally terrible. So, if I'm 1000:1000, then creating an idmap that makes some other task (that isn't 1000:1000) get to act as 1000:1000 doesn't grant new powers. But maybe something even more general could be done (although I'm not sure this is worthwhile): if I own a userns and that userns has an outside UID 1001 mapped (via newuidmap, for example), then perhaps letting me configure an idmap that grants UID 1001 seems not especially dangerous. But maybe that particular job should also be delegated to newuidmap. Out of an abundance of caution, maybe this whole thing should be opt-in. For example, there could be a new CAP_DELEGATE that allows delegation of one's own uid and gid. The idea is that it should be safe to grant regular users CAP_DELEGATE as an ambient capability. --Andy