Message ID: 20201102174737.2740-1-sargun@sargun.me
Series: NFS: Fix interaction between fs_context and user namespaces
Hi,

I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS share
with a different user namespace. fsopen() is done in the container
namespaces (user, mnt and net namespaces) while fsconfig(), fsmount() and
move_mount() are done in the host namespaces. The mount on the host is
made available in the container via mount propagation from the host mount.

With this, the files on the NFS server with uid 0 are available in the
container with uid 0. On the host, they are available with uid 4294967294
(make_kuid(&init_user_ns, -2)).

The code to reproduce my test is available at:
https://github.com/kinvolk/nfs-mount-in-userns
The results and traces are attached at the end. (A C sketch of the
fd-passing flow appears below, after the traces and the quoted cover
letter.)

While the basic feature works, I have some thoughts.

First, doing the fsopen() in the container namespaces implements two
features in one step:
1. Selection of the userns for the id mapping translation.
2. Selection of the netns for the connection to the NFS server.

I was wondering if this only considers the scenario where the user wants
to make the connection to the NFS server from the network namespace of
the container. I think there is another valid use case: using the userns
of the container but the netns of the host, or a third-party netns. We
can use the correct sequence of setns() calls to do the fsopen() in the
container userns but in the host netns, but then we'd be in a netns that
does not belong to the current userns, so we would not have any
capability over it. In my tests, that seems to work fine when the netns
and the userns of the fs_context are unrelated. Still, I would find the
API cleaner if the userns and netns were selected explicitly, with
something like:

  sfd = fsopen("nfs4", FSOPEN_CLOEXEC);

  usernsfd = pidfd_open(...);   /* or open("/proc/pid/ns/user") */
  fsconfig(sfd, FSCONFIG_SET_FD, "userns", NULL, usernsfd);

  netnsfd = pidfd_open(...);    /* or open("/proc/pid/ns/net") */
  fsconfig(sfd, FSCONFIG_SET_FD, "netns", NULL, netnsfd);

This would avoid the need for fd passing after the fsopen(). It would
require fsconfig() (possibly in nfs_fs_context_parse_param()) to do the
capability check, but making the selection explicit sounds better to me.

Second, the capability check in fsopen() is the following:

  if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))

This means that we cannot just create a temporary userns with the desired
id mapping; we additionally need to enter a mntns owned by that userns.
However, the code in fsopen() does not seem to do anything with the mntns
(the new mount is only associated with the current mntns at move_mount()
time), so we could just create a temporary userns + mntns. It seems odd
to me that the capability check is done against the current mntns even
though the code does not otherwise use it.

In Kubernetes, the NFS mount is done before creating the user namespace.
The Kubelet's syncPod() (pkg/kubelet/kubelet.go) does the following, in
this order:
1. Mount the volumes with CSI or other volume implementations:
   WaitForAttachAndMount(), line 1667.
2. Call the CRI's createPodSandbox via kl.containerRuntime.SyncPod(),
   line 1678, to create the user namespace and network namespace.
This means that at the time of the NFS mount, we have not yet created the
user namespace or the network namespace, let alone configured the latter
with the CNI plugin.
With this API, where the id mapping for the NFS mount is decided at the
superblock level, we would need to refactor the Kubelet code so that the
CSI mount can be called after the creation of the sandbox and after the
configuration with CNI. This will be more complicated to integrate into
Kubernetes than the idmapped mounts patch set, where the id mapping is
set at the bind mount level
(https://lists.linuxfoundation.org/pipermail/containers/2020-October/042477.html).
On the other hand, this approach is less invasive. It works for NFS
volumes in Kubernetes but would not work for other volumes like hostPath
(a bind mount from the host) where we don't get a new superblock.

Lastly, I checked the implementation of nfs_compare_super() and it seems
fine. In Kubernetes, we want to be able to create several pods with
different userns and mount the same NFS share in each of them. The kernel
has to create different NFS superblocks for that scenario, and it does
that correctly: nfs_compare_super() compares the userns, and the netns as
well.

-----
Running ./nfs-mount-in-userns
strace: Process 4022 attached
[pid  4022] fsopen("nfs4", FSOPEN_CLOEXEC) = 6
[pid  4022] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4022, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
fsconfig(7, FSCONFIG_SET_STRING, "source", "127.0.0.1:/server", 0) = 0
fsconfig(7, FSCONFIG_SET_STRING, "vers", "4.2", 0) = 0
fsconfig(7, FSCONFIG_SET_STRING, "addr", "127.0.0.1", 0) = 0
fsconfig(7, FSCONFIG_SET_STRING, "clientaddr", "127.0.0.1", 0) = 0
fsconfig(7, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(7, FSMOUNT_CLOEXEC, 0) = 6
move_mount(6, "", AT_FDCWD, "/mnt/nfs", MOVE_MOUNT_F_EMPTY_PATH) = 0
+++ exited with 0 +++
./nfs-mount-in-userns returned 0

Last dmesg line about nfs4_create_server:
[55258.702256] nfs4_create_server: Using creds from non-init userns

459 55 0:40 / /mnt/nfs rw,relatime shared:187 - nfs4 127.0.0.1:/server rw,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1

+ : 'Files on the NFS server:'
+ ls -lani /server/
total 20
1048578 drwxr-xr-x.  5 0    0    4096 Nov 10 09:19 .
      2 dr-xr-xr-x. 21 0    0    4096 Nov  9 14:25 ..
1048582 drwx------.  2 0    0    4096 Nov 10 09:19 dir-0
1048583 drwx------.  2 1000 1000 4096 Nov 10 09:19 dir-1000
1048584 drwx------.  2 3000 3000 4096 Nov 10 09:19 dir-3000
1048579 -rw-------.  1 0    0       0 Nov 10 09:19 file-0
1048580 -rw-------.  1 1000 1000    0 Nov 10 09:19 file-1000
1048581 -rw-------.  1 3000 3000    0 Nov 10 09:19 file-3000

+ : 'Files on the NFS mountpoint (from container PoV):'
+ nsenter -U -m -n -t 4002 sh -c 'ls -lani /mnt/nfs'
total 20
1048578 drwxr-xr-x. 5 0     0     4096 Nov 10 09:19 .
 786433 drwxr-xr-x. 3 65534 65534 4096 May 16 16:08 ..
1048582 drwx------. 2 0     0     4096 Nov 10 09:19 dir-0
1048583 drwx------. 2 65534 65534 4096 Nov 10 09:19 dir-1000
1048584 drwx------. 2 65534 65534 4096 Nov 10 09:19 dir-3000
1048579 -rw-------. 1 0     0        0 Nov 10 09:19 file-0
1048580 -rw-------. 1 65534 65534    0 Nov 10 09:19 file-1000
1048581 -rw-------. 1 65534 65534    0 Nov 10 09:19 file-3000

+ : 'Files on the NFS mountpoint (from host PoV):'
+ ls -lani /mnt/nfs/
total 20
1048578 drwxr-xr-x. 5 1000       1000       4096 Nov 10 09:19 .
 786433 drwxr-xr-x. 3 0          0          4096 May 16 16:08 ..
1048582 drwx------. 2 1000       1000       4096 Nov 10 09:19 dir-0
1048583 drwx------. 2 4294967294 4294967294 4096 Nov 10 09:19 dir-1000
1048584 drwx------. 2 4294967294 4294967294 4096 Nov 10 09:19 dir-3000
1048579 -rw-------. 1 1000       1000          0 Nov 10 09:19 file-0
1048580 -rw-------. 1 4294967294 4294967294    0 Nov 10 09:19 file-1000
1048581 -rw-------. 1 4294967294 4294967294    0 Nov 10 09:19 file-3000

Alban

On Mon, 2 Nov 2020 at 18:48, Sargun Dhillon <sargun@sargun.me> wrote:
>
> This is effectively a resend, but re-based atop Anna's current tree. I
> can add the samples back in another patchset.
>
> Right now, it is possible to mount NFS with a non-matching super block
> user ns and NFS sunrpc user ns. This (for the user) results in an
> awkward set of interactions if using anything other than auth_null,
> where the UIDs being sent to the server are different from the local
> UIDs being checked. This can cause "breakage", where if you try to
> communicate with the NFS server with any other set of mappings, it
> breaks.
>
> This is after the initial v5.10 merge window, so hopefully this
> patchset can be reconsidered, and maybe we can make forward progress? I
> think that it takes a relatively conservative approach to enabling user
> namespaces, and it prevents the case where someone is using auth_gss
> (for now), as the mappings are non-trivial.
>
> Changes since v3:
> * Rebase atop Anna's tree
> Changes since v2:
> * Removed samples
> * Split out NFSv2/v3 patchset from NFSv4 patchset
> * Added restrictions around use
> Changes since v1:
> * Added samples
>
> Sargun Dhillon (2):
>   NFS: NFSv2/NFSv3: Use cred from fs_context during mount
>   NFSv4: Refactor NFS to use user namespaces
>
>  fs/nfs/client.c     | 10 ++++++++--
>  fs/nfs/nfs4client.c | 27 ++++++++++++++++++++++++++-
>  fs/nfs/nfs4idmap.c  |  2 +-
>  fs/nfs/nfs4idmap.h  |  3 ++-
>  4 files changed, 37 insertions(+), 5 deletions(-)
>
>
> base-commit: 8c39076c276be0b31982e44654e2c2357473258a
> --
> 2.25.1
>
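For reference, here is a minimal C sketch of the fd-passing flow Alban
describes (fsopen() in the container's namespaces; fsconfig(), fsmount()
and move_mount() in the host namespaces). This is an illustration, not
the actual reproducer: the container pid, mount options and lack of error
handling are assumptions, and raw syscall wrappers are used since older
glibc does not provide fsopen() and friends.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/syscall.h>
  #include <sys/uio.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define CONTAINER_PID "4002"	/* assumption: pid inside the container */

  /* Raw wrappers; recent glibc (>= 2.36) also exposes these directly. */
  static int sys_fsopen(const char *fs, unsigned int flags)
  { return syscall(SYS_fsopen, fs, flags); }
  static int sys_fsconfig(int fd, unsigned int cmd, const char *key,
  			const void *value, int aux)
  { return syscall(SYS_fsconfig, fd, cmd, key, value, aux); }
  static int sys_fsmount(int fd, unsigned int flags, unsigned int attrs)
  { return syscall(SYS_fsmount, fd, flags, attrs); }
  static int sys_move_mount(int from_dfd, const char *from_path, int to_dfd,
  			  const char *to_path, unsigned int flags)
  { return syscall(SYS_move_mount, from_dfd, from_path, to_dfd, to_path, flags); }

  /* Pass one fd over a unix socketpair with SCM_RIGHTS. */
  static void send_fd(int sock, int fd)
  {
  	char c = 0;
  	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
  	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
  	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
  			      .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
  	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  	cmsg->cmsg_level = SOL_SOCKET;
  	cmsg->cmsg_type = SCM_RIGHTS;
  	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  	sendmsg(sock, &msg, 0);
  }

  static int recv_fd(int sock)
  {
  	char c;
  	int fd;
  	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
  	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
  	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
  			      .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
  	recvmsg(sock, &msg, 0);
  	memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
  	return fd;
  }

  int main(void)
  {
  	int sk[2], sfd, mfd;

  	socketpair(AF_UNIX, SOCK_STREAM, 0, sk);
  	if (fork() == 0) {
  		/* Child: enter the container's user, mount and net
  		 * namespaces, then create the fs context there. */
  		const char *types[] = { "user", "mnt", "net" };
  		for (int i = 0; i < 3; i++) {
  			char p[64];
  			int ns;
  			snprintf(p, sizeof(p), "/proc/%s/ns/%s",
  				 CONTAINER_PID, types[i]);
  			ns = open(p, O_RDONLY);
  			setns(ns, 0);	/* 0: accept any namespace type */
  			close(ns);
  		}
  		send_fd(sk[0], sys_fsopen("nfs4", FSOPEN_CLOEXEC));
  		_exit(0);
  	}

  	/* Parent (host namespaces): configure and attach the mount,
  	 * using the options from the trace above. */
  	sfd = recv_fd(sk[1]);
  	wait(NULL);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "source", "127.0.0.1:/server", 0);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "addr", "127.0.0.1", 0);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "clientaddr", "127.0.0.1", 0);
  	sys_fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
  	mfd = sys_fsmount(sfd, FSMOUNT_CLOEXEC, 0);
  	sys_move_mount(mfd, "", AT_FDCWD, "/mnt/nfs", MOVE_MOUNT_F_EMPTY_PATH);
  	return 0;
  }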
On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> Hi,
>
> I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> share with a different user namespace. fsopen() is done in the
> container namespaces (user, mnt and net namespaces) while fsconfig(),
> fsmount() and move_mount() are done in the host namespaces. The mount
> on the host is made available in the container via mount propagation
> from the host mount.
>
> With this, the files on the NFS server with uid 0 are available in the
> container with uid 0. On the host, they are available with uid
> 4294967294 (make_kuid(&init_user_ns, -2)).

Can someone please tell me what is broken with the _current_ design
before we start trying to push "fixes" that clearly break it?

The current design assumes that the user namespace being used is the one
where the mount itself is performed. That means that the uids and gids,
or usernames and groupnames, that go on the wire match the uids and gids
of the container in which the mount occurred.

The assumption is that the server has authenticated that client as
belonging to a domain that it recognises (either through strong
RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
addresses to a list of acceptable clients).

If you go ahead and change the user namespace on the client without
going through the mount process again to mount a different super block
with a different user namespace, then you will get the exact same
behaviour as with any other filesystem.
On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > Hi,
> >
> > I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> > share with a different user namespace. [...]
>
> Can someone please tell me what is broken with the _current_ design
> before we start trying to push "fixes" that clearly break it?

Currently the mechanism of mounting nfs4 in a user namespace is as
follows:

Parent: fork()
Child:  setns(userns)
C:      fsopen("nfs4") = 3
C->P:   Send FD 3
P:      fsconfig...
P:      fsmount... (This is where the CAP_SYS_ADMIN check happens)

Right now, when you mount an NFS filesystem in a non-init user namespace
and UIDs / GIDs are in use, the UIDs / GIDs which are sent to the server
are not the UIDs from the mounting namespace; instead they are the UIDs
from the init user ns.

The reason for this is that you can call fsopen("nfs4") in the
unprivileged namespace, and that configures the fs_context with all the
right information for that user namespace, but we currently require
CAP_SYS_ADMIN in the init user namespace to call fsmount(). This means
that the superblock's user namespace is set "correctly" to the container,
but there's absolutely no way for nfs4idmap to consume an unprivileged
user namespace.

This behaviour happens "the other way" as well, where the UID in the
container may be 0, but the corresponding kuid is 1000. When a response
from an NFS server comes in, we decode it according to the idmap
userns[1]. The userns used to create the idmap is captured at fsmount()
time, not at fsopen() time. So, even if the filesystem is in the user
namespace and the server responds with UID 0, it'll come up as an
unmapped UID. This is because we do:

Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS from_kuid(container_ns, 0) -> invalid uid

This is broken behaviour, in my humble opinion, as it makes it impossible
to use NFSv4 (and v3 for that matter) out of the box with unprivileged
user namespaces. At least in our environment, using usernames / GSS isn't
an option, so we have to rely on UIDs being set correctly [at least from
the container's perspective].

> The current design assumes that the user namespace being used is the
> one where the mount itself is performed. That means that the uids and
> gids or usernames and groupnames that go on the wire match the uids and
> gids of the container in which the mount occurred.

Right now, NFS does not have the ability for the fsmount() call to be
made in an unprivileged user namespace. We can change that behaviour
elsewhere if we want, but it's orthogonal to this.

> The assumption is that the server has authenticated that client as
> belonging to a domain that it recognises (either through strong
> RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> addresses to a list of acceptable clients).

I added a rejection for upcalls because upcalls can happen in the init
namespaces. We can drop that restriction from the nfs4 patch if you'd
like. I *believe* (and I'm more than a little out of my depth here) that
the request-key handler gets called with the *network namespace* of the
NFS mount, but the userns is a privileged one, allowing for potential
hazards.

The reason I added that block there is that I didn't imagine anyone was
running NFS in an unprivileged user namespace and relying on upcalls
(potentially into privileged namespaces) in order to do authz.

> If you go ahead and change the user namespace on the client without
> going through the mount process again to mount a different super block
> with a different user namespace, then you will get the exact same
> behaviour as with any other filesystem.

Not exactly, because other filesystems *only* use the s_user_ns for
conversion of UIDs, whereas NFS uses the current_cred() acquired at mount
time, which doesn't match s_user_ns, leading to this behaviour:

1. Mistranslated UIDs when encoding RPCs
2. The UIDs / GIDs exposed to the VFS do not match the user ns.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

Thanks,
Sargun

[1]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782
[2]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154
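To make the translation chain above concrete, here is a sketch of the
semantics in kernel terms (illustrative only, not a quote of
fs/nfs/nfs4idmap.c; `container_userns` is a hypothetical name for a
container user namespace in which host uid 1000 maps to container uid 0):

  /* The idmap was created against init_user_ns at fsmount() time, so a
   * wire uid of 0 from the server becomes kuid 0: */
  kuid_t kuid = make_kuid(&init_user_ns, 0);       /* kuid 0 */

  /* The VFS then converts for the superblock's user namespace, where
   * only kuid 1000 has a mapping (to container uid 0). kuid 0 has no
   * mapping, so from_kuid() returns (uid_t)-1, the invalid uid: */
  uid_t shown = from_kuid(container_userns, kuid); /* = (uid_t)-1 */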
On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote:
> On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> > Can someone please tell me what is broken with the _current_ design
> > before we start trying to push "fixes" that clearly break it?
>
> Currently the mechanism of mounting nfs4 in a user namespace is as
> follows:
>
> Parent: fork()
> Child:  setns(userns)
> C:      fsopen("nfs4") = 3
> C->P:   Send FD 3
> P:      fsconfig...
> P:      fsmount... (This is where the CAP_SYS_ADMIN check happens)
>
> [...]
>
> This is broken behaviour, in my humble opinion, as it makes it
> impossible to use NFSv4 (and v3 for that matter) out of the box with
> unprivileged user namespaces. At least in our environment, using
> usernames / GSS isn't an option, so we have to rely on UIDs being set
> correctly [at least from the container's perspective].

The current code for setting server->cred was developed independently of
fsopen() (and predates it, actually). I'm fine with the change to have
server->cred be the cred of the user that called fsopen(). That's in line
with what we used to do for sys_mount().

However, all the other stuff to throw errors when the user namespace is
not init_user_ns introduces massive regressions.

> Right now, NFS does not have the ability for the fsmount() call to be
> made in an unprivileged user namespace. We can change that behaviour
> elsewhere if we want, but it's orthogonal to this.
>
> I added a rejection for upcalls because upcalls can happen in the init
> namespaces. We can drop that restriction from the nfs4 patch if you'd
> like. I *believe* (and I'm more than a little out of my depth here)
> that the request-key handler gets called with the *network namespace*
> of the NFS mount, but the userns is a privileged one, allowing for
> potential hazards.

The idmapper already rejects upcalls to the keyring '/sbin/request-key'
utility if you're running with your own user namespace.

Quite frankly, switching to using the keyring was a mistake which I'd
undo if I could. Aside from not supporting containers, it is horribly
slow due to requiring a full process startup/teardown for every upcall,
so it scales poorly to large numbers of identities (particularly with an
operation like readdir() in which you're doing serial upcalls).

However, nothing stops you from using the old NFSv4 idmapper daemon
(a.k.a. rpc.idmapd) in the context of the container that called fsopen()
so that it can translate identities correctly using whatever userspace
tools (ldap, sssd, winbind, ...) the container has configured.

> The reason I added that block there is that I didn't imagine anyone was
> running NFS in an unprivileged user namespace and relying on upcalls
> (potentially into privileged namespaces) in order to do authz.
>
> Not exactly, because other filesystems *only* use the s_user_ns for
> conversion of UIDs, whereas NFS uses the current_cred() acquired at
> mount time, which doesn't match s_user_ns, leading to this behaviour:
>
> 1. Mistranslated UIDs when encoding RPCs
> 2. The UIDs / GIDs exposed to the VFS do not match the user ns.
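A minimal sketch of the change Trond says he is fine with (an
illustration under stated assumptions, not the actual patch): fs_context
records the fsopen() caller's credentials in fc->cred when the context is
allocated, so the NFS mount path could take a reference to that instead
of sampling current_cred() at fsmount() time:

  /* In the NFS server setup path: use the cred captured at fsopen()
   * time (fc->cred) rather than current_cred() at fsmount() time. */
  server->cred = get_cred(fc->cred);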
On Wed, Nov 11, 2020 at 02:38:11PM +0000, Trond Myklebust wrote:
> The current code for setting server->cred was developed independently
> of fsopen() (and predates it, actually). I'm fine with the change to
> have server->cred be the cred of the user that called fsopen(). That's
> in line with what we used to do for sys_mount().

Just curious: without FS_USERNS, how were you mounting NFSv4 in an
unprivileged user ns?

> However, all the other stuff to throw errors when the user namespace is
> not init_user_ns introduces massive regressions.

I can remove that and respin the patch. How do you feel about that? I
would still like to keep the log lines, though, because it is a uapi
change. I am worried that someone might exercise this path with GSS and
allow for upcalls into the main namespaces by accident -- or be confused
about why they're seeing upcalls "in a different namespace".

Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from fs_context
during mount") without any changes?

I can respin ("NFSv4: Refactor NFS to use user namespaces") without:

	/*
	 * nfs4idmap is not fully isolated by user namespaces. It is currently
	 * only network namespace aware. If upcalls never happen, we do not
	 * need to worry as nfs_client instances aren't shared between
	 * user namespaces.
	 */
	if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
	    !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
		error = -EINVAL;
		errorf(fc, "Mount credentials are from non init user namespace and ID mapping is enabled. This is not allowed.");
		goto error;
	}

(and making it so we can call idmap_userns)

> The idmapper already rejects upcalls to the keyring '/sbin/request-key'
> utility if you're running with your own user namespace.
>
> Quite frankly, switching to using the keyring was a mistake which I'd
> undo if I could. [...]
>
> However, nothing stops you from using the old NFSv4 idmapper daemon
> (a.k.a. rpc.idmapd) in the context of the container that called
> fsopen() so that it can translate identities correctly using whatever
> userspace tools (ldap, sssd, winbind, ...) the container has
> configured.

1. We see this [upcalls] as a potential security risk: they reach into
the unconfined portion of the system. Although I'm sure that the
userspace handlers are written perfectly well, it allows for information
leakage to occur.

2. Is there a way to do this for NFSv3?

3. Can rpc.idmapd get the user namespace that the call is from (and is
the keyring per-userns)?

In general, I think that this change follows the principle of least
surprise.
On Wed, 2020-11-11 at 18:57 +0000, Sargun Dhillon wrote:
> Just curious: without FS_USERNS, how were you mounting NFSv4 in an
> unprivileged user ns?

The code was originally developed on a 5.1 kernel, so all my testing has
been with ordinary sys_mount() calls in a container that had
CAP_SYS_ADMIN privileges.

> I can remove that and respin the patch. How do you feel about that? I
> would still like to keep the log lines, though, because it is a uapi
> change. I am worried that someone might exercise this path with GSS and
> allow for upcalls into the main namespaces by accident -- or be
> confused about why they're seeing upcalls "in a different namespace".
>
> Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from
> fs_context during mount") without any changes?

Why do we need the dprintk()s? It seems to me that either they should be
reporting something that the user needs to know (in which case they
should be real printk()s) or they are telling us something that we should
already know. To me they seem to fit more in the latter category.

> I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
>
> 	if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
> 	    !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
> 		[...]
> 	}
>
> (and making it so we can call idmap_userns)

Yes. That would be acceptable. Again, though, I'd like to see the
dprintk()s gone.
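For context on the keyring path discussed above: with the keyring-based
idmapper, the kernel upcalls /sbin/request-key, which dispatches
id_resolver keys according to /etc/request-key.conf, typically via a line
like the following (the standard configuration described in nfsidmap(5);
the exact binary path and timeout value vary by distribution):

  # /etc/request-key.conf: hand NFSv4 id_resolver keys to nfsidmap,
  # caching each result in the keyring for 600 seconds
  create	id_resolver	*	*	/usr/sbin/nfsidmap -t 600 %k %d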
On Wed, Nov 11, 2020 at 08:03:18PM +0000, Trond Myklebust wrote:
> Why do we need the dprintk()s? It seems to me that either they should
> be reporting something that the user needs to know (in which case they
> should be real printk()s) or they are telling us something that we
> should already know. To me they seem to fit more in the latter
> category.
>
> > I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
> > [...]
> > (and making it so we can call idmap_userns)
>
> Yes. That would be acceptable. Again, though, I'd like to see the
> dprintk()s gone.

I can drop the dprintks, but given this is a uapi change, does it make
sense to use pr_info_once()? Especially since this can have security
impact.
On Thu, Nov 12, 2020 at 12:30:56AM +0000, Sargun Dhillon wrote:
> I can drop the dprintks, but given this is a uapi change, does it make
> sense to use pr_info_once()? Especially since this can have security
> impact.

Having spent five minutes thinking about this, I think that is best done
in a separate patch that I can spin, and we can discuss it there.
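For illustration, a one-shot notice along the lines Sargun proposes might
look like the following (a sketch only; the message text and its
placement are assumptions, not part of the posted patches):

  /* pr_info_once() logs only on the first hit, so repeated mounts from
   * non-init user namespaces don't flood the log the way a printk()
   * per mount would: */
  pr_info_once("NFS: mount credentials come from a non-init user namespace\n");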