Message ID: 20201102174737.2740-1-sargun@sargun.me
Series: NFS: Fix interaction between fs_context and user namespaces
Hi,

I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS share
with a different user namespace. fsopen() is done in the container
namespaces (user, mnt and net namespaces) while fsconfig(), fsmount() and
move_mount() are done in the host namespaces. The mount on the host is
made available in the container via mount propagation from the host mount.

With this, the files on the NFS server with uid 0 are available in the
container with uid 0. On the host, they are available with uid 4294967294
(make_kuid(&init_user_ns, -2)).

The code to reproduce my test is available at:
https://github.com/kinvolk/nfs-mount-in-userns
The results and traces are attached at the end. (A C sketch of the
fd-passing flow appears below, after the traces and the quoted cover
letter.)

While the basic feature works, I have some thoughts.

First, doing the fsopen() in the container namespaces implements two
features in one step:
1. Selection of the userns for the id mapping translation.
2. Selection of the netns for the connection to the NFS server.

I was wondering if this only considers the scenario where the user wants
to make the connection to the NFS server from the network namespace of
the container. I think there is another valid use case: using the userns
of the container but the netns of the host, or a third-party netns. We
can use the correct sequence of setns() calls to do the fsopen() in the
container userns but in the host netns, but then we'd be in a netns that
does not belong to the current userns, so we would not have any
capability over it. In my tests, that seems to work fine when the netns
and the userns of the fs_context are unrelated. Still, I would find the
API cleaner if the userns and netns were selected explicitly, with
something like:

  sfd = fsopen("nfs4", FSOPEN_CLOEXEC);

  usernsfd = pidfd_open(...);   /* or open("/proc/pid/ns/user") */
  fsconfig(sfd, FSCONFIG_SET_FD, "userns", NULL, usernsfd);

  netnsfd = pidfd_open(...);    /* or open("/proc/pid/ns/net") */
  fsconfig(sfd, FSCONFIG_SET_FD, "netns", NULL, netnsfd);

This would avoid the need for fd passing after the fsopen(). It would
require fsconfig() (possibly in nfs_fs_context_parse_param()) to do the
capability check, but making the selection explicit sounds better to me.

Second, the capability check in fsopen() is the following:

  if (!ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN))

This means that we cannot just create a temporary userns with the desired
id mapping; we additionally need to enter a mntns owned by that userns.
However, the code in fsopen() does not seem to do anything with the mntns
(the new mount is only associated with the current mntns at move_mount()
time), so we could just create a temporary userns + mntns. It seems odd
to me that the capability check is done against the current mntns even
though the code does not otherwise use it.

In Kubernetes, the NFS mount is done before creating the user namespace.
The Kubelet's syncPod() (pkg/kubelet/kubelet.go) does the following, in
this order:
1. Mount the volumes with CSI or other volume implementations:
   WaitForAttachAndMount(), line 1667.
2. Call the CRI's createPodSandbox via kl.containerRuntime.SyncPod(),
   line 1678, to create the user namespace and network namespace.
This means that at the time of the NFS mount, we have not yet created the
user namespace or the network namespace, let alone configured the latter
with the CNI plugin.
With this API, where the id mapping for the NFS mount is decided at the
superblock level, we would need to refactor the Kubelet code so that the
CSI mount can be called after the creation of the sandbox and after the
configuration with CNI. This will be more complicated to integrate into
Kubernetes than the idmapped mounts patch set, where the id mapping is
set at the bind mount level
(https://lists.linuxfoundation.org/pipermail/containers/2020-October/042477.html).
On the other hand, this approach is less invasive. It works for NFS
volumes in Kubernetes but would not work for other volumes like hostPath
(a bind mount from the host) where we don't get a new superblock.

Lastly, I checked the implementation of nfs_compare_super() and it seems
fine. In Kubernetes, we want to be able to create several pods with
different userns and mount the same NFS share in each of them. The kernel
has to create different NFS superblocks for that scenario, and it does
that correctly: nfs_compare_super() compares the userns, and the netns as
well.

-----
Running ./nfs-mount-in-userns
strace: Process 4022 attached
[pid  4022] fsopen("nfs4", FSOPEN_CLOEXEC) = 6
[pid  4022] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4022, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
fsconfig(7, FSCONFIG_SET_STRING, "source", "127.0.0.1:/server", 0) = 0
fsconfig(7, FSCONFIG_SET_STRING, "vers", "4.2", 0) = 0
fsconfig(7, FSCONFIG_SET_STRING, "addr", "127.0.0.1", 0) = 0
fsconfig(7, FSCONFIG_SET_STRING, "clientaddr", "127.0.0.1", 0) = 0
fsconfig(7, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(7, FSMOUNT_CLOEXEC, 0) = 6
move_mount(6, "", AT_FDCWD, "/mnt/nfs", MOVE_MOUNT_F_EMPTY_PATH) = 0
+++ exited with 0 +++
./nfs-mount-in-userns returned 0

Last dmesg line about nfs4_create_server:
[55258.702256] nfs4_create_server: Using creds from non-init userns

459 55 0:40 / /mnt/nfs rw,relatime shared:187 - nfs4 127.0.0.1:/server rw,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=127.0.0.1,local_lock=none,addr=127.0.0.1

+ : 'Files on the NFS server:'
+ ls -lani /server/
total 20
1048578 drwxr-xr-x.  5 0    0    4096 Nov 10 09:19 .
      2 dr-xr-xr-x. 21 0    0    4096 Nov  9 14:25 ..
1048582 drwx------.  2 0    0    4096 Nov 10 09:19 dir-0
1048583 drwx------.  2 1000 1000 4096 Nov 10 09:19 dir-1000
1048584 drwx------.  2 3000 3000 4096 Nov 10 09:19 dir-3000
1048579 -rw-------.  1 0    0       0 Nov 10 09:19 file-0
1048580 -rw-------.  1 1000 1000    0 Nov 10 09:19 file-1000
1048581 -rw-------.  1 3000 3000    0 Nov 10 09:19 file-3000

+ : 'Files on the NFS mountpoint (from container PoV):'
+ nsenter -U -m -n -t 4002 sh -c 'ls -lani /mnt/nfs'
total 20
1048578 drwxr-xr-x. 5 0     0     4096 Nov 10 09:19 .
 786433 drwxr-xr-x. 3 65534 65534 4096 May 16 16:08 ..
1048582 drwx------. 2 0     0     4096 Nov 10 09:19 dir-0
1048583 drwx------. 2 65534 65534 4096 Nov 10 09:19 dir-1000
1048584 drwx------. 2 65534 65534 4096 Nov 10 09:19 dir-3000
1048579 -rw-------. 1 0     0        0 Nov 10 09:19 file-0
1048580 -rw-------. 1 65534 65534    0 Nov 10 09:19 file-1000
1048581 -rw-------. 1 65534 65534    0 Nov 10 09:19 file-3000

+ : 'Files on the NFS mountpoint (from host PoV):'
+ ls -lani /mnt/nfs/
total 20
1048578 drwxr-xr-x. 5 1000       1000       4096 Nov 10 09:19 .
 786433 drwxr-xr-x. 3 0          0          4096 May 16 16:08 ..
1048582 drwx------. 2 1000       1000       4096 Nov 10 09:19 dir-0
1048583 drwx------. 2 4294967294 4294967294 4096 Nov 10 09:19 dir-1000
1048584 drwx------. 2 4294967294 4294967294 4096 Nov 10 09:19 dir-3000
1048579 -rw-------. 1 1000       1000          0 Nov 10 09:19 file-0
1048580 -rw-------. 1 4294967294 4294967294    0 Nov 10 09:19 file-1000
1048581 -rw-------. 1 4294967294 4294967294    0 Nov 10 09:19 file-3000

Alban

On Mon, 2 Nov 2020 at 18:48, Sargun Dhillon <sargun@sargun.me> wrote:
>
> This is effectively a resend, but re-based atop Anna's current tree. I
> can add the samples back in another patchset.
>
> Right now, it is possible to mount NFS with a non-matching super block
> user ns and NFS sunrpc user ns. This (for the user) results in an
> awkward set of interactions if using anything other than auth_null,
> where the UIDs being sent to the server are different from the local
> UIDs being checked. This can cause "breakage", where if you try to
> communicate with the NFS server with any other set of mappings, it
> breaks.
>
> This is after the initial v5.10 merge window, so hopefully this
> patchset can be reconsidered, and maybe we can make forward progress? I
> think that it takes a relatively conservative approach to enabling user
> namespaces, and it prevents the case where someone is using auth_gss
> (for now), as the mappings are non-trivial.
>
> Changes since v3:
> * Rebase atop Anna's tree
> Changes since v2:
> * Removed samples
> * Split out NFSv2/v3 patchset from NFSv4 patchset
> * Added restrictions around use
> Changes since v1:
> * Added samples
>
> Sargun Dhillon (2):
>   NFS: NFSv2/NFSv3: Use cred from fs_context during mount
>   NFSv4: Refactor NFS to use user namespaces
>
>  fs/nfs/client.c     | 10 ++++++++--
>  fs/nfs/nfs4client.c | 27 ++++++++++++++++++++++++++-
>  fs/nfs/nfs4idmap.c  |  2 +-
>  fs/nfs/nfs4idmap.h  |  3 ++-
>  4 files changed, 37 insertions(+), 5 deletions(-)
>
>
> base-commit: 8c39076c276be0b31982e44654e2c2357473258a
> --
> 2.25.1
>
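For reference, here is a minimal C sketch of the fd-passing flow Alban
describes (fsopen() in the container's namespaces; fsconfig(), fsmount()
and move_mount() in the host namespaces). This is an illustration, not
the actual reproducer: the container pid, mount options and lack of error
handling are assumptions, and raw syscall wrappers are used since older
glibc does not provide fsopen() and friends.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/syscall.h>
  #include <sys/uio.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define CONTAINER_PID "4002"	/* assumption: pid inside the container */

  /* Raw wrappers; recent glibc (>= 2.36) also exposes these directly. */
  static int sys_fsopen(const char *fs, unsigned int flags)
  { return syscall(SYS_fsopen, fs, flags); }
  static int sys_fsconfig(int fd, unsigned int cmd, const char *key,
  			const void *value, int aux)
  { return syscall(SYS_fsconfig, fd, cmd, key, value, aux); }
  static int sys_fsmount(int fd, unsigned int flags, unsigned int attrs)
  { return syscall(SYS_fsmount, fd, flags, attrs); }
  static int sys_move_mount(int from_dfd, const char *from_path, int to_dfd,
  			  const char *to_path, unsigned int flags)
  { return syscall(SYS_move_mount, from_dfd, from_path, to_dfd, to_path, flags); }

  /* Pass one fd over a unix socketpair with SCM_RIGHTS. */
  static void send_fd(int sock, int fd)
  {
  	char c = 0;
  	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
  	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
  	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
  			      .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
  	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  	cmsg->cmsg_level = SOL_SOCKET;
  	cmsg->cmsg_type = SCM_RIGHTS;
  	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
  	sendmsg(sock, &msg, 0);
  }

  static int recv_fd(int sock)
  {
  	char c;
  	int fd;
  	struct iovec iov = { .iov_base = &c, .iov_len = 1 };
  	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
  	struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
  			      .msg_control = u.buf, .msg_controllen = sizeof(u.buf) };
  	recvmsg(sock, &msg, 0);
  	memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(int));
  	return fd;
  }

  int main(void)
  {
  	int sk[2], sfd, mfd;

  	socketpair(AF_UNIX, SOCK_STREAM, 0, sk);
  	if (fork() == 0) {
  		/* Child: enter the container's user, mount and net
  		 * namespaces, then create the fs context there. */
  		const char *types[] = { "user", "mnt", "net" };
  		for (int i = 0; i < 3; i++) {
  			char p[64];
  			int ns;
  			snprintf(p, sizeof(p), "/proc/%s/ns/%s",
  				 CONTAINER_PID, types[i]);
  			ns = open(p, O_RDONLY);
  			setns(ns, 0);	/* 0: accept any namespace type */
  			close(ns);
  		}
  		send_fd(sk[0], sys_fsopen("nfs4", FSOPEN_CLOEXEC));
  		_exit(0);
  	}

  	/* Parent (host namespaces): configure and attach the mount,
  	 * using the options from the trace above. */
  	sfd = recv_fd(sk[1]);
  	wait(NULL);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "source", "127.0.0.1:/server", 0);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "addr", "127.0.0.1", 0);
  	sys_fsconfig(sfd, FSCONFIG_SET_STRING, "clientaddr", "127.0.0.1", 0);
  	sys_fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
  	mfd = sys_fsmount(sfd, FSMOUNT_CLOEXEC, 0);
  	sys_move_mount(mfd, "", AT_FDCWD, "/mnt/nfs", MOVE_MOUNT_F_EMPTY_PATH);
  	return 0;
  }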
On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> Hi,
>
> I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> share with a different user namespace. fsopen() is done in the
> container namespaces (user, mnt and net namespaces) while fsconfig(),
> fsmount() and move_mount() are done in the host namespaces. The mount
> on the host is made available in the container via mount propagation
> from the host mount.
>
> With this, the files on the NFS server with uid 0 are available in the
> container with uid 0. On the host, they are available with uid
> 4294967294 (make_kuid(&init_user_ns, -2)).

Can someone please tell me what is broken with the _current_ design
before we start trying to push "fixes" that clearly break it?

The current design assumes that the user namespace being used is the one
where the mount itself is performed. That means that the uids and gids,
or usernames and groupnames, that go on the wire match the uids and gids
of the container in which the mount occurred.

The assumption is that the server has authenticated that client as
belonging to a domain that it recognises (either through strong
RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
addresses to a list of acceptable clients).

If you go ahead and change the user namespace on the client without
going through the mount process again to mount a different super block
with a different user namespace, then you will get the exact same
behaviour as with any other filesystem.
On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> On Tue, 2020-11-10 at 17:43 +0100, Alban Crequy wrote:
> > Hi,
> >
> > I tested the patches on top of 5.10.0-rc3+ and I could mount an NFS
> > share with a different user namespace. [...]
>
> Can someone please tell me what is broken with the _current_ design
> before we start trying to push "fixes" that clearly break it?

Currently the mechanism of mounting nfs4 in a user namespace is as
follows:

Parent: fork()
Child:  setns(userns)
C:      fsopen("nfs4") = 3
C->P:   Send FD 3
P:      fsconfig...
P:      fsmount... (This is where the CAP_SYS_ADMIN check happens)

Right now, when you mount an NFS filesystem in a non-init user namespace
and UIDs / GIDs are in use, the UIDs / GIDs which are sent to the server
are not the UIDs from the mounting namespace; instead they are the UIDs
from the init user ns.

The reason for this is that you can call fsopen("nfs4") in the
unprivileged namespace, and that configures the fs_context with all the
right information for that user namespace, but we currently require
CAP_SYS_ADMIN in the init user namespace to call fsmount(). This means
that the superblock's user namespace is set "correctly" to the container,
but there's absolutely no way for nfs4idmap to consume an unprivileged
user namespace.

This behaviour happens "the other way" as well, where the UID in the
container may be 0, but the corresponding kuid is 1000. When a response
from an NFS server comes in, we decode it according to the idmap
userns[1]. The userns used to create the idmap is captured at fsmount()
time, not at fsopen() time. So, even if the filesystem is in the user
namespace and the server responds with UID 0, it'll come up as an
unmapped UID. This is because we do:

Server UID 0 -> idmap make_kuid(init_user_ns, 0) -> VFS from_kuid(container_ns, 0) -> invalid uid

This is broken behaviour, in my humble opinion, as it makes it impossible
to use NFSv4 (and v3 for that matter) out of the box with unprivileged
user namespaces. At least in our environment, using usernames / GSS isn't
an option, so we have to rely on UIDs being set correctly [at least from
the container's perspective].

> The current design assumes that the user namespace being used is the
> one where the mount itself is performed. That means that the uids and
> gids or usernames and groupnames that go on the wire match the uids and
> gids of the container in which the mount occurred.

Right now, NFS does not have the ability for the fsmount() call to be
made in an unprivileged user namespace. We can change that behaviour
elsewhere if we want, but it's orthogonal to this.

> The assumption is that the server has authenticated that client as
> belonging to a domain that it recognises (either through strong
> RPCSEC_GSS/krb5 authentication, or through weaker matching of IP
> addresses to a list of acceptable clients).

I added a rejection for upcalls because upcalls can happen in the init
namespaces. We can drop that restriction from the nfs4 patch if you'd
like. I *believe* (and I'm more than a little out of my depth here) that
the request-key handler gets called with the *network namespace* of the
NFS mount, but the userns is a privileged one, allowing for potential
hazards.

The reason I added that block there is that I didn't imagine anyone was
running NFS in an unprivileged user namespace and relying on upcalls
(potentially into privileged namespaces) in order to do authz.

> If you go ahead and change the user namespace on the client without
> going through the mount process again to mount a different super block
> with a different user namespace, then you will get the exact same
> behaviour as with any other filesystem.

Not exactly, because other filesystems *only* use the s_user_ns for
conversion of UIDs, whereas NFS uses the current_cred() acquired at mount
time, which doesn't match s_user_ns, leading to this behaviour:

1. Mistranslated UIDs when encoding RPCs
2. The UIDs / GIDs exposed to the VFS do not match the user ns.

> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

Thanks,
Sargun

[1]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4idmap.c#L782
[2]: https://elixir.bootlin.com/linux/v5.9.8/source/fs/nfs/nfs4client.c#L1154
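To make the translation chain above concrete, here is a sketch of the
semantics in kernel terms (illustrative only, not a quote of
fs/nfs/nfs4idmap.c; `container_userns` is a hypothetical name for a
container user namespace in which host uid 1000 maps to container uid 0):

  /* The idmap was created against init_user_ns at fsmount() time, so a
   * wire uid of 0 from the server becomes kuid 0: */
  kuid_t kuid = make_kuid(&init_user_ns, 0);       /* kuid 0 */

  /* The VFS then converts for the superblock's user namespace, where
   * only kuid 1000 has a mapping (to container uid 0). kuid 0 has no
   * mapping, so from_kuid() returns (uid_t)-1, the invalid uid: */
  uid_t shown = from_kuid(container_userns, kuid); /* = (uid_t)-1 */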
On Wed, 2020-11-11 at 11:12 +0000, Sargun Dhillon wrote:
> On Tue, Nov 10, 2020 at 08:12:01PM +0000, Trond Myklebust wrote:
> > Can someone please tell me what is broken with the _current_ design
> > before we start trying to push "fixes" that clearly break it?
>
> Currently the mechanism of mounting nfs4 in a user namespace is as
> follows:
>
> Parent: fork()
> Child:  setns(userns)
> C:      fsopen("nfs4") = 3
> C->P:   Send FD 3
> P:      fsconfig...
> P:      fsmount... (This is where the CAP_SYS_ADMIN check happens)
>
> [...]
>
> This is broken behaviour, in my humble opinion, as it makes it
> impossible to use NFSv4 (and v3 for that matter) out of the box with
> unprivileged user namespaces. At least in our environment, using
> usernames / GSS isn't an option, so we have to rely on UIDs being set
> correctly [at least from the container's perspective].

The current code for setting server->cred was developed independently of
fsopen() (and predates it, actually). I'm fine with the change to have
server->cred be the cred of the user that called fsopen(). That's in line
with what we used to do for sys_mount().

However, all the other stuff to throw errors when the user namespace is
not init_user_ns introduces massive regressions.

> Right now, NFS does not have the ability for the fsmount() call to be
> made in an unprivileged user namespace. We can change that behaviour
> elsewhere if we want, but it's orthogonal to this.
>
> I added a rejection for upcalls because upcalls can happen in the init
> namespaces. We can drop that restriction from the nfs4 patch if you'd
> like. I *believe* (and I'm more than a little out of my depth here)
> that the request-key handler gets called with the *network namespace*
> of the NFS mount, but the userns is a privileged one, allowing for
> potential hazards.

The idmapper already rejects upcalls to the keyring '/sbin/request-key'
utility if you're running with your own user namespace.

Quite frankly, switching to using the keyring was a mistake which I'd
undo if I could. Aside from not supporting containers, it is horribly
slow due to requiring a full process startup/teardown for every upcall,
so it scales poorly to large numbers of identities (particularly with an
operation like readdir() in which you're doing serial upcalls).

However, nothing stops you from using the old NFSv4 idmapper daemon
(a.k.a. rpc.idmapd) in the context of the container that called fsopen()
so that it can translate identities correctly using whatever userspace
tools (ldap, sssd, winbind, ...) the container has configured.

> The reason I added that block there is that I didn't imagine anyone was
> running NFS in an unprivileged user namespace and relying on upcalls
> (potentially into privileged namespaces) in order to do authz.
>
> Not exactly, because other filesystems *only* use the s_user_ns for
> conversion of UIDs, whereas NFS uses the current_cred() acquired at
> mount time, which doesn't match s_user_ns, leading to this behaviour:
>
> 1. Mistranslated UIDs when encoding RPCs
> 2. The UIDs / GIDs exposed to the VFS do not match the user ns.
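A minimal sketch of the change Trond says he is fine with (an
illustration under stated assumptions, not the actual patch): fs_context
records the fsopen() caller's credentials in fc->cred when the context is
allocated, so the NFS mount path could take a reference to that instead
of sampling current_cred() at fsmount() time:

  /* In the NFS server setup path: use the cred captured at fsopen()
   * time (fc->cred) rather than current_cred() at fsmount() time. */
  server->cred = get_cred(fc->cred);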
On Wed, Nov 11, 2020 at 02:38:11PM +0000, Trond Myklebust wrote:
> The current code for setting server->cred was developed independently
> of fsopen() (and predates it, actually). I'm fine with the change to
> have server->cred be the cred of the user that called fsopen(). That's
> in line with what we used to do for sys_mount().

Just curious: without FS_USERNS, how were you mounting NFSv4 in an
unprivileged user ns?

> However, all the other stuff to throw errors when the user namespace is
> not init_user_ns introduces massive regressions.

I can remove that and respin the patch. How do you feel about that? I
would still like to keep the log lines, though, because it is a uapi
change. I am worried that someone might exercise this path with GSS and
allow for upcalls into the main namespaces by accident -- or be confused
about why they're seeing upcalls "in a different namespace".

Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from fs_context
during mount") without any changes?

I can respin ("NFSv4: Refactor NFS to use user namespaces") without:

	/*
	 * nfs4idmap is not fully isolated by user namespaces. It is currently
	 * only network namespace aware. If upcalls never happen, we do not
	 * need to worry as nfs_client instances aren't shared between
	 * user namespaces.
	 */
	if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
	    !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
		error = -EINVAL;
		errorf(fc, "Mount credentials are from non init user namespace and ID mapping is enabled. This is not allowed.");
		goto error;
	}

(and making it so we can call idmap_userns)

> The idmapper already rejects upcalls to the keyring '/sbin/request-key'
> utility if you're running with your own user namespace.
>
> Quite frankly, switching to using the keyring was a mistake which I'd
> undo if I could. [...]
>
> However, nothing stops you from using the old NFSv4 idmapper daemon
> (a.k.a. rpc.idmapd) in the context of the container that called
> fsopen() so that it can translate identities correctly using whatever
> userspace tools (ldap, sssd, winbind, ...) the container has
> configured.

1. We see this [upcalls] as a potential security risk: they reach into
the unconfined portion of the system. Although I'm sure that the
userspace handlers are written perfectly well, it allows for information
leakage to occur.

2. Is there a way to do this for NFSv3?

3. Can rpc.idmapd get the user namespace that the call is from (and is
the keyring per-userns)?

In general, I think that this change follows the principle of least
surprise.
On Wed, 2020-11-11 at 18:57 +0000, Sargun Dhillon wrote:
> Just curious: without FS_USERNS, how were you mounting NFSv4 in an
> unprivileged user ns?

The code was originally developed on a 5.1 kernel, so all my testing has
been with ordinary sys_mount() calls in a container that had
CAP_SYS_ADMIN privileges.

> I can remove that and respin the patch. How do you feel about that? I
> would still like to keep the log lines, though, because it is a uapi
> change. I am worried that someone might exercise this path with GSS and
> allow for upcalls into the main namespaces by accident -- or be
> confused about why they're seeing upcalls "in a different namespace".
>
> Are you okay with picking up ("NFS: NFSv2/NFSv3: Use cred from
> fs_context during mount") without any changes?

Why do we need the dprintk()s? It seems to me that either they should be
reporting something that the user needs to know (in which case they
should be real printk()s) or they are telling us something that we should
already know. To me they seem to fit more in the latter category.

> I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
>
> 	if (idmap_userns(server->nfs_client->cl_idmap) != &init_user_ns &&
> 	    !(server->caps & NFS_CAP_UIDGID_NOMAP)) {
> 		[...]
> 	}
>
> (and making it so we can call idmap_userns)

Yes. That would be acceptable. Again, though, I'd like to see the
dprintk()s gone.
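For context on the keyring path discussed above: with the keyring-based
idmapper, the kernel upcalls /sbin/request-key, which dispatches
id_resolver keys according to /etc/request-key.conf, typically via a line
like the following (the standard configuration described in nfsidmap(5);
the exact binary path and timeout value vary by distribution):

  # /etc/request-key.conf: hand NFSv4 id_resolver keys to nfsidmap,
  # caching each result in the keyring for 600 seconds
  create	id_resolver	*	*	/usr/sbin/nfsidmap -t 600 %k %d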
On Wed, Nov 11, 2020 at 08:03:18PM +0000, Trond Myklebust wrote:
> Why do we need the dprintk()s? It seems to me that either they should
> be reporting something that the user needs to know (in which case they
> should be real printk()s) or they are telling us something that we
> should already know. To me they seem to fit more in the latter
> category.
>
> > I can respin ("NFSv4: Refactor NFS to use user namespaces") without:
> > [...]
> > (and making it so we can call idmap_userns)
>
> Yes. That would be acceptable. Again, though, I'd like to see the
> dprintk()s gone.

I can drop the dprintks, but given this is a uapi change, does it make
sense to use pr_info_once()? Especially since this can have security
impact.
On Thu, Nov 12, 2020 at 12:30:56AM +0000, Sargun Dhillon wrote:
> I can drop the dprintks, but given this is a uapi change, does it make
> sense to use pr_info_once()? Especially since this can have security
> impact.

Having spent five minutes thinking about this, I think that is best done
in a separate patch that I can spin, and we can discuss it there.
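For illustration, a one-shot notice along the lines Sargun proposes might
look like the following (a sketch only; the message text and its
placement are assumptions, not part of the posted patches):

  /* pr_info_once() logs only on the first hit, so repeated mounts from
   * non-init user namespaces don't flood the log the way a printk()
   * per mount would: */
  pr_info_once("NFS: mount credentials come from a non-init user namespace\n");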