[v5,0/2] NFS: Fix interaction between fs_context and user namespaces

Message ID	20201112100952.3514-1-sargun@sargun.me (mailing list archive)
Headers	show Return-Path: <linux-nfs-owner@kernel.org> From: Sargun Dhillon <sargun@sargun.me> To: "J . Bruce Fields" <bfields@fieldses.org>, Chuck Lever <chuck.lever@oracle.com>, Trond Myklebust <trond.myklebust@hammerspace.com>, Anna Schumaker <anna.schumaker@netapp.com>, David Howells <dhowells@redhat.com>, Scott Mayhew <smayhew@redhat.com> Cc: mauricio@kinvolk.io, Alban Crequy <alban.crequy@gmail.com>, Sargun Dhillon <sargun@sargun.me>, linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Kyle Anderson <kylea@netflix.com> Subject: [PATCH v5 0/2] NFS: Fix interaction between fs_context and user namespaces Date: Thu, 12 Nov 2020 02:09:50 -0800 Message-Id: <20201112100952.3514-1-sargun@sargun.me> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	NFS: Fix interaction between fs_context and user namespaces \| expand [v5,0/2] NFS: Fix interaction between fs_context and user namespaces [v5,1/2] NFS: NFSv2/NFSv3: Use cred from fs_context during mount [v5,2/2] NFSv4: Refactor to use user namespaces for nfs4idmap

Message ID

20201112100952.3514-1-sargun@sargun.me (mailing list archive)

Headers

From: Sargun Dhillon <sargun@sargun.me>
To: "J . Bruce Fields" <bfields@fieldses.org>,
        Chuck Lever <chuck.lever@oracle.com>,
        Trond Myklebust <trond.myklebust@hammerspace.com>,
        Anna Schumaker <anna.schumaker@netapp.com>,
        David Howells <dhowells@redhat.com>,
        Scott Mayhew <smayhew@redhat.com>
Cc: mauricio@kinvolk.io, Alban Crequy <alban.crequy@gmail.com>,
        Sargun Dhillon <sargun@sargun.me>, linux-nfs@vger.kernel.org,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
        Kyle Anderson <kylea@netflix.com>
Subject: [PATCH v5 0/2] NFS: Fix interaction between fs_context and user
 namespaces
Date: Thu, 12 Nov 2020 02:09:50 -0800
Message-Id: <20201112100952.3514-1-sargun@sargun.me>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

NFS: Fix interaction between fs_context and user namespaces | expand

Message

Sargun Dhillon Nov. 12, 2020, 10:09 a.m. UTC

Right now, it is possible to mount NFS with an non-matching super block
user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
set of interactions if using anything other than auth_null, where the UIDs
being sent to the server are different than the local UIDs being checked.
This can cause "breakage", where if you try to communicate with the NFS
server with any other set of mappings, it breaks.

The reason for this is that you can call fsopen("nfs4") in the unprivileged
namespace, and that configures fs_context with all the right information
for that user namespace. In addition, it also keeps a gets a cred object
associated with the caller -- which should match the user namespace.
Unfortunately, the mount has to be finished in the init_user_ns because we
currently require CAP_SYS_ADMIN in the init user namespace to call fsmount.
This means that the superblock's user namespace is set "correctly" to the
container, but there's absolutely no way nfs4idmap to consume an
unprivileged user namespace because the cred / user_ns that's passed down
to nfs4idmap is the one at fsmount.

How this actually exhibits is let's say that the UID 0 in the user
namespace is mapped to UID 1000 in the init user ns (and kuid space). What
will happen is that nfs4idmap will translate the UID 1000 into UID 0 on the
wire, even if the mount is in entirely in the mount / user namespace of the
container.

So, it looks something like this
Client in unprivileged User NS (UID: 0, KUID: 0)
	->Perform open()
		...VFS / NFS bits...
		nfs_map_uid_to_name ->
			from_kuid_munged(init_user_ns, uid) (returns 0)
				RPC with UID 0

This behaviour happens "the other way" as well, where the UID in the
container may be 0, but the corresponding kuid is 1000. When a response
from an NFS server comes in we decode it according to the idmap userns.
The way this exhibits is even more odd.

Server responds with file attribute (UID: 0, GID: 0)
	->nfs_map_name_to_uid(..., 0)
		->make_kuid(init_user_ns, id) (returns 0)
			....VFS / NFS Bits...
			->from_kuid(container_ns, 0) -> invalid uid
				-> EOVERFLOW

This changes the nfs server to use the cred / userns from fs_context, which
is how idmap is constructed. This subsequently is used in the above
described flow of converting uids back-and-forth.

Trond gave the feedback that this behaviour [implemented by this patch] is
how the legacy sys_mount() behaviour worked[1], and that the intended
behaviour is for UIDs to be plumbed through entirely, where the user
namespaces UIDs are what is sent over the wire, and not the init user ns.

[1]: https://lore.kernel.org/linux-nfs/8feccf45f6575a204da03e796391cc135283eb88.camel@hammerspace.com/

Sargun Dhillon (2):
  NFS: NFSv2/NFSv3: Use cred from fs_context during mount
  NFSv4: Refactor to use user namespaces for nfs4idmap

 fs/nfs/client.c     | 4 ++--
 fs/nfs/nfs4client.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)


base-commit: 8c39076c276be0b31982e44654e2c2357473258a

Comments

Sargun Dhillon Nov. 13, 2020, 6:46 p.m. UTC | #1

On Thu, Nov 12, 2020 at 02:09:50AM -0800, Sargun Dhillon wrote:
> Right now, it is possible to mount NFS with an non-matching super block
> user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
> set of interactions if using anything other than auth_null, where the UIDs
> being sent to the server are different than the local UIDs being checked.
> This can cause "breakage", where if you try to communicate with the NFS
> server with any other set of mappings, it breaks.
> 
> The reason for this is that you can call fsopen("nfs4") in the unprivileged
> namespace, and that configures fs_context with all the right information
> for that user namespace. In addition, it also keeps a gets a cred object
> associated with the caller -- which should match the user namespace.
> Unfortunately, the mount has to be finished in the init_user_ns because we
> currently require CAP_SYS_ADMIN in the init user namespace to call fsmount.
> This means that the superblock's user namespace is set "correctly" to the
> container, but there's absolutely no way nfs4idmap to consume an
> unprivileged user namespace because the cred / user_ns that's passed down
> to nfs4idmap is the one at fsmount.
> 
> How this actually exhibits is let's say that the UID 0 in the user
> namespace is mapped to UID 1000 in the init user ns (and kuid space). What
> will happen is that nfs4idmap will translate the UID 1000 into UID 0 on the
> wire, even if the mount is in entirely in the mount / user namespace of the
> container.
> 
> So, it looks something like this
> Client in unprivileged User NS (UID: 0, KUID: 0)
> 	->Perform open()
> 		...VFS / NFS bits...
> 		nfs_map_uid_to_name ->
> 			from_kuid_munged(init_user_ns, uid) (returns 0)
> 				RPC with UID 0
> 
> This behaviour happens "the other way" as well, where the UID in the
> container may be 0, but the corresponding kuid is 1000. When a response
> from an NFS server comes in we decode it according to the idmap userns.
> The way this exhibits is even more odd.
> 
> Server responds with file attribute (UID: 0, GID: 0)
> 	->nfs_map_name_to_uid(..., 0)
> 		->make_kuid(init_user_ns, id) (returns 0)
> 			....VFS / NFS Bits...
> 			->from_kuid(container_ns, 0) -> invalid uid
> 				-> EOVERFLOW
> 
> This changes the nfs server to use the cred / userns from fs_context, which
> is how idmap is constructed. This subsequently is used in the above
> described flow of converting uids back-and-forth.
> 
> Trond gave the feedback that this behaviour [implemented by this patch] is
> how the legacy sys_mount() behaviour worked[1], and that the intended
> behaviour is for UIDs to be plumbed through entirely, where the user
> namespaces UIDs are what is sent over the wire, and not the init user ns.
> 
> [1]: https://lore.kernel.org/linux-nfs/8feccf45f6575a204da03e796391cc135283eb88.camel@hammerspace.com/
> 
> Sargun Dhillon (2):
>   NFS: NFSv2/NFSv3: Use cred from fs_context during mount
>   NFSv4: Refactor to use user namespaces for nfs4idmap
> 
>  fs/nfs/client.c     | 4 ++--
>  fs/nfs/nfs4client.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> 
> base-commit: 8c39076c276be0b31982e44654e2c2357473258a
> -- 
> 2.25.1
> 
Trond,

I was just thinking, since you said that this is the behaviour of the sys_mount 
API, would this be considered a regression? Should it go to stable (v5.9)?

Sargun Dhillon Nov. 24, 2020, 6:42 p.m. UTC | #2

On Thu, Nov 12, 2020 at 02:09:50AM -0800, Sargun Dhillon wrote:
> Right now, it is possible to mount NFS with an non-matching super block
> user ns, and NFS sunrpc user ns. This (for the user) results in an awkward
> set of interactions if using anything other than auth_null, where the UIDs
> being sent to the server are different than the local UIDs being checked.
> This can cause "breakage", where if you try to communicate with the NFS
> server with any other set of mappings, it breaks.
> 
> The reason for this is that you can call fsopen("nfs4") in the unprivileged
> namespace, and that configures fs_context with all the right information
> for that user namespace. In addition, it also keeps a gets a cred object
> associated with the caller -- which should match the user namespace.
> Unfortunately, the mount has to be finished in the init_user_ns because we
> currently require CAP_SYS_ADMIN in the init user namespace to call fsmount.
> This means that the superblock's user namespace is set "correctly" to the
> container, but there's absolutely no way nfs4idmap to consume an
> unprivileged user namespace because the cred / user_ns that's passed down
> to nfs4idmap is the one at fsmount.
> 
> How this actually exhibits is let's say that the UID 0 in the user
> namespace is mapped to UID 1000 in the init user ns (and kuid space). What
> will happen is that nfs4idmap will translate the UID 1000 into UID 0 on the
> wire, even if the mount is in entirely in the mount / user namespace of the
> container.
> 
> So, it looks something like this
> Client in unprivileged User NS (UID: 0, KUID: 0)
> 	->Perform open()
> 		...VFS / NFS bits...
> 		nfs_map_uid_to_name ->
> 			from_kuid_munged(init_user_ns, uid) (returns 0)
> 				RPC with UID 0
> 
> This behaviour happens "the other way" as well, where the UID in the
> container may be 0, but the corresponding kuid is 1000. When a response
> from an NFS server comes in we decode it according to the idmap userns.
> The way this exhibits is even more odd.
> 
> Server responds with file attribute (UID: 0, GID: 0)
> 	->nfs_map_name_to_uid(..., 0)
> 		->make_kuid(init_user_ns, id) (returns 0)
> 			....VFS / NFS Bits...
> 			->from_kuid(container_ns, 0) -> invalid uid
> 				-> EOVERFLOW
> 
> This changes the nfs server to use the cred / userns from fs_context, which
> is how idmap is constructed. This subsequently is used in the above
> described flow of converting uids back-and-forth.
> 
> Trond gave the feedback that this behaviour [implemented by this patch] is
> how the legacy sys_mount() behaviour worked[1], and that the intended
> behaviour is for UIDs to be plumbed through entirely, where the user
> namespaces UIDs are what is sent over the wire, and not the init user ns.
> 
> [1]: https://lore.kernel.org/linux-nfs/8feccf45f6575a204da03e796391cc135283eb88.camel@hammerspace.com/
> 
> Sargun Dhillon (2):
>   NFS: NFSv2/NFSv3: Use cred from fs_context during mount
>   NFSv4: Refactor to use user namespaces for nfs4idmap
> 
>  fs/nfs/client.c     | 4 ++--
>  fs/nfs/nfs4client.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> 
> base-commit: 8c39076c276be0b31982e44654e2c2357473258a
> -- 
> 2.25.1
>


Trond,
Are there any other concerns you have before landing this, or do you want
to wait until the v5.11 merge window?