From patchwork Mon Feb 17 20:53:05 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Bottomley <James.Bottomley@HansenPartnership.com>
X-Patchwork-Id: 11387585
Return-Path: <SRS0=6ymt=4F=vger.kernel.org=linux-fsdevel-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 84B7817F0
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Mon, 17 Feb 2020 20:57:25 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 3EF4A207FD
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Mon, 17 Feb 2020 20:57:25 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key)
 header.d=hansenpartnership.com header.i=@hansenpartnership.com
 header.b="SfVbWhMi"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1729861AbgBQU5Y (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Mon, 17 Feb 2020 15:57:24 -0500
Received: from bedivere.hansenpartnership.com ([66.63.167.143]:37780 "EHLO
        bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1728088AbgBQU5Y (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 17 Feb 2020 15:57:24 -0500
Received: from localhost (localhost [127.0.0.1])
        by bedivere.hansenpartnership.com (Postfix) with ESMTP id 26AB68EE218;
        Mon, 17 Feb 2020 12:57:24 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=hansenpartnership.com;
        s=20151216; t=1581973044;
        bh=uMJCegPX72lmQRIcA2jq+ewm1vATTzhjJYMiJzY4cHY=;
        h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
        b=SfVbWhMi7TI40Ww7rJCx+r6gLWT4KNhXSIdSzm8+6fkrZFxZB4OChf5Mt1aPgreq7
         zfQuIF9VwD49fqSACtgaTnvXFgEE5WA/b2l3XkL/bqXIrXaL22/EJIhk4fNVdRpxcT
         lVm6sHB+KQ7BjIeZZ2IWyA/gPQ18a11sd1eTAMCY=
Received: from bedivere.hansenpartnership.com ([127.0.0.1])
        by localhost (bedivere.hansenpartnership.com [127.0.0.1])
 (amavisd-new, port 10024)
        with ESMTP id n5dZMth6iikl; Mon, 17 Feb 2020 12:57:24 -0800 (PST)
Received: from jarvis.lan (jarvis.ext.hansenpartnership.com [153.66.160.226])
        by bedivere.hansenpartnership.com (Postfix) with ESMTP id 638DE8EE0F5;
        Mon, 17 Feb 2020 12:57:23 -0800 (PST)
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: linux-fsdevel@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>,
 Christian Brauner <christian@brauner.io>, Al Viro <viro@ZenIV.linux.org.uk>,
 Miklos Szeredi <miklos@szeredi.hu>,
 Seth Forshee <seth.forshee@canonical.com>, linux-unionfs@vger.kernel.org,
 Amir Goldstein <amir73il@gmail.com>,
 =?utf-8?q?St=C3=A9phane_Graber?= <stgraber@ubuntu.com>,
 Eric Biederman <ebiederm@xmission.com>, Aleksa Sarai <cyphar@cyphar.com>,
 "Serge E . Hallyn" <serge@hallyn.com>, Tycho Andersen <tycho@tycho.ws>,
 containers@lists.linux-foundation.org
Subject: [PATCH v3 1/3] fs: rethread notify_change to take a path instead of a
 dentry
Date: Mon, 17 Feb 2020 12:53:05 -0800
Message-Id: <20200217205307.32256-2-James.Bottomley@HansenPartnership.com>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20200217205307.32256-1-James.Bottomley@HansenPartnership.com>
References: <20200217205307.32256-1-James.Bottomley@HansenPartnership.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

In order to prepare for implementing shiftfs as a property changing
bind mount, the path (which contains the vfsmount) must be threaded
through everywhere we are going to do either a permission check or an
attribute get/set so that we can arrange for the credentials for the
operation to be based on the bind mount properties rather than those
of current.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---

v2: fix issues found by Amir Goldstein
v3: add reviews
---
 drivers/base/devtmpfs.c   |  8 ++++++--
 fs/attr.c                 |  4 +++-
 fs/cachefiles/interface.c |  6 ++++--
 fs/coredump.c             |  4 ++--
 fs/ecryptfs/inode.c       |  9 ++++++---
 fs/inode.c                |  7 ++++---
 fs/namei.c                |  2 +-
 fs/nfsd/vfs.c             | 13 ++++++++-----
 fs/open.c                 | 19 ++++++++++---------
 fs/overlayfs/copy_up.c    | 40 ++++++++++++++++++++++++----------------
 fs/overlayfs/dir.c        | 10 ++++++++--
 fs/overlayfs/inode.c      |  6 ++++--
 fs/overlayfs/overlayfs.h  |  2 +-
 fs/overlayfs/super.c      |  3 ++-
 fs/utimes.c               |  2 +-
 include/linux/fs.h        |  7 +++----
 16 files changed, 87 insertions(+), 55 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index c9017e0584c0..e323b55721a1 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -214,13 +214,17 @@ static int handle_create(const char *nodename, umode_t mode, kuid_t uid,
 	err = vfs_mknod(d_inode(path.dentry), dentry, mode, dev->devt);
 	if (!err) {
 		struct iattr newattrs;
+		struct path newpath = {
+			.mnt = path.mnt,
+			.dentry = dentry,
+		};
 
 		newattrs.ia_mode = mode;
 		newattrs.ia_uid = uid;
 		newattrs.ia_gid = gid;
 		newattrs.ia_valid = ATTR_MODE|ATTR_UID|ATTR_GID;
 		inode_lock(d_inode(dentry));
-		notify_change(dentry, &newattrs, NULL);
+		notify_change(&newpath, &newattrs, NULL);
 		inode_unlock(d_inode(dentry));
 
 		/* mark as kernel-created inode */
@@ -327,7 +331,7 @@ static int handle_remove(const char *nodename, struct device *dev)
 			newattrs.ia_valid =
 				ATTR_UID|ATTR_GID|ATTR_MODE;
 			inode_lock(d_inode(dentry));
-			notify_change(dentry, &newattrs, NULL);
+			notify_change(&p, &newattrs, NULL);
 			inode_unlock(d_inode(dentry));
 			err = vfs_unlink(d_inode(parent.dentry), dentry, NULL);
 			if (!err || err == -ENOENT)
diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..11201ab7e3b1 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -220,8 +220,10 @@ EXPORT_SYMBOL(setattr_copy);
  * the file open for write, as there can be no conflicting delegation in
  * that case.
  */
-int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **delegated_inode)
+int notify_change(const struct path *path, struct iattr * attr,
+		  struct inode **delegated_inode)
 {
+	struct dentry *dentry = path->dentry;
 	struct inode *inode = dentry->d_inode;
 	umode_t mode = inode->i_mode;
 	int error;
diff --git a/fs/cachefiles/interface.c b/fs/cachefiles/interface.c
index 4cea5fbf695e..f11216f59a56 100644
--- a/fs/cachefiles/interface.c
+++ b/fs/cachefiles/interface.c
@@ -436,6 +436,7 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 	uint64_t ni_size;
 	loff_t oi_size;
 	int ret;
+	struct path path;
 
 	ni_size = _object->store_limit_l;
 
@@ -466,18 +467,19 @@ static int cachefiles_attr_changed(struct fscache_object *_object)
 	/* if there's an extension to a partial page at the end of the backing
 	 * file, we need to discard the partial page so that we pick up new
 	 * data after it */
+	path = (struct path){ .mnt = cache->mnt, .dentry = object->backer };
 	if (oi_size & ~PAGE_MASK && ni_size > oi_size) {
 		_debug("discard tail %llx", oi_size);
 		newattrs.ia_valid = ATTR_SIZE;
 		newattrs.ia_size = oi_size & PAGE_MASK;
-		ret = notify_change(object->backer, &newattrs, NULL);
+		ret = notify_change(&path, &newattrs, NULL);
 		if (ret < 0)
 			goto truncate_failed;
 	}
 
 	newattrs.ia_valid = ATTR_SIZE;
 	newattrs.ia_size = ni_size;
-	ret = notify_change(object->backer, &newattrs, NULL);
+	ret = notify_change(&path, &newattrs, NULL);
 
 truncate_failed:
 	inode_unlock(d_inode(object->backer));
diff --git a/fs/coredump.c b/fs/coredump.c
index f8296a82d01d..0334050e27e4 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -775,7 +775,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 			goto close_fail;
 		if (!(cprm.file->f_mode & FMODE_CAN_WRITE))
 			goto close_fail;
-		if (do_truncate(cprm.file->f_path.dentry, 0, 0, cprm.file))
+		if (do_truncate(&cprm.file->f_path, 0, 0, cprm.file))
 			goto close_fail;
 	}
 
@@ -879,7 +879,7 @@ void dump_truncate(struct coredump_params *cprm)
 	if (file->f_op->llseek && file->f_op->llseek != no_llseek) {
 		offset = file->f_op->llseek(file, 0, SEEK_CUR);
 		if (i_size_read(file->f_mapping->host) < offset)
-			do_truncate(file->f_path.dentry, offset, 0, file);
+			do_truncate(&file->f_path, offset, 0, file);
 	}
 }
 EXPORT_SYMBOL(dump_truncate);
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index e23752d9a79f..3bc67c478163 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -852,10 +852,11 @@ int ecryptfs_truncate(struct dentry *dentry, loff_t new_length)
 
 	rc = truncate_upper(dentry, &ia, &lower_ia);
 	if (!rc && lower_ia.ia_valid & ATTR_SIZE) {
-		struct dentry *lower_dentry = ecryptfs_dentry_to_lower(dentry);
+		struct path *lower_path = ecryptfs_dentry_to_lower_path(dentry);
+		struct dentry *lower_dentry = lower_path->dentry;
 
 		inode_lock(d_inode(lower_dentry));
-		rc = notify_change(lower_dentry, &lower_ia, NULL);
+		rc = notify_change(lower_path, &lower_ia, NULL);
 		inode_unlock(d_inode(lower_dentry));
 	}
 	return rc;
@@ -883,6 +884,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 {
 	int rc = 0;
 	struct dentry *lower_dentry;
+	struct path *lower_path;
 	struct iattr lower_ia;
 	struct inode *inode;
 	struct inode *lower_inode;
@@ -897,6 +899,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 	inode = d_inode(dentry);
 	lower_inode = ecryptfs_inode_to_lower(inode);
 	lower_dentry = ecryptfs_dentry_to_lower(dentry);
+	lower_path = ecryptfs_dentry_to_lower_path(dentry);
 	mutex_lock(&crypt_stat->cs_mutex);
 	if (d_is_dir(dentry))
 		crypt_stat->flags &= ~(ECRYPTFS_ENCRYPTED);
@@ -959,7 +962,7 @@ static int ecryptfs_setattr(struct dentry *dentry, struct iattr *ia)
 		lower_ia.ia_valid &= ~ATTR_MODE;
 
 	inode_lock(d_inode(lower_dentry));
-	rc = notify_change(lower_dentry, &lower_ia, NULL);
+	rc = notify_change(lower_path, &lower_ia, NULL);
 	inode_unlock(d_inode(lower_dentry));
 out:
 	fsstack_copy_attr_all(inode, lower_inode);
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..be14d3fcbee1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1821,7 +1821,7 @@ int dentry_needs_remove_privs(struct dentry *dentry)
 	return mask;
 }
 
-static int __remove_privs(struct dentry *dentry, int kill)
+static int __remove_privs(struct path *path, int kill)
 {
 	struct iattr newattrs;
 
@@ -1830,7 +1830,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
 	 * Note we call this on write, so notify_change will not
 	 * encounter any conflicting delegations:
 	 */
-	return notify_change(dentry, &newattrs, NULL);
+	return notify_change(path, &newattrs, NULL);
 }
 
 /*
@@ -1839,6 +1839,7 @@ static int __remove_privs(struct dentry *dentry, int kill)
  */
 int file_remove_privs(struct file *file)
 {
+	struct path *path = &file->f_path;
 	struct dentry *dentry = file_dentry(file);
 	struct inode *inode = file_inode(file);
 	int kill;
@@ -1857,7 +1858,7 @@ int file_remove_privs(struct file *file)
 	if (kill < 0)
 		return kill;
 	if (kill)
-		error = __remove_privs(dentry, kill);
+		error = __remove_privs(path, kill);
 	if (!error)
 		inode_has_no_xattr(inode);
 
diff --git a/fs/namei.c b/fs/namei.c
index db6565c99825..531ac55c7e67 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3080,7 +3080,7 @@ static int handle_truncate(struct file *filp)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error) {
-		error = do_truncate(path->dentry, 0,
+		error = do_truncate(path, 0,
 				    ATTR_MTIME|ATTR_CTIME|ATTR_OPEN,
 				    filp);
 	}
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..a51a69e0cc87 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -366,8 +366,8 @@ __be32
 nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 	     int check_guard, time64_t guardtime)
 {
-	struct dentry	*dentry;
 	struct inode	*inode;
+	struct path	path;
 	int		accmode = NFSD_MAY_SATTR;
 	umode_t		ftype = 0;
 	__be32		err;
@@ -406,8 +406,11 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 			goto out;
 	}
 
-	dentry = fhp->fh_dentry;
-	inode = d_inode(dentry);
+	path = (struct path) {
+		.mnt = fhp->fh_export->ex_path.mnt,
+		.dentry = fhp->fh_dentry,
+	};
+	inode = d_inode(path.dentry);
 
 	/* Ignore any mode updates on symlinks */
 	if (S_ISLNK(inode->i_mode))
@@ -448,7 +451,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 			.ia_size	= iap->ia_size,
 		};
 
-		host_err = notify_change(dentry, &size_attr, NULL);
+		host_err = notify_change(&path, &size_attr, NULL);
 		if (host_err)
 			goto out_unlock;
 		iap->ia_valid &= ~ATTR_SIZE;
@@ -463,7 +466,7 @@ nfsd_setattr(struct svc_rqst *rqstp, struct svc_fh *fhp, struct iattr *iap,
 	}
 
 	iap->ia_valid |= ATTR_CTIME;
-	host_err = notify_change(dentry, iap, NULL);
+	host_err = notify_change(&path, iap, NULL);
 
 out_unlock:
 	fh_unlock(fhp);
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..db6758b9636a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -35,11 +35,12 @@
 
 #include "internal.h"
 
-int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
+int do_truncate(const struct path *path, loff_t length, unsigned int time_attrs,
 	struct file *filp)
 {
 	int ret;
 	struct iattr newattrs;
+	struct dentry *dentry = path->dentry;
 
 	/* Not pretty: "inode->i_size" shouldn't really be signed. But it is. */
 	if (length < 0)
@@ -61,7 +62,7 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
-	ret = notify_change(dentry, &newattrs, NULL);
+	ret = notify_change(path, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
 	return ret;
 }
@@ -107,7 +108,7 @@ long vfs_truncate(const struct path *path, loff_t length)
 	if (!error)
 		error = security_path_truncate(path);
 	if (!error)
-		error = do_truncate(path->dentry, length, 0, NULL);
+		error = do_truncate(path, length, 0, NULL);
 
 put_write_and_out:
 	put_write_access(inode);
@@ -155,7 +156,7 @@ COMPAT_SYSCALL_DEFINE2(truncate, const char __user *, path, compat_off_t, length
 long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 {
 	struct inode *inode;
-	struct dentry *dentry;
+	struct path *path;
 	struct fd f;
 	int error;
 
@@ -171,8 +172,8 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	if (f.file->f_flags & O_LARGEFILE)
 		small = 0;
 
-	dentry = f.file->f_path.dentry;
-	inode = dentry->d_inode;
+	path = &f.file->f_path;
+	inode = path->dentry->d_inode;
 	error = -EINVAL;
 	if (!S_ISREG(inode->i_mode) || !(f.file->f_mode & FMODE_WRITE))
 		goto out_putf;
@@ -192,7 +193,7 @@ long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	if (!error)
 		error = security_path_truncate(&f.file->f_path);
 	if (!error)
-		error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, f.file);
+		error = do_truncate(path, length, ATTR_MTIME|ATTR_CTIME, f.file);
 	sb_end_write(inode->i_sb);
 out_putf:
 	fdput(f);
@@ -558,7 +559,7 @@ static int chmod_common(const struct path *path, umode_t mode)
 		goto out_unlock;
 	newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
 	newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	error = notify_change(path, &newattrs, &delegated_inode);
 out_unlock:
 	inode_unlock(inode);
 	if (delegated_inode) {
@@ -649,7 +650,7 @@ static int chown_common(const struct path *path, uid_t user, gid_t group)
 	inode_lock(inode);
 	error = security_path_chown(path, uid, gid);
 	if (!error)
-		error = notify_change(path->dentry, &newattrs, &delegated_inode);
+		error = notify_change(path, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 9fc47c2e078d..05cfbd579690 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -214,17 +214,17 @@ static int ovl_copy_up_data(struct path *old, struct path *new, loff_t len)
 	return error;
 }
 
-static int ovl_set_size(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_size(struct path *upperpath, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid = ATTR_SIZE,
 		.ia_size = stat->size,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_change(upperpath, &attr, NULL);
 }
 
-static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
+static int ovl_set_timestamps(struct path *upperpath, struct kstat *stat)
 {
 	struct iattr attr = {
 		.ia_valid =
@@ -233,10 +233,10 @@ static int ovl_set_timestamps(struct dentry *upperdentry, struct kstat *stat)
 		.ia_mtime = stat->mtime,
 	};
 
-	return notify_change(upperdentry, &attr, NULL);
+	return notify_change(upperpath, &attr, NULL);
 }
 
-int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
+int ovl_set_attr(struct path *upperpath, struct kstat *stat)
 {
 	int err = 0;
 
@@ -245,7 +245,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_valid = ATTR_MODE,
 			.ia_mode = stat->mode,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_change(upperpath, &attr, NULL);
 	}
 	if (!err) {
 		struct iattr attr = {
@@ -253,10 +253,10 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat *stat)
 			.ia_uid = stat->uid,
 			.ia_gid = stat->gid,
 		};
-		err = notify_change(upperdentry, &attr, NULL);
+		err = notify_change(upperpath, &attr, NULL);
 	}
 	if (!err)
-		ovl_set_timestamps(upperdentry, stat);
+		ovl_set_timestamps(upperpath, stat);
 
 	return err;
 }
@@ -435,8 +435,13 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 {
 	int err;
 	struct dentry *upper;
-	struct dentry *upperdir = ovl_dentry_upper(c->parent);
-	struct inode *udir = d_inode(upperdir);
+	struct dentry *upperdir;
+	struct path upperdirpath;
+	struct inode *udir;
+
+	ovl_path_upper(c->parent, &upperdirpath);
+	upperdir = upperdirpath.dentry;
+	udir = d_inode(upperdir);
 
 	/* Mark parent "impure" because it may now contain non-pure upper */
 	err = ovl_set_impure(c->parent, upperdir);
@@ -457,7 +462,7 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 
 		if (!err) {
 			/* Restore timestamps on parent (best effort) */
-			ovl_set_timestamps(upperdir, &c->pstat);
+			ovl_set_timestamps(&upperdirpath, &c->pstat);
 			ovl_dentry_set_upper_alias(c->dentry);
 		}
 	}
@@ -473,15 +478,16 @@ static int ovl_link_up(struct ovl_copy_up_ctx *c)
 static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 {
 	int err;
+	struct path upperpath, temppath;
 
+	ovl_path_upper(c->dentry, &upperpath);
 	/*
 	 * Copy up data first and then xattrs. Writing data after
 	 * xattrs will remove security.capability xattr automatically.
 	 */
 	if (S_ISREG(c->stat.mode) && !c->metacopy) {
-		struct path upperpath, datapath;
+		struct path datapath;
 
-		ovl_path_upper(c->dentry, &upperpath);
 		if (WARN_ON(upperpath.dentry != NULL))
 			return -EIO;
 		upperpath.dentry = temp;
@@ -515,12 +521,13 @@ static int ovl_copy_up_inode(struct ovl_copy_up_ctx *c, struct dentry *temp)
 		if (err)
 			return err;
 	}
+	temppath = (struct path){ .mnt = upperpath.mnt, .dentry = temp };
 
 	inode_lock(temp->d_inode);
 	if (S_ISREG(c->stat.mode))
-		err = ovl_set_size(temp, &c->stat);
+		err = ovl_set_size(&temppath, &c->stat);
 	if (!err)
-		err = ovl_set_attr(temp, &c->stat);
+		err = ovl_set_attr(&temppath, &c->stat);
 	inode_unlock(temp->d_inode);
 
 	return err;
@@ -736,10 +743,11 @@ static int ovl_do_copy_up(struct ovl_copy_up_ctx *c)
 		err = ovl_set_nlink_upper(c->dentry);
 	} else {
 		struct inode *udir = d_inode(c->destdir);
+		struct path destpath = { .mnt = ofs->upper_mnt, .dentry = c->destdir };
 
 		/* Restore timestamps on parent (best effort) */
 		inode_lock(udir);
-		ovl_set_timestamps(c->destdir, &c->pstat);
+		ovl_set_timestamps(&destpath, &c->pstat);
 		inode_unlock(udir);
 
 		ovl_dentry_set_upper_alias(c->dentry);
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 8e57d5372b8f..cecd7e244829 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -374,7 +374,8 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
 		goto out_cleanup;
 
 	inode_lock(opaquedir->d_inode);
-	err = ovl_set_attr(opaquedir, &stat);
+	err = ovl_set_attr(&(struct path) { .mnt = upperpath.mnt,
+					    .dentry = opaquedir }, &stat);
 	inode_unlock(opaquedir->d_inode);
 	if (err)
 		goto out_cleanup;
@@ -435,10 +436,13 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 	struct inode *udir = upperdir->d_inode;
 	struct dentry *upper;
 	struct dentry *newdentry;
+	struct path newpath;
 	int err;
 	struct posix_acl *acl, *default_acl;
 	bool hardlink = !!cattr->hardlink;
 
+	ovl_path_upper(dentry, &newpath);
+
 	if (WARN_ON(!workdir))
 		return -EROFS;
 
@@ -478,8 +482,10 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
 			.ia_valid = ATTR_MODE,
 			.ia_mode = cattr->mode,
 		};
+
+		newpath.dentry = newdentry;
 		inode_lock(newdentry->d_inode);
-		err = notify_change(newdentry, &attr, NULL);
+		err = notify_change(&newpath, &attr, NULL);
 		inode_unlock(newdentry->d_inode);
 		if (err)
 			goto out_cleanup;
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 79e8994e3bc1..f3d86b357a23 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -45,8 +45,10 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 		err = ovl_copy_up_with_data(dentry);
 	if (!err) {
 		struct inode *winode = NULL;
+		struct path upperpath;
 
-		upperdentry = ovl_dentry_upper(dentry);
+		ovl_path_upper(dentry, &upperpath);
+		upperdentry = upperpath.dentry;
 
 		if (attr->ia_valid & ATTR_SIZE) {
 			winode = d_inode(upperdentry);
@@ -60,7 +62,7 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
 
 		inode_lock(upperdentry->d_inode);
 		old_cred = ovl_override_creds(dentry->d_sb);
-		err = notify_change(upperdentry, attr, NULL);
+		err = notify_change(&upperpath, attr, NULL);
 		revert_creds(old_cred);
 		if (!err)
 			ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 3623d28aa4fa..ecd20f3d979b 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -459,7 +459,7 @@ int ovl_copy_up_with_data(struct dentry *dentry);
 int ovl_copy_up_flags(struct dentry *dentry, int flags);
 int ovl_maybe_copy_up(struct dentry *dentry, int flags);
 int ovl_copy_xattr(struct dentry *old, struct dentry *new);
-int ovl_set_attr(struct dentry *upper, struct kstat *stat);
+int ovl_set_attr(struct path *upperpath, struct kstat *stat);
 struct ovl_fh *ovl_encode_real_fh(struct dentry *real, bool is_upper);
 int ovl_set_origin(struct dentry *dentry, struct dentry *lower,
 		   struct dentry *upper);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 319fe0d355b0..1cd87964b34f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -633,6 +633,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 			.ia_valid = ATTR_MODE,
 			.ia_mode = S_IFDIR | 0,
 		};
+		const struct path workpath = { .mnt = mnt, .dentry = work };
 
 		if (work->d_inode) {
 			err = -EEXIST;
@@ -676,7 +677,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 
 		/* Clear any inherited mode bits */
 		inode_lock(work->d_inode);
-		err = notify_change(work, &attr, NULL);
+		err = notify_change(&workpath, &attr, NULL);
 		inode_unlock(work->d_inode);
 		if (err)
 			goto out_dput;
diff --git a/fs/utimes.c b/fs/utimes.c
index 1d17ce98cb80..8b5635d30de4 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -57,7 +57,7 @@ static int utimes_common(const struct path *path, struct timespec64 *times)
 	}
 retry_deleg:
 	inode_lock(inode);
-	error = notify_change(path->dentry, &newattrs, &delegated_inode);
+	error = notify_change(path, &newattrs, &delegated_inode);
 	inode_unlock(inode);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4dc62a697817..d0e7ca7bf72e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2537,8 +2537,8 @@ struct filename {
 static_assert(offsetof(struct filename, iname) % sizeof(long) == 0);
 
 extern long vfs_truncate(const struct path *, loff_t);
-extern int do_truncate(struct dentry *, loff_t start, unsigned int time_attrs,
-		       struct file *filp);
+extern int do_truncate(const struct path *p, loff_t start,
+		       unsigned int time_attrs, struct file *filp);
 extern int vfs_fallocate(struct file *file, int mode, loff_t offset,
 			loff_t len);
 extern long do_sys_open(int dfd, const char __user *filename, int flags,
@@ -2887,8 +2887,7 @@ static inline int bmap(struct inode *inode,  sector_t *block)
 	return -EINVAL;
 }
 #endif
-
-extern int notify_change(struct dentry *, struct iattr *, struct inode **);
+extern int notify_change(const struct path *, struct iattr *, struct inode **);
 extern int inode_permission(struct inode *, int);
 extern int generic_permission(struct inode *, int);
 extern int __check_sticky(struct inode *dir, struct inode *inode);

From patchwork Mon Feb 17 20:53:06 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Bottomley <James.Bottomley@HansenPartnership.com>
X-Patchwork-Id: 11387587
Return-Path: <SRS0=6ymt=4F=vger.kernel.org=linux-fsdevel-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 46B7C924
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Mon, 17 Feb 2020 20:58:08 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 03BB72070B
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Mon, 17 Feb 2020 20:58:08 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key)
 header.d=hansenpartnership.com header.i=@hansenpartnership.com
 header.b="LqLpyGL5"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1729242AbgBQU6H (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Mon, 17 Feb 2020 15:58:07 -0500
Received: from bedivere.hansenpartnership.com ([66.63.167.143]:37838 "EHLO
        bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1728947AbgBQU6H (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 17 Feb 2020 15:58:07 -0500
Received: from localhost (localhost [127.0.0.1])
        by bedivere.hansenpartnership.com (Postfix) with ESMTP id 86F3D8EE218;
        Mon, 17 Feb 2020 12:58:06 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=hansenpartnership.com;
        s=20151216; t=1581973086;
        bh=ewGzLjmLyybex+04KIZxXQo9TlPiMEo63JusSL0E9OY=;
        h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
        b=LqLpyGL5zyU4JjWJ4BSu/C3K2YRtCdHIKTQJDTrB/tu7sMxG30CsReE2PNbxGIA2c
         TQL6/uWG/LcxoCWU5pGyAQBFYn6KiMcU9uHWMHWLfGSQiPVOlT1X02mIWQVTalmdXe
         WcbyexcGZ5HC9+BGlOrYjYoMQSd2ppPDXjyMB5h0=
Received: from bedivere.hansenpartnership.com ([127.0.0.1])
        by localhost (bedivere.hansenpartnership.com [127.0.0.1])
 (amavisd-new, port 10024)
        with ESMTP id njG0Qazl88yX; Mon, 17 Feb 2020 12:58:06 -0800 (PST)
Received: from jarvis.lan (jarvis.ext.hansenpartnership.com [153.66.160.226])
        by bedivere.hansenpartnership.com (Postfix) with ESMTP id ADE808EE0F5;
        Mon, 17 Feb 2020 12:58:05 -0800 (PST)
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: linux-fsdevel@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>,
 Christian Brauner <christian@brauner.io>, Al Viro <viro@ZenIV.linux.org.uk>,
 Miklos Szeredi <miklos@szeredi.hu>,
 Seth Forshee <seth.forshee@canonical.com>, linux-unionfs@vger.kernel.org,
 Amir Goldstein <amir73il@gmail.com>,
 =?utf-8?q?St=C3=A9phane_Graber?= <stgraber@ubuntu.com>,
 Eric Biederman <ebiederm@xmission.com>, Aleksa Sarai <cyphar@cyphar.com>,
 "Serge E . Hallyn" <serge@hallyn.com>, Tycho Andersen <tycho@tycho.ws>,
 containers@lists.linux-foundation.org
Subject: [PATCH v3 2/3] fs: introduce uid/gid shifting bind mount
Date: Mon, 17 Feb 2020 12:53:06 -0800
Message-Id: <20200217205307.32256-3-James.Bottomley@HansenPartnership.com>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20200217205307.32256-1-James.Bottomley@HansenPartnership.com>
References: <20200217205307.32256-1-James.Bottomley@HansenPartnership.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

This implementation reverse shifts according to the user_ns belonging
to the mnt_ns.  So if the vfsmount has the newly introduced flag
MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
then we shift back using the user_ns and an optional mnt_userns (which
belongs to the struct mount) before committing to the underlying
filesystem.

For example, if a user_ns is created where interior (fake root, uid 0)
is mapped to kernel uid 100000 then writes from interior root normally
go to the filesystem at the kernel uid.  However, if MNT_SHIFT is set,
they will be shifted back to write at uid 0, meaning we can bind mount
real image filesystems to user_ns protected faker root.

In essence there are several things which have to be done for this to
occur safely.  Firstly for all operations on the filesystem, new
credentials have to be installed where fsuid and fsgid are set to the
*interior* values. Next all inodes used from the filesystem have to
have i_uid and i_gid shifted back to the kernel values and attributes
set from user space have to have ia_uid and ia_gid shifted from the
kernel values to the interior values.  The capability checks have to
be done using ns_capable against the kernel values, but the inode
capability checks have to be done against the shifted ids.

Since creating a new credential is a reasonably expensive proposition
and we have to shift and unshift many times during path walking, a
cached copy of the shifted credential is saved to a newly created
place in the task structure.  This serves the dual purpose of allowing
us to use a pre-prepared copy of the shifted credentials and also
allows us to recognise whenever the shift is actually in effect (the
cached shifted credential pointer being equal to the current_cred()
pointer).

To get this all to work, we have a check for the vfsmount flag and the
user_ns gating a shifting of the credentials over all user space
entries to filesystem functions.  In theory the path has to be present
everywhere we do this, so we can check the vfsmount flags.  However,
for lower level functions we can cheat this path check of vfsmount
simply to check whether a shifted credential is in effect or not to
gate things like the inode permission check, which means the path
doesn't have to be threaded all the way through the permission
checking functions.  if the credential is shifted check passes, we can
also be sure that the current user_ns is the same as the mnt->user_ns,
so we can use it and thus have no need of the struct mount at the
point of the shift.

Although the shift can be effected simply by executing
do_reconfigure_mnt with MNT_SHIFT in the flags, this patch only
contains the shifting mechanisms.  The follow on patch wires up the
user visible API for turning the flag on.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---

v3: added a bind mount base shift at the request of Serge Hallyn
---
 fs/attr.c             | 127 +++++++++++++++++++++++++++++++++++++++++---------
 fs/exec.c             |   3 +-
 fs/inode.c            |  10 ++--
 fs/internal.h         |   2 +
 fs/mount.h            |   1 +
 fs/namei.c            | 112 +++++++++++++++++++++++++++++++++++++-------
 fs/namespace.c        |   5 ++
 fs/open.c             |  25 +++++++++-
 fs/posix_acl.c        |   4 +-
 fs/stat.c             |  32 +++++++++++--
 include/linux/cred.h  |  12 +++++
 include/linux/mount.h |   4 +-
 include/linux/sched.h |   5 ++
 kernel/capability.c   |   9 +++-
 kernel/cred.c         |  20 ++++++++
 15 files changed, 317 insertions(+), 54 deletions(-)

diff --git a/fs/attr.c b/fs/attr.c
index 11201ab7e3b1..d7c5883a4b4c 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -18,14 +18,26 @@
 #include <linux/evm.h>
 #include <linux/ima.h>
 
+#include "internal.h"
+#include "mount.h"
+
 static bool chown_ok(const struct inode *inode, kuid_t uid)
 {
+	kuid_t i_uid = inode->i_uid;
+
+	if (cred_is_shifted()) {
+		struct mount *m = real_mount(current->mnt);
+
+		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
+		i_uid = make_kuid(current_user_ns(), __kuid_val(i_uid));
+	}
+
 	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    uid_eq(uid, inode->i_uid))
+	    uid_eq(uid, i_uid))
 		return true;
 	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
 		return true;
-	if (uid_eq(inode->i_uid, INVALID_UID) &&
+	if (uid_eq(i_uid, INVALID_UID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
@@ -33,17 +45,40 @@ static bool chown_ok(const struct inode *inode, kuid_t uid)
 
 static bool chgrp_ok(const struct inode *inode, kgid_t gid)
 {
+	kgid_t i_gid = inode->i_gid;
+	kuid_t i_uid = inode->i_uid;
+
+	if (cred_is_shifted()) {
+		struct mount *m = real_mount(current->mnt);
+		struct user_namespace *ns = current_user_ns();
+
+		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
+		i_uid = make_kuid(ns, __kuid_val(i_uid));
+		i_gid = KGIDT_INIT(from_kgid(m->mnt_userns, i_gid));
+		i_gid = make_kgid(ns, __kgid_val(i_gid));
+	}
 	if (uid_eq(current_fsuid(), inode->i_uid) &&
-	    (in_group_p(gid) || gid_eq(gid, inode->i_gid)))
+	    (in_group_p(gid) || gid_eq(gid, i_gid)))
 		return true;
 	if (capable_wrt_inode_uidgid(inode, CAP_CHOWN))
 		return true;
-	if (gid_eq(inode->i_gid, INVALID_GID) &&
+	if (gid_eq(i_gid, INVALID_GID) &&
 	    ns_capable(inode->i_sb->s_user_ns, CAP_CHOWN))
 		return true;
 	return false;
 }
 
+int in_group_p_shifted(kgid_t grp)
+{
+	if (cred_is_shifted()) {
+		struct mount *m = real_mount(current->mnt);
+
+		grp = KGIDT_INIT(from_kgid(m->mnt_userns, grp));
+		grp = make_kgid(current_user_ns(), __kgid_val(grp));
+	}
+	return in_group_p(grp);
+}
+
 /**
  * setattr_prepare - check if attribute changes to a dentry are allowed
  * @dentry:	dentry to check
@@ -89,9 +124,10 @@ int setattr_prepare(struct dentry *dentry, struct iattr *attr)
 	if (ia_valid & ATTR_MODE) {
 		if (!inode_owner_or_capable(inode))
 			return -EPERM;
+
 		/* Also check the setgid bit! */
-		if (!in_group_p((ia_valid & ATTR_GID) ? attr->ia_gid :
-				inode->i_gid) &&
+		if (!in_group_p_shifted((ia_valid & ATTR_GID) ? attr->ia_gid :
+					inode->i_gid) &&
 		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 			attr->ia_mode &= ~S_ISGID;
 	}
@@ -192,7 +228,7 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 	if (ia_valid & ATTR_MODE) {
 		umode_t mode = attr->ia_mode;
 
-		if (!in_group_p(inode->i_gid) &&
+		if (!in_group_p_shifted(inode->i_gid) &&
 		    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 			mode &= ~S_ISGID;
 		inode->i_mode = mode;
@@ -200,6 +236,23 @@ void setattr_copy(struct inode *inode, const struct iattr *attr)
 }
 EXPORT_SYMBOL(setattr_copy);
 
+void cred_shift(kuid_t *uid, kgid_t *gid)
+{
+	if (cred_is_shifted()) {
+		struct user_namespace *ns = current_user_ns();
+		struct mount *m = real_mount(current->mnt);
+
+		if (uid) {
+			*uid = KUIDT_INIT(from_kuid(m->mnt_userns, *uid));
+			*uid = make_kuid(ns, __kuid_val(*uid));
+		}
+		if (gid) {
+			*gid = KGIDT_INIT(from_kgid(m->mnt_userns, *gid));
+			*gid = make_kgid(ns, __kgid_val(*gid));
+		}
+	}
+}
+
 /**
  * notify_change - modify attributes of a filesytem object
  * @dentry:	object affected
@@ -229,6 +282,9 @@ int notify_change(const struct path *path, struct iattr * attr,
 	int error;
 	struct timespec64 now;
 	unsigned int ia_valid = attr->ia_valid;
+	const struct cred *cred;
+	kuid_t i_uid = inode->i_uid;
+	kgid_t i_gid = inode->i_gid;
 
 	WARN_ON_ONCE(!inode_is_locked(inode));
 
@@ -237,18 +293,30 @@ int notify_change(const struct path *path, struct iattr * attr,
 			return -EPERM;
 	}
 
+	cred = change_userns_creds(path);
+	if (cred) {
+		struct mount *m = real_mount(path->mnt);
+
+		attr->ia_uid = KUIDT_INIT(from_kuid(m->mnt_ns->user_ns, attr->ia_uid));
+		attr->ia_uid = make_kuid(m->mnt_userns, __kuid_val(attr->ia_uid));
+		attr->ia_gid = KGIDT_INIT(from_kgid(m->mnt_ns->user_ns, attr->ia_gid));
+		attr->ia_gid = make_kgid(m->mnt_userns, __kgid_val(attr->ia_gid));
+	}
+
 	/*
 	 * If utimes(2) and friends are called with times == NULL (or both
 	 * times are UTIME_NOW), then we need to check for write permission
 	 */
 	if (ia_valid & ATTR_TOUCH) {
-		if (IS_IMMUTABLE(inode))
-			return -EPERM;
+		if (IS_IMMUTABLE(inode)) {
+			error = -EPERM;
+			goto err;
+		}
 
 		if (!inode_owner_or_capable(inode)) {
 			error = inode_permission(inode, MAY_WRITE);
 			if (error)
-				return error;
+				goto err;
 		}
 	}
 
@@ -274,7 +342,7 @@ int notify_change(const struct path *path, struct iattr * attr,
 	if (ia_valid & ATTR_KILL_PRIV) {
 		error = security_inode_need_killpriv(dentry);
 		if (error < 0)
-			return error;
+			goto err;
 		if (error == 0)
 			ia_valid = attr->ia_valid &= ~ATTR_KILL_PRIV;
 	}
@@ -305,34 +373,49 @@ int notify_change(const struct path *path, struct iattr * attr,
 			attr->ia_mode &= ~S_ISGID;
 		}
 	}
-	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID)))
-		return 0;
+	if (!(attr->ia_valid & ~(ATTR_KILL_SUID | ATTR_KILL_SGID))) {
+		error = 0;
+		goto err;
+	}
 
 	/*
 	 * Verify that uid/gid changes are valid in the target
 	 * namespace of the superblock.
 	 */
+	error = -EOVERFLOW;
 	if (ia_valid & ATTR_UID &&
 	    !kuid_has_mapping(inode->i_sb->s_user_ns, attr->ia_uid))
-		return -EOVERFLOW;
+		goto err;
+
 	if (ia_valid & ATTR_GID &&
 	    !kgid_has_mapping(inode->i_sb->s_user_ns, attr->ia_gid))
-		return -EOVERFLOW;
+		goto err;
 
 	/* Don't allow modifications of files with invalid uids or
 	 * gids unless those uids & gids are being made valid.
 	 */
-	if (!(ia_valid & ATTR_UID) && !uid_valid(inode->i_uid))
-		return -EOVERFLOW;
-	if (!(ia_valid & ATTR_GID) && !gid_valid(inode->i_gid))
-		return -EOVERFLOW;
+	if (cred_is_shifted()) {
+		struct user_namespace *ns = current_user_ns();
+		struct mount *m = real_mount(current->mnt);
+
+		i_uid = KUIDT_INIT(from_kuid(m->mnt_userns, i_uid));
+		i_uid = make_kuid(ns, __kuid_val(i_uid));
+		i_gid = KGIDT_INIT(from_kgid(m->mnt_userns, i_gid));
+		i_gid = make_kgid(ns, __kgid_val(i_gid));
+	}
+
+	if (!(ia_valid & ATTR_UID) && !uid_valid(i_uid))
+		goto err;
+
+	if (!(ia_valid & ATTR_GID) && !gid_valid(i_gid))
+		goto err;
 
 	error = security_inode_setattr(dentry, attr);
 	if (error)
-		return error;
+		goto err;
 	error = try_break_deleg(inode, delegated_inode);
 	if (error)
-		return error;
+		goto err;
 
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
@@ -345,6 +428,8 @@ int notify_change(const struct path *path, struct iattr * attr,
 		evm_inode_post_setattr(dentry, ia_valid);
 	}
 
+ err:
+	revert_userns_creds(cred);
 	return error;
 }
 EXPORT_SYMBOL(notify_change);
diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..926bab39ed45 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1543,13 +1543,14 @@ static void bprm_fill_uid(struct linux_binprm *bprm)
 
 	/* Be careful if suid/sgid is set */
 	inode_lock(inode);
-
 	/* reload atomically mode/uid/gid now that lock held */
 	mode = inode->i_mode;
 	uid = inode->i_uid;
 	gid = inode->i_gid;
 	inode_unlock(inode);
 
+	cred_shift(&uid, &gid);
+
 	/* We ignore suid/sgid if there are no mappings for them in the ns */
 	if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
 		 !kgid_has_mapping(bprm->cred->user_ns, gid))
diff --git a/fs/inode.c b/fs/inode.c
index be14d3fcbee1..ae75b6396786 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2064,7 +2064,7 @@ void inode_init_owner(struct inode *inode, const struct inode *dir,
 		if (S_ISDIR(mode))
 			mode |= S_ISGID;
 		else if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP) &&
-			 !in_group_p(inode->i_gid) &&
+			 !in_group_p_shifted(inode->i_gid) &&
 			 !capable_wrt_inode_uidgid(dir, CAP_FSETID))
 			mode &= ~S_ISGID;
 	} else
@@ -2083,12 +2083,16 @@ EXPORT_SYMBOL(inode_init_owner);
 bool inode_owner_or_capable(const struct inode *inode)
 {
 	struct user_namespace *ns;
+	kuid_t uid = inode->i_uid;
 
-	if (uid_eq(current_fsuid(), inode->i_uid))
+	if (uid_eq(current_fsuid(), uid))
 		return true;
 
 	ns = current_user_ns();
-	if (kuid_has_mapping(ns, inode->i_uid) && ns_capable(ns, CAP_FOWNER))
+
+	cred_shift(&uid, NULL);
+
+	if (kuid_has_mapping(ns, uid) && ns_capable(ns, CAP_FOWNER))
 		return true;
 	return false;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 80d89ddb9b28..d2adcdb3eb2e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -73,6 +73,8 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 		  const char __user *newname);
 int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	      const char __user *newname, int flags);
+const struct cred *change_userns_creds(const struct path *p);
+void revert_userns_creds(const struct cred *cred);
 
 /*
  * namespace.c
diff --git a/fs/mount.h b/fs/mount.h
index 711a4093e475..c3bfc6ced4c7 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct user_namespace *mnt_userns; /* mapping for underlying mount uid/gid */
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
diff --git a/fs/namei.c b/fs/namei.c
index 531ac55c7e67..369bd18c7330 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -124,6 +124,42 @@
 
 #define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
 
+const struct cred *change_userns_creds(const struct path *p)
+{
+	struct mount *m = real_mount(p->mnt);
+
+	if ((p->mnt->mnt_flags & MNT_SHIFT) == 0)
+		return NULL;
+
+	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
+		return NULL;
+
+	if (current->mnt != p->mnt) {
+		struct cred *cred;
+		struct user_namespace *user_ns = m->mnt_ns->user_ns;
+		kuid_t fsuid = current->cred->fsuid;
+		kgid_t fsgid = current->cred->fsgid;
+
+		if (current->mnt_cred)
+			put_cred(current->mnt_cred);
+		cred = prepare_creds();
+		fsuid = KUIDT_INIT(from_kuid(user_ns, fsuid));
+		fsgid = KGIDT_INIT(from_kgid(user_ns, fsgid));
+		cred->fsuid = make_kuid(m->mnt_userns, __kuid_val(fsuid));
+		cred->fsgid = make_kgid(m->mnt_userns, __kgid_val(fsgid));
+		current->mnt = p->mnt; /* no reference needed */
+		current->mnt_cred = cred;
+	}
+	return override_creds(current->mnt_cred);
+}
+
+void revert_userns_creds(const struct cred *cred)
+{
+	if (!cred)
+		return;
+	revert_creds(cred);
+}
+
 struct filename *
 getname_flags(const char __user *filename, int flags, int *empty)
 {
@@ -303,7 +339,7 @@ static int acl_permission_check(struct inode *inode, int mask)
 				return error;
 		}
 
-		if (in_group_p(inode->i_gid))
+		if (in_group_p_shifted(inode->i_gid))
 			mode >>= 3;
 	}
 
@@ -366,7 +402,6 @@ int generic_permission(struct inode *inode, int mask)
 	if (!(mask & MAY_EXEC) || (inode->i_mode & S_IXUGO))
 		if (capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE))
 			return 0;
-
 	return -EACCES;
 }
 EXPORT_SYMBOL(generic_permission);
@@ -1897,6 +1932,7 @@ static int walk_component(struct nameidata *nd, int flags)
 	struct inode *inode;
 	unsigned seq;
 	int err;
+	const struct cred *cred;
 	/*
 	 * "." and ".." are special - ".." especially so because it has
 	 * to be able to know about the current root directory and
@@ -1908,25 +1944,31 @@ static int walk_component(struct nameidata *nd, int flags)
 			put_link(nd);
 		return err;
 	}
+	cred = change_userns_creds(&nd->path);
 	err = lookup_fast(nd, &path, &inode, &seq);
 	if (unlikely(err <= 0)) {
 		if (err < 0)
-			return err;
+			goto out;
 		path.dentry = lookup_slow(&nd->last, nd->path.dentry,
 					  nd->flags);
-		if (IS_ERR(path.dentry))
-			return PTR_ERR(path.dentry);
+		if (IS_ERR(path.dentry)) {
+			err = PTR_ERR(path.dentry);
+			goto out;
+		}
 
 		path.mnt = nd->path.mnt;
 		err = follow_managed(&path, nd);
 		if (unlikely(err < 0))
-			return err;
+			goto out;
 
 		seq = 0;	/* we are already out of RCU mode */
 		inode = d_backing_inode(path.dentry);
 	}
 
-	return step_into(nd, &path, flags, inode, seq);
+	err = step_into(nd, &path, flags, inode, seq);
+ out:
+	revert_userns_creds(cred);
+	return err;
 }
 
 /*
@@ -2180,8 +2222,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
 	for(;;) {
 		u64 hash_len;
 		int type;
+		const struct cred *cred = change_userns_creds(&nd->path);
 
 		err = may_lookup(nd);
+		revert_userns_creds(cred);
 		if (err)
 			return err;
 
@@ -2373,12 +2417,17 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
 static const char *trailing_symlink(struct nameidata *nd)
 {
 	const char *s;
+	const struct cred *cred = change_userns_creds(&nd->path);
 	int error = may_follow_link(nd);
-	if (unlikely(error))
-		return ERR_PTR(error);
+	if (unlikely(error)) {
+		s = ERR_PTR(error);
+		goto out;
+	}
 	nd->flags |= LOOKUP_PARENT;
 	nd->stack[0].name = NULL;
 	s = get_link(nd);
+ out:
+	revert_userns_creds(cred);
 	return s ? s : "";
 }
 
@@ -3343,6 +3392,7 @@ static int do_last(struct nameidata *nd,
 	struct inode *inode;
 	struct path path;
 	int error;
+	const struct cred *cred = change_userns_creds(&nd->path);
 
 	nd->flags &= ~LOOKUP_PARENT;
 	nd->flags |= op->intent;
@@ -3350,7 +3400,7 @@ static int do_last(struct nameidata *nd,
 	if (nd->last_type != LAST_NORM) {
 		error = handle_dots(nd, nd->last_type);
 		if (unlikely(error))
-			return error;
+			goto err;
 		goto finish_open;
 	}
 
@@ -3363,7 +3413,7 @@ static int do_last(struct nameidata *nd,
 			goto finish_lookup;
 
 		if (error < 0)
-			return error;
+			goto err;
 
 		BUG_ON(nd->inode != dir->d_inode);
 		BUG_ON(nd->flags & LOOKUP_RCU);
@@ -3376,12 +3426,14 @@ static int do_last(struct nameidata *nd,
 		 */
 		error = complete_walk(nd);
 		if (error)
-			return error;
+			goto err;
 
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
 		/* trailing slashes? */
-		if (unlikely(nd->last.name[nd->last.len]))
-			return -EISDIR;
+		if (unlikely(nd->last.name[nd->last.len])) {
+			error = -EISDIR;
+			goto err;
+		}
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {
@@ -3437,7 +3489,7 @@ static int do_last(struct nameidata *nd,
 
 	error = follow_managed(&path, nd);
 	if (unlikely(error < 0))
-		return error;
+		goto err;
 
 	/*
 	 * create/update audit record if it already exists.
@@ -3446,7 +3498,8 @@ static int do_last(struct nameidata *nd,
 
 	if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) {
 		path_to_nameidata(&path, nd);
-		return -EEXIST;
+		error = -EEXIST;
+		goto err;
 	}
 
 	seq = 0;	/* out of RCU mode, so the value doesn't matter */
@@ -3454,12 +3507,12 @@ static int do_last(struct nameidata *nd,
 finish_lookup:
 	error = step_into(nd, &path, 0, inode, seq);
 	if (unlikely(error))
-		return error;
+		goto err;
 finish_open:
 	/* Why this, you ask?  _Now_ we might have grown LOOKUP_JUMPED... */
 	error = complete_walk(nd);
 	if (error)
-		return error;
+		goto err;
 	audit_inode(nd->name, nd->path.dentry, 0);
 	if (open_flag & O_CREAT) {
 		error = -EISDIR;
@@ -3501,6 +3554,8 @@ static int do_last(struct nameidata *nd,
 	}
 	if (got_write)
 		mnt_drop_write(nd->path.mnt);
+ err:
+	revert_userns_creds(cred);
 	return error;
 }
 
@@ -3819,6 +3874,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	struct path path;
 	int error;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 
 	error = may_mknod(mode);
 	if (error)
@@ -3828,6 +3884,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	cred = change_userns_creds(&path);
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mknod(&path, dentry, mode, dev);
@@ -3849,6 +3906,7 @@ long do_mknodat(int dfd, const char __user *filename, umode_t mode,
 	}
 out:
 	done_path_create(&path, dentry);
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -3899,18 +3957,21 @@ long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
+	const struct cred *cred;
 
 retry:
 	dentry = user_path_create(dfd, pathname, &path, lookup_flags);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
+	cred = change_userns_creds(&path);
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mkdir(&path, dentry, mode);
 	if (!error)
 		error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
 	done_path_create(&path, dentry);
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
@@ -3977,12 +4038,14 @@ long do_rmdir(int dfd, const char __user *pathname)
 	struct qstr last;
 	int type;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 retry:
 	name = filename_parentat(dfd, getname(pathname), lookup_flags,
 				&path, &last, &type);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
+	cred = change_userns_creds(&path);
 	switch (type) {
 	case LAST_DOTDOT:
 		error = -ENOTEMPTY;
@@ -4018,6 +4081,7 @@ long do_rmdir(int dfd, const char __user *pathname)
 	inode_unlock(path.dentry->d_inode);
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	putname(name);
 	if (retry_estale(error, lookup_flags)) {
@@ -4107,11 +4171,13 @@ long do_unlinkat(int dfd, struct filename *name)
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 retry:
 	name = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (IS_ERR(name))
 		return PTR_ERR(name);
 
+	cred = change_userns_creds(&path);
 	error = -EISDIR;
 	if (type != LAST_NORM)
 		goto exit1;
@@ -4149,6 +4215,7 @@ long do_unlinkat(int dfd, struct filename *name)
 	}
 	mnt_drop_write(path.mnt);
 exit1:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -4213,6 +4280,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	struct dentry *dentry;
 	struct path path;
 	unsigned int lookup_flags = 0;
+	const struct cred *cred;
 
 	from = getname(oldname);
 	if (IS_ERR(from))
@@ -4223,6 +4291,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 	if (IS_ERR(dentry))
 		goto out_putname;
 
+	cred = change_userns_creds(&path);
 	error = security_path_symlink(&path, dentry, from->name);
 	if (!error)
 		error = vfs_symlink(path.dentry->d_inode, dentry, from->name);
@@ -4231,6 +4300,7 @@ long do_symlinkat(const char __user *oldname, int newdfd,
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
+	revert_userns_creds(cred);
 out_putname:
 	putname(from);
 	return error;
@@ -4344,6 +4414,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	struct inode *delegated_inode = NULL;
 	int how = 0;
 	int error;
+	const struct cred *cred;
 
 	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
 		return -EINVAL;
@@ -4371,6 +4442,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 	if (IS_ERR(new_dentry))
 		goto out;
 
+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto out_dput;
@@ -4382,6 +4454,7 @@ int do_linkat(int olddfd, const char __user *oldname, int newdfd,
 		goto out_dput;
 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry, &delegated_inode);
 out_dput:
+	revert_userns_creds(cred);
 	done_path_create(&new_path, new_dentry);
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
@@ -4601,6 +4674,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
 	bool should_retry = false;
 	int error;
+	const struct cred *cred;
 
 	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
 		return -EINVAL;
@@ -4630,6 +4704,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 		goto exit1;
 	}
 
+	cred = change_userns_creds(&new_path);
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto exit2;
@@ -4714,6 +4789,7 @@ static int do_renameat2(int olddfd, const char __user *oldname, int newdfd,
 	}
 	mnt_drop_write(old_path.mnt);
 exit2:
+	revert_userns_creds(cred);
 	if (retry_estale(error, lookup_flags))
 		should_retry = true;
 	path_put(&new_path);
diff --git a/fs/namespace.c b/fs/namespace.c
index 69fb23ae3d8f..4720647588ab 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -200,6 +200,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 		mnt->mnt_writers = 0;
 #endif
 
+		mnt->mnt_userns = get_user_ns(&init_user_ns);
+
 		INIT_HLIST_NODE(&mnt->mnt_hash);
 		INIT_LIST_HEAD(&mnt->mnt_child);
 		INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -1044,6 +1046,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	mnt->mnt.mnt_root = dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
+	put_user_ns(mnt->mnt_userns);
+	mnt->mnt_userns = get_user_ns(old->mnt_userns);
 	lock_mount_hash();
 	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
 	unlock_mount_hash();
@@ -1102,6 +1106,7 @@ static void cleanup_mnt(struct mount *mnt)
 	dput(mnt->mnt.mnt_root);
 	deactivate_super(mnt->mnt.mnt_sb);
 	mnt_free_id(mnt);
+	put_user_ns(mnt->mnt_userns);
 	call_rcu(&mnt->mnt_rcu, delayed_free_vfsmnt);
 }
 
diff --git a/fs/open.c b/fs/open.c
index db6758b9636a..d27b90dce64d 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -456,11 +456,13 @@ int ksys_chdir(const char __user *filename)
 	struct path path;
 	int error;
 	unsigned int lookup_flags = LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
+	const struct cred *cred;
 retry:
 	error = user_path_at(AT_FDCWD, filename, lookup_flags, &path);
 	if (error)
 		goto out;
 
+	cred = change_userns_creds(&path);
 	error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
 	if (error)
 		goto dput_and_out;
@@ -468,6 +470,7 @@ int ksys_chdir(const char __user *filename)
 	set_fs_pwd(current->fs, &path);
 
 dput_and_out:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -486,11 +489,13 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 {
 	struct fd f = fdget_raw(fd);
 	int error;
+	const struct cred *cred;
 
 	error = -EBADF;
 	if (!f.file)
 		goto out;
 
+	cred = change_userns_creds(&f.file->f_path);
 	error = -ENOTDIR;
 	if (!d_can_lookup(f.file->f_path.dentry))
 		goto out_putf;
@@ -499,6 +504,7 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
 	if (!error)
 		set_fs_pwd(current->fs, &f.file->f_path);
 out_putf:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -547,11 +553,13 @@ static int chmod_common(const struct path *path, umode_t mode)
 	struct inode *inode = path->dentry->d_inode;
 	struct inode *delegated_inode = NULL;
 	struct iattr newattrs;
+	const struct cred *cred;
 	int error;
 
+	cred = change_userns_creds(path);
 	error = mnt_want_write(path->mnt);
 	if (error)
-		return error;
+		goto out;
 retry_deleg:
 	inode_lock(inode);
 	error = security_path_chmod(path, mode);
@@ -568,6 +576,8 @@ static int chmod_common(const struct path *path, umode_t mode)
 			goto retry_deleg;
 	}
 	mnt_drop_write(path->mnt);
+ out:
+	revert_userns_creds(cred);
 	return error;
 }
 
@@ -666,6 +676,7 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	struct path path;
 	int error = -EINVAL;
 	int lookup_flags;
+	const struct cred *cred;
 
 	if ((flag & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
 		goto out;
@@ -677,12 +688,14 @@ int do_fchownat(int dfd, const char __user *filename, uid_t user, gid_t group,
 	error = user_path_at(dfd, filename, lookup_flags, &path);
 	if (error)
 		goto out;
+	cred = change_userns_creds(&path);
 	error = mnt_want_write(path.mnt);
 	if (error)
 		goto out_release;
 	error = chown_common(&path, user, group);
 	mnt_drop_write(path.mnt);
 out_release:
+	revert_userns_creds(cred);
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -713,10 +726,12 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 {
 	struct fd f = fdget(fd);
 	int error = -EBADF;
+	const struct cred *cred;
 
 	if (!f.file)
 		goto out;
 
+	cred = change_userns_creds(&f.file->f_path);
 	error = mnt_want_write_file(f.file);
 	if (error)
 		goto out_fput;
@@ -724,6 +739,7 @@ int ksys_fchown(unsigned int fd, uid_t user, gid_t group)
 	error = chown_common(&f.file->f_path, user, group);
 	mnt_drop_write_file(f.file);
 out_fput:
+	revert_userns_creds(cred);
 	fdput(f);
 out:
 	return error;
@@ -911,8 +927,13 @@ EXPORT_SYMBOL(file_path);
  */
 int vfs_open(const struct path *path, struct file *file)
 {
+	int ret;
+	const struct cred *cred = change_userns_creds(path);
+
 	file->f_path = *path;
-	return do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	ret = do_dentry_open(file, d_backing_inode(path->dentry), NULL);
+	revert_userns_creds(cred);
+	return ret;
 }
 
 struct file *dentry_open(const struct path *path, int flags,
diff --git a/fs/posix_acl.c b/fs/posix_acl.c
index 249672bf54fe..ff777110f3da 100644
--- a/fs/posix_acl.c
+++ b/fs/posix_acl.c
@@ -364,7 +364,7 @@ posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
                                         goto mask;
 				break;
                         case ACL_GROUP_OBJ:
-                                if (in_group_p(inode->i_gid)) {
+				if (in_group_p_shifted(inode->i_gid)) {
 					found = 1;
 					if ((pa->e_perm & want) == want)
 						goto mask;
@@ -655,7 +655,7 @@ int posix_acl_update_mode(struct inode *inode, umode_t *mode_p,
 		return error;
 	if (error == 0)
 		*acl = NULL;
-	if (!in_group_p(inode->i_gid) &&
+	if (!in_group_p_shifted(inode->i_gid) &&
 	    !capable_wrt_inode_uidgid(inode, CAP_FSETID))
 		mode &= ~S_ISGID;
 	*mode_p = mode;
diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..634b8d13ed51 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -22,6 +22,7 @@
 #include <asm/unistd.h>
 
 #include "internal.h"
+#include "mount.h"
 
 /**
  * generic_fillattr - Fill in the basic attributes from the inode struct
@@ -50,6 +51,23 @@ void generic_fillattr(struct inode *inode, struct kstat *stat)
 }
 EXPORT_SYMBOL(generic_fillattr);
 
+static void shift_check(struct vfsmount *mnt, struct kstat *stat)
+{
+	struct mount *m = real_mount(mnt);
+	struct user_namespace *user_ns = m->mnt_ns->user_ns;
+
+	if ((mnt->mnt_flags & MNT_SHIFT) == 0)
+		return;
+
+	if (current->nsproxy->mnt_ns->user_ns != m->mnt_ns->user_ns)
+		return;
+
+	stat->uid = KUIDT_INIT(from_kuid(m->mnt_userns, stat->uid));
+	stat->uid = make_kuid(user_ns, __kuid_val(stat->uid));
+	stat->gid = KGIDT_INIT(from_kgid(m->mnt_userns, stat->gid));
+	stat->gid = make_kgid(user_ns, __kgid_val(stat->gid));
+}
+
 /**
  * vfs_getattr_nosec - getattr without security checks
  * @path: file to get attributes from
@@ -67,6 +85,7 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 		      u32 request_mask, unsigned int query_flags)
 {
 	struct inode *inode = d_backing_inode(path->dentry);
+	int ret;
 
 	memset(stat, 0, sizeof(*stat));
 	stat->result_mask |= STATX_BASIC_STATS;
@@ -79,12 +98,17 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	ret = 0;
 	if (inode->i_op->getattr)
-		return inode->i_op->getattr(path, stat, request_mask,
-					    query_flags);
+		ret = inode->i_op->getattr(path, stat, request_mask,
+					   query_flags);
+	else
+		generic_fillattr(inode, stat);
 
-	generic_fillattr(inode, stat);
-	return 0;
+	if (!ret)
+		shift_check(path->mnt, stat);
+
+	return ret;
 }
 EXPORT_SYMBOL(vfs_getattr_nosec);
 
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 18639c069263..d29638617844 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -59,6 +59,7 @@ extern struct group_info *groups_alloc(int);
 extern void groups_free(struct group_info *);
 
 extern int in_group_p(kgid_t);
+extern int in_group_p_shifted(kgid_t);
 extern int in_egroup_p(kgid_t);
 extern int groups_search(const struct group_info *, kgid_t);
 
@@ -75,6 +76,10 @@ static inline int in_group_p(kgid_t grp)
 {
         return 1;
 }
+static inline int in_group_p_shifted(kgid_t grp)
+{
+	return 1;
+}
 static inline int in_egroup_p(kgid_t grp)
 {
         return 1;
@@ -422,4 +427,11 @@ do {						\
 	*(_fsgid) = __cred->fsgid;		\
 } while(0)
 
+static inline bool cred_is_shifted(void)
+{
+	return current_cred() == current->mnt_cred;
+}
+
+extern void cred_shift(kuid_t *kuid, kgid_t *kgid);
+
 #endif /* _LINUX_CRED_H */
diff --git a/include/linux/mount.h b/include/linux/mount.h
index bf8cc4108b8f..cdc5d981d594 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -46,7 +46,7 @@ struct fs_context;
 #define MNT_SHARED_MASK	(MNT_UNBINDABLE)
 #define MNT_USER_SETTABLE_MASK  (MNT_NOSUID | MNT_NODEV | MNT_NOEXEC \
 				 | MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME \
-				 | MNT_READONLY)
+				 | MNT_READONLY | MNT_SHIFT)
 #define MNT_ATIME_MASK (MNT_NOATIME | MNT_NODIRATIME | MNT_RELATIME )
 
 #define MNT_INTERNAL_FLAGS (MNT_SHARED | MNT_WRITE_HOLD | MNT_INTERNAL | \
@@ -65,6 +65,8 @@ struct fs_context;
 #define MNT_MARKED		0x4000000
 #define MNT_UMOUNT		0x8000000
 
+#define MNT_SHIFT		0x10000000
+
 struct vfsmount {
 	struct dentry *mnt_root;	/* root of the mounted tree */
 	struct super_block *mnt_sb;	/* pointer to superblock */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..b489c3964e5a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -19,6 +19,7 @@
 #include <linux/plist.h>
 #include <linux/hrtimer.h>
 #include <linux/seccomp.h>
+#include <linux/mount.h>
 #include <linux/nodemask.h>
 #include <linux/rcupdate.h>
 #include <linux/refcount.h>
@@ -882,6 +883,10 @@ struct task_struct {
 	/* Effective (overridable) subjective task credentials (COW): */
 	const struct cred __rcu		*cred;
 
+	/* cache for uid/gid shifted cred tied to mnt */
+	struct cred			*mnt_cred;
+	struct vfsmount			*mnt;
+
 #ifdef CONFIG_KEYS
 	/* Cached requested key. */
 	struct key			*cached_requested_key;
diff --git a/kernel/capability.c b/kernel/capability.c
index 1444f3954d75..a28955a38460 100644
--- a/kernel/capability.c
+++ b/kernel/capability.c
@@ -486,8 +486,13 @@ EXPORT_SYMBOL(file_ns_capable);
  */
 bool privileged_wrt_inode_uidgid(struct user_namespace *ns, const struct inode *inode)
 {
-	return kuid_has_mapping(ns, inode->i_uid) &&
-		kgid_has_mapping(ns, inode->i_gid);
+	kuid_t i_uid = inode->i_uid;
+	kgid_t i_gid = inode->i_gid;
+
+	cred_shift(&i_uid, &i_gid);
+
+	return kuid_has_mapping(ns, i_uid) &&
+		kgid_has_mapping(ns, i_gid);
 }
 
 /**
diff --git a/kernel/cred.c b/kernel/cred.c
index 809a985b1793..32a8eb5c91b3 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -167,6 +167,8 @@ void exit_creds(struct task_struct *tsk)
 	validate_creds(cred);
 	alter_cred_subscribers(cred, -1);
 	put_cred(cred);
+	if (tsk->mnt_cred)
+		put_cred(tsk->mnt_cred);
 
 	cred = (struct cred *) tsk->cred;
 	tsk->cred = NULL;
@@ -318,6 +320,17 @@ struct cred *prepare_exec_creds(void)
 	return new;
 }
 
+static void flush_mnt_cred(struct task_struct *t)
+{
+	if (t->mnt_cred == t->cred)
+		return;
+	if (t->mnt_cred)
+		put_cred(t->mnt_cred);
+	t->mnt_cred = NULL;
+	/* mnt is only used for comparison, so it has no reference */
+	t->mnt = NULL;
+}
+
 /*
  * Copy credentials for the new process created by fork()
  *
@@ -344,6 +357,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 	    ) {
 		p->real_cred = get_cred(p->cred);
 		get_cred(p->cred);
+		p->mnt = NULL;
+		p->mnt_cred = NULL;
 		alter_cred_subscribers(p->cred, 2);
 		kdebug("share_creds(%p{%d,%d})",
 		       p->cred, atomic_read(&p->cred->usage),
@@ -383,6 +398,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
 
 	atomic_inc(&new->user->processes);
 	p->cred = p->real_cred = get_cred(new);
+	p->mnt = NULL;
+	p->mnt_cred = NULL;
 	alter_cred_subscribers(new, 2);
 	validate_creds(new);
 	return 0;
@@ -506,6 +523,7 @@ int commit_creds(struct cred *new)
 	/* release the old obj and subj refs both */
 	put_cred(old);
 	put_cred(old);
+	flush_mnt_cred(task);
 	return 0;
 }
 EXPORT_SYMBOL(commit_creds);
@@ -564,6 +582,7 @@ const struct cred *override_creds(const struct cred *new)
 	alter_cred_subscribers(new, 1);
 	rcu_assign_pointer(current->cred, new);
 	alter_cred_subscribers(old, -1);
+	flush_mnt_cred(current);
 
 	kdebug("override_creds() = %p{%d,%d}", old,
 	       atomic_read(&old->usage),
@@ -589,6 +608,7 @@ void revert_creds(const struct cred *old)
 
 	validate_creds(old);
 	validate_creds(override);
+	flush_mnt_cred(current);
 	alter_cred_subscribers(old, 1);
 	rcu_assign_pointer(current->cred, old);
 	alter_cred_subscribers(override, -1);

From patchwork Mon Feb 17 20:53:07 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Bottomley <James.Bottomley@HansenPartnership.com>
X-Patchwork-Id: 11387589
Return-Path: <SRS0=6ymt=4F=vger.kernel.org=linux-fsdevel-owner@kernel.org>
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org
 [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AC770924
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Mon, 17 Feb 2020 20:59:36 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id 83F1B207FD
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
 Mon, 17 Feb 2020 20:59:36 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=fail reason="signature verification failed" (1024-bit key)
 header.d=hansenpartnership.com header.i=@hansenpartnership.com
 header.b="GQSlVDQO"
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1729764AbgBQU7g (ORCPT
        <rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
        Mon, 17 Feb 2020 15:59:36 -0500
Received: from bedivere.hansenpartnership.com ([66.63.167.143]:37908 "EHLO
        bedivere.hansenpartnership.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1727300AbgBQU7f (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 17 Feb 2020 15:59:35 -0500
Received: from localhost (localhost [127.0.0.1])
        by bedivere.hansenpartnership.com (Postfix) with ESMTP id 614348EE218;
        Mon, 17 Feb 2020 12:59:35 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=hansenpartnership.com;
        s=20151216; t=1581973175;
        bh=oOI+sLNi+8MOJdLASIqHxh9N3SDyW7HWx2KX5RZprbU=;
        h=From:To:Cc:Subject:Date:In-Reply-To:References:From;
        b=GQSlVDQOYtYdaMioisaXRS0q0Te0CzgQ5P/3pLjNFG4eqnhYznDu1ur4deLaCTPjh
         6z4B8616X7jPBB+Tepv6vA74h1SV8VZ8q3NKvJWZzWTVdhEas0FXp5gN56TSRR0rjs
         01zL60h2GRp8ZcIsxm6+0ve9yUpRiO/d6IatdCKM=
Received: from bedivere.hansenpartnership.com ([127.0.0.1])
        by localhost (bedivere.hansenpartnership.com [127.0.0.1])
 (amavisd-new, port 10024)
        with ESMTP id Tpcxp99wljSs; Mon, 17 Feb 2020 12:59:35 -0800 (PST)
Received: from jarvis.lan (jarvis.ext.hansenpartnership.com [153.66.160.226])
        by bedivere.hansenpartnership.com (Postfix) with ESMTP id 682788EE0F5;
        Mon, 17 Feb 2020 12:59:34 -0800 (PST)
From: James Bottomley <James.Bottomley@HansenPartnership.com>
To: linux-fsdevel@vger.kernel.org
Cc: David Howells <dhowells@redhat.com>,
 Christian Brauner <christian@brauner.io>, Al Viro <viro@ZenIV.linux.org.uk>,
 Miklos Szeredi <miklos@szeredi.hu>,
 Seth Forshee <seth.forshee@canonical.com>, linux-unionfs@vger.kernel.org,
 Amir Goldstein <amir73il@gmail.com>,
 =?utf-8?q?St=C3=A9phane_Graber?= <stgraber@ubuntu.com>,
 Eric Biederman <ebiederm@xmission.com>, Aleksa Sarai <cyphar@cyphar.com>,
 "Serge E . Hallyn" <serge@hallyn.com>, Tycho Andersen <tycho@tycho.ws>,
 containers@lists.linux-foundation.org
Subject: [PATCH v3 3/3] fs: expose shifting bind mount to userspace
Date: Mon, 17 Feb 2020 12:53:07 -0800
Message-Id: <20200217205307.32256-4-James.Bottomley@HansenPartnership.com>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20200217205307.32256-1-James.Bottomley@HansenPartnership.com>
References: <20200217205307.32256-1-James.Bottomley@HansenPartnership.com>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org

This capability is exposed currently through the proposed new bind
API. To mark a mount for shifting, you add the allow-shift flag to the
properties, either by a reconfigure or a rebind.  Only real root on
the system can do this.  Once this is done, admin in a user namespace
(i.e. an unprivileged user) can take that mount point and bind it with
a shift in effect.

This patch has the ability to shift image uids built in.  The way you
do this is create a usernamespace where the fake root of the image is
seen by the interior uid as zero.  Then pass this namespace in using
the "ns" option of the bind.  One of the key design criteria is that
the mounted image never appears at exterior uid zero, so the image
shift isn't activated until the whole shift is applied.

The way an admin marks a mount is:

pathfd = open("/path/to/shift", O_PATH);
fd = configfd_open("bind", O_CLOEXEC);
configfd_action(fd, CONFIGFD_SET_FD, "pathfd", NULL, pathfd);
configfd_action(fd, CONFIGFD_SET_FLAG, "allow-shift", NULL, 0);
configfd_action(fd, CONFIGFD_SET_FLAG, "detached", NULL, 0);
/* optionally apply a shift to the base image */
nsfd = open("/proc/567/ns/user", O_RDONLY);  /* note: not O_PATH */
configfd_action(fd, CONFIGFD_SET_FD, "ns", NULL, nsfd);
/* then create the detached object */
configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
configfd_action(fd, CONFIGFD_GET_FD, "bindfd", &bindfd, O_CLOEXEC);

move_mount(bindfd, "", AT_FDCWD, "/path/to/allow", MOVE_MOUNT_F_EMPTY_PATH);

Technically /path/to/shift and /path/to/allow can be the same, which
basically installs a mnt at the path that allows onward traversal.

Then any mount namespace in a user namespace can do:

pathfd = open("/path/to/allow", O_PATH);
fd = configfd_open("bind", O_CLOEXEC);
configfd_action(fd, CONFIGFD_SET_FD, "pathfd", NULL, pathfd);
configfd_action(fd, CONFIGFD_SET_FLAG, "shift", NULL, 0);
configfd_action(fd, CONFIGFD_SET_FLAG, "detached", NULL, 0);
configfd_action(fd, CONFIGFD_CMD_CREATE, NULL, NULL, 0);
configfd_action(fd, CONFIGFD_GET_FD, "bindfd", &bindfd, O_CLOEXEC);

move_mount(bindfd, "", AT_FDCWD, "/path/to/mount", MOVE_MOUNT_F_EMPTY_PATH);

And /path/to/mount will have the uid/gid shifting bind mount installed.

Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---

v3: added ns option to do base image shifting
---
 fs/bind.c           | 105 +++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/mount.h          |   2 +
 fs/namespace.c      |   1 +
 fs/proc_namespace.c |   4 ++
 4 files changed, 103 insertions(+), 9 deletions(-)

diff --git a/fs/bind.c b/fs/bind.c
index c1dedef40169..0cb40d3d4289 100644
--- a/fs/bind.c
+++ b/fs/bind.c
@@ -6,23 +6,29 @@
  */
 
 #include <linux/configfd.h>
+#include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/mount.h>
 #include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+#include <linux/user_namespace.h>
 
 #include "internal.h"
 #include "mount.h"
 
 struct bind_data {
-	bool		ro:1;
-	bool		noexec:1;
-	bool		nosuid:1;
-	bool		nodev:1;
-	bool		detached:1;
-	bool		recursive:1;
-	struct file	*file;
-	struct file	*retfile;
+	bool			ro:1;
+	bool			noexec:1;
+	bool			nosuid:1;
+	bool			nodev:1;
+	bool			detached:1;
+	bool			recursive:1;
+	bool			shift:1;
+	bool			allow_shift:1;
+	struct file		*file;
+	struct file		*retfile;
+	struct user_namespace	*userns;
 };
 
 struct bind_data *to_bind_data(const struct configfd_context *cfc)
@@ -30,13 +36,54 @@ struct bind_data *to_bind_data(const struct configfd_context *cfc)
 	return cfc->data;
 }
 
+static int match_file(const void *p, struct file *file, unsigned int fd)
+{
+	if (p == file)
+		return fd + 1;
+	return 0;
+}
+
+static int bind_set_userns(const struct configfd_context *cfc,
+			   struct configfd_param *p)
+{
+	struct bind_data *bd = to_bind_data(cfc);
+	int fd;
+	struct file *file;
+	struct ns_common *ns;
+
+	if (!bd->allow_shift) {
+		plogger_err(&cfc->log, "ns may only be specified if allow_shift is also");
+		return -EINVAL;
+	}
+
+	/* huge fuss here because nsfs matches on fd not file */
+	fd = iterate_fd(current->files, 0, match_file, p->file);
+	if (fd == 0)
+		return -EINVAL;
+	fd--;
+	file = proc_ns_fget(fd);
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	ns = get_proc_ns(file_inode(file));
+	if (ns->ops->type != CLONE_NEWUSER)
+		return -EINVAL;
+
+	bd->userns = get_user_ns(container_of(ns, struct user_namespace, ns));
+
+	/* note we haven't consumed p->file so configfd will release it */
+	return 0;
+}
+
 static int bind_set_fd(const struct configfd_context *cfc,
 		       struct configfd_param *p)
 {
 	struct bind_data *bd = to_bind_data(cfc);
 	struct path *path;
 
-	if (strcmp(p->key, "pathfd") != 0)
+	if (strcmp(p->key, "ns") == 0)
+		return bind_set_userns(cfc, p);
+	else if (strcmp(p->key, "pathfd") != 0)
 		return -EINVAL;
 
 	path = &p->file->f_path;
@@ -66,6 +113,25 @@ static int bind_set_flag(const struct configfd_context *cfc,
 		bd->nodev = true;
 	} else if (strcmp(p->key, "noexec") == 0) {
 		bd->noexec = true;
+	} else if (strcmp(p->key, "shift") == 0) {
+		struct mount *m;
+
+		if (!bd->file) {
+			plogger_err(&cfc->log, "can't shift without setting pathfd");
+			return -EINVAL;
+		}
+		m = real_mount(bd->file->f_path.mnt);
+		if (!m->allow_shift) {
+			plogger_err(&cfc->log, "pathfd doesn't allow shifting");
+			return -EINVAL;
+		}
+		bd->shift = true;
+	} else if (strcmp(p->key, "allow-shift") == 0) {
+		if (!capable(CAP_SYS_ADMIN)) {
+			plogger_err(&cfc->log, "must be root to set allow-shift");
+			return -EPERM;
+		}
+		bd->allow_shift = true;
 	} else if (strcmp(p->key, "recursive") == 0 &&
 		   cfc->op == CONFIGFD_CMD_CREATE) {
 		bd->recursive = true;
@@ -126,6 +192,8 @@ static int bind_get_mnt_flags(struct bind_data *bd, int mnt_flags)
 		mnt_flags |= MNT_NODEV;
 	if (bd->noexec)
 		mnt_flags |= MNT_NOEXEC;
+	if (bd->shift)
+		mnt_flags |= MNT_SHIFT;
 
 	return mnt_flags;
 }
@@ -143,6 +211,13 @@ static int bind_reconfigure(const struct configfd_context *cfc)
 	mnt_flags = bd->file->f_path.mnt->mnt_flags & MNT_ATIME_MASK;
 	mnt_flags = bind_get_mnt_flags(bd, mnt_flags);
 
+	if (bd->allow_shift) {
+		struct mount *m = real_mount(bd->file->f_path.mnt);
+
+		/* FIXME: this should be set with the reconfigure locking */
+		m->allow_shift = true;
+	}
+
 	return do_reconfigure_mnt(&bd->file->f_path, mnt_flags);
 }
 
@@ -178,11 +253,21 @@ static int bind_create(const struct configfd_context *cfc)
 
 	if (bd->detached) {
 		int mnt_flags = f->f_path.mnt->mnt_flags & MNT_ATIME_MASK;
+		struct mount *m = real_mount(f->f_path.mnt);
 
 		mnt_flags = bind_get_mnt_flags(bd, mnt_flags);
 
 		/* since this is a detached copy, we can do without locking */
 		f->f_path.mnt->mnt_flags |= mnt_flags;
+
+		if (bd->allow_shift)
+			m->allow_shift = true;
+
+		if (bd->userns) {
+			put_user_ns(m->mnt_userns);
+			m->mnt_userns = bd->userns;
+			bd->userns = NULL; /* transfer the reference */
+		}
 	}
 
 	bd->retfile = f;
@@ -208,6 +293,8 @@ static void bind_free(const struct configfd_context *cfc)
 
 	if (bd->file)
 		fput(bd->file);
+	if (bd->userns)
+		put_user_ns(bd->userns);
 }
 
 static struct configfd_ops bind_type_ops = {
diff --git a/fs/mount.h b/fs/mount.h
index c3bfc6ced4c7..157e86cf9930 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,8 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	/* shifting bind mount parameters */
+	bool allow_shift:1;
 	struct user_namespace *mnt_userns; /* mapping for underlying mount uid/gid */
 } __randomize_layout;
 
diff --git a/fs/namespace.c b/fs/namespace.c
index 4720647588ab..75e4df4ddf54 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1040,6 +1040,7 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 
 	mnt->mnt.mnt_flags = old->mnt.mnt_flags;
 	mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
+	mnt->allow_shift = old->allow_shift;
 
 	atomic_inc(&sb->s_active);
 	mnt->mnt.mnt_sb = sb;
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 273ee82d8aa9..bdf8d23cf42e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -70,14 +70,18 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NOATIME, ",noatime" },
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
+		{ MNT_SHIFT, ",shift" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
+	struct mount *rm = real_mount(mnt);
 
 	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
 		if (mnt->mnt_flags & fs_infop->flag)
 			seq_puts(m, fs_infop->str);
 	}
+	if (rm->allow_shift)
+		seq_puts(m, ",allow-shift");
 }
 
 static inline void mangle(struct seq_file *m, const char *s)