fs: Teach path_connected to handle nfs filesystems with multiple roots.

Message ID 87muzai6rm.fsf_-_@xmission.com (mailing list archive)
State New, archived

Commit Message

Eric W. Biederman March 14, 2018, 11:20 p.m. UTC
On nfsv2 and nfsv3 the nfs server can export subsets of the same
filesystem and report the same filesystem identifier, so that the nfs
client can know they are the same filesystem.  The subsets can be from
disjoint directory trees.  The nfsv2 and nfsv3 filesystems provide no
way to find the common root of all directory trees exported from the
server with the same filesystem identifier.

The practical result is that for nfs, s_root in struct super_block is
not necessarily the root of the filesystem.  The nfs mount code sets
s_root to the root of the first subset of the nfs filesystem that the
kernel mounts.

This affects the dcache invalidation code in generic_shutdown_super,
currently called shrink_dcache_for_umount.  To accommodate nfs, that
code has for years gone through an additional list of dentries that
might be dentry trees that need to be freed.

When I wrote path_connected I did not realize nfs was so special, and
its heuristic for avoiding calling is_subdir can fail.

The practical case where this fails is when there is a move of a
directory from the subtree exposed by one nfs mount to the subtree
exposed by another nfs mount.  This move can happen either locally or
remotely.  The remote case requires that the moved directory be cached
before the move, and that after the move someone walks the path
to where the moved directory now exists and in so doing causes the
already cached directory to be moved in the dcache through the magic
of d_splice_alias.

If someone whose working directory is in the moved directory or a
subdirectory of it now starts following .. from the initial mount of
nfs (where s_root == mnt_root), then path_connected as a heuristic will
not bother with the is_subdir check.  As s_root really is not the root
of the nfs filesystem this heuristic is wrong: the path may actually
not be connected, and path_connected will still report that it is.
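
The failure described above can be illustrated with a small userspace
model.  This is only a sketch: every toy_* name is hypothetical, the
structures are simplified stand-ins for the kernel's dentry,
super_block and vfsmount, and the "patched" variant merely mirrors the
shape of the check in this patch.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for kernel structures. */
struct toy_dentry {
	struct toy_dentry *parent;	/* a root points to itself */
};

struct toy_sb {
	struct toy_dentry *s_root;
	unsigned long s_iflags;
};

struct toy_mount {
	struct toy_dentry *mnt_root;
	struct toy_sb *mnt_sb;
};

#define TOY_SB_I_MULTIROOT 0x00000008UL	/* mirrors the flag added here */

/* Walk parent pointers; true if d is root or lies below it. */
bool toy_is_subdir(const struct toy_dentry *d, const struct toy_dentry *root)
{
	for (;;) {
		if (d == root)
			return true;
		if (d->parent == d)	/* reached some other tree's root */
			return false;
		d = d->parent;
	}
}

/* Heuristic before the patch: trust s_root == mnt_root. */
bool toy_path_connected_old(const struct toy_mount *mnt,
			    const struct toy_dentry *d)
{
	if (mnt->mnt_root == mnt->mnt_sb->s_root)
		return true;
	return toy_is_subdir(d, mnt->mnt_root);
}

/* Heuristic after the patch: skip the shortcut on multi-root sbs. */
bool toy_path_connected_new(const struct toy_mount *mnt,
			    const struct toy_dentry *d)
{
	const struct toy_sb *sb = mnt->mnt_sb;

	if (!(sb->s_iflags & TOY_SB_I_MULTIROOT) && mnt->mnt_root == sb->s_root)
		return true;
	return toy_is_subdir(d, mnt->mnt_root);
}

/*
 * Two export roots share one superblock; s_root is the first one
 * mounted.  If "moved" is set, the directory now hangs under the
 * second export while still being reached via the first mount.
 */
bool toy_dir_connected(bool patched, bool moved)
{
	struct toy_dentry export_a = { .parent = &export_a };
	struct toy_dentry export_b = { .parent = &export_b };
	struct toy_dentry dir = { .parent = moved ? &export_b : &export_a };
	struct toy_sb sb = { .s_root = &export_a,
			     .s_iflags = TOY_SB_I_MULTIROOT };
	struct toy_mount mnt_a = { .mnt_root = &export_a, .mnt_sb = &sb };

	return patched ? toy_path_connected_new(&mnt_a, &dir)
		       : toy_path_connected_old(&mnt_a, &dir);
}
```

In this model the old check reports the moved directory as connected
because it never looks at the dentry, while the new check falls
through to the parent walk and reports it as disconnected.  A
directory that never moved is reported as connected by both.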

The is_subdir function might be cheap enough that we can call it
unconditionally.  Verifying that will take some benchmarking and
the result may not be the same on all kernels this fix needs
to be backported to.  So I am avoiding that for now.

Filesystems with snapshots such as nilfs and btrfs do something
similar.  But as the directory trees of the snapshots are disjoint
from one another and from the main directory tree, rename won't move
things between them and this problem will not occur.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Fixes: 397d425dc26d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

Al do you want to push this one to Linus or shall I?

 fs/namei.c         | 5 +++--
 fs/nfs/super.c     | 2 ++
 include/linux/fs.h | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

Comments

Al Viro March 15, 2018, 10:34 p.m. UTC | #1
On Wed, Mar 14, 2018 at 06:20:29PM -0500, Eric W. Biederman wrote:
> Al do you want to push this one to Linus or shall I?

Applied; I think there might be a helper lurking in there, but for now
that'll do.

Patch

diff --git a/fs/namei.c b/fs/namei.c
index 921ae32dbc80..cafa365eeb70 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -559,9 +559,10 @@  static int __nd_alloc_stack(struct nameidata *nd)
 static bool path_connected(const struct path *path)
 {
 	struct vfsmount *mnt = path->mnt;
+	struct super_block *sb = mnt->mnt_sb;
 
-	/* Only bind mounts can have disconnected paths */
-	if (mnt->mnt_root == mnt->mnt_sb->s_root)
+	/* Bind mounts and multi-root filesystems can have disconnected paths */
+	if (!(sb->s_iflags & SB_I_MULTIROOT) && (mnt->mnt_root == sb->s_root))
 		return true;
 
 	return is_subdir(path->dentry, mnt->mnt_root);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 29bacdc56f6a..5e470e233c83 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -2631,6 +2631,8 @@  struct dentry *nfs_fs_mount_common(struct nfs_server *server,
 		/* initial superblock/root creation */
 		mount_info->fill_super(s, mount_info);
 		nfs_get_cache_cookie(s, mount_info->parsed, mount_info->cloned);
+		if (!(server->flags & NFS_MOUNT_UNSHARED))
+			s->s_iflags |= SB_I_MULTIROOT;
 	}
 
 	mntroot = nfs_get_root(s, mount_info->mntfh, dev_name);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a815560fda0..0430e03febaa 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1317,6 +1317,7 @@  extern int send_sigurg(struct fown_struct *fown);
 #define SB_I_CGROUPWB	0x00000001	/* cgroup-aware writeback enabled */
 #define SB_I_NOEXEC	0x00000002	/* Ignore executables on this fs */
 #define SB_I_NODEV	0x00000004	/* Ignore devices on this fs */
+#define SB_I_MULTIROOT	0x00000008	/* Multiple roots to the dentry tree */
 
 /* sb->s_iflags to limit user namespace mounts */
 #define SB_I_USERNS_VISIBLE		0x00000010 /* fstype already mounted */
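
The nfs hunk above sets the new flag only when the mount is not
unshared.  A plausible reading, stated here as an assumption rather
than something the patch spells out, is that an unshared mount gets its
own superblock, so its s_root really is that mount's root.  A minimal
userspace sketch of the flag logic follows; the toy_* names are
hypothetical and TOY_NFS_MOUNT_UNSHARED is a placeholder value, not
taken from the kernel headers.

```c
/* Hypothetical stand-ins; only the bit logic mirrors the patch. */
#define TOY_NFS_MOUNT_UNSHARED	0x8000u		/* placeholder value */
#define TOY_SB_I_NODEV		0x00000004UL
#define TOY_SB_I_MULTIROOT	0x00000008UL

/*
 * Return the s_iflags a freshly created nfs superblock would carry:
 * the existing bits, plus MULTIROOT unless the mount is unshared.
 */
unsigned long toy_nfs_initial_iflags(unsigned long iflags,
				     unsigned int server_flags)
{
	if (!(server_flags & TOY_NFS_MOUNT_UNSHARED))
		iflags |= TOY_SB_I_MULTIROOT;
	return iflags;
}
```

With these definitions a shared mount ends up with the MULTIROOT bit
set alongside whatever bits were already present, and an unshared
mount keeps its flags unchanged.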