[51/79] namei: remove restrictions on nesting depth

Message ID	1430803373-4948-51-git-send-email-viro@ZenIV.linux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Al Viro <viro@ZenIV.linux.org.uk> To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Neil Brown <neilb@suse.de>, Christoph Hellwig <hch@infradead.org>, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH 51/79] namei: remove restrictions on nesting depth Date: Tue, 5 May 2015 06:22:25 +0100 Message-Id: <1430803373-4948-51-git-send-email-viro@ZenIV.linux.org.uk> In-Reply-To: <20150505052205.GS889@ZenIV.linux.org.uk> References: <20150505052205.GS889@ZenIV.linux.org.uk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Message ID

1430803373-4948-51-git-send-email-viro@ZenIV.linux.org.uk (mailing list archive)

State

New, archived

Headers

From: Al Viro <viro@ZenIV.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Neil Brown <neilb@suse.de>, Christoph Hellwig <hch@infradead.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [PATCH 51/79] namei: remove restrictions on nesting depth
Date: Tue,  5 May 2015 06:22:25 +0100
Message-Id: <1430803373-4948-51-git-send-email-viro@ZenIV.linux.org.uk>
In-Reply-To: <20150505052205.GS889@ZenIV.linux.org.uk>
References: <20150505052205.GS889@ZenIV.linux.org.uk>
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk

Commit Message

Al Viro May 5, 2015, 5:22 a.m. UTC

From: Al Viro <viro@zeniv.linux.org.uk>

The only restriction is that on the total amount of symlinks
crossed; how they are nested does not matter

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/namei.c            | 66 ++++++++++++++++++++++++++++++++++++++++-----------
 include/linux/namei.h |  2 ++
 2 files changed, 54 insertions(+), 14 deletions(-)

Comments

NeilBrown May 6, 2015, 2:55 a.m. UTC | #1

On Tue,  5 May 2015 06:22:25 +0100 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> From: Al Viro <viro@zeniv.linux.org.uk>
> 
> The only restriction is that on the total amount of symlinks
> crossed; how they are nested does not matter
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---


> -	if (unlikely(current->total_link_count >= 40)) {
> +	if (unlikely(current->total_link_count >= MAXSYMLINKS)) {

There is still a literal '40' in follow_automount.

        current->total_link_count++;
        if (current->total_link_count >= 40)
                return -ELOOP;


should that become MAXSYMLINKS too?

Thanks,
NeilBrown

Al Viro May 6, 2015, 7:24 a.m. UTC | #2

On Wed, May 06, 2015 at 12:55:28PM +1000, NeilBrown wrote:

> > -	if (unlikely(current->total_link_count >= 40)) {
> > +	if (unlikely(current->total_link_count >= MAXSYMLINKS)) {
> 
> There is still a literal '40' in follow_automount.
> 
>         current->total_link_count++;
>         if (current->total_link_count >= 40)
>                 return -ELOOP;
> 
> 
> should that become MAXSYMLINKS too?

Yes, assuming we want to keep it at all...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro May 6, 2015, 4:22 p.m. UTC | #3

On Wed, May 06, 2015 at 08:24:07AM +0100, Al Viro wrote:
> On Wed, May 06, 2015 at 12:55:28PM +1000, NeilBrown wrote:
> 
> > > -	if (unlikely(current->total_link_count >= 40)) {
> > > +	if (unlikely(current->total_link_count >= MAXSYMLINKS)) {
> > 
> > There is still a literal '40' in follow_automount.
> > 
> >         current->total_link_count++;
> >         if (current->total_link_count >= 40)
> >                 return -ELOOP;
> > 
> > 
> > should that become MAXSYMLINKS too?
> 
> Yes, assuming we want to keep it at all...

To elaborate a bit - the reason why we count those among the symlink crossings
is historical; we used to implement automounting via ->follow_link().

Note that it's not really consistent - once a process has triggered automount,
subsequent lookups crossing that point *won't* have it counted as link
traversal.  Except for ones that come while automount attempt is in progress,
etc.

Limit on link traversals makes obvious sense wrt avoiding loops and DoS -
reaching a symlink in effect is adding a pile of path components to the
path yet to be traversed.  Automount points do nothing of that sort -
the pending path remains as-is.

So I'm not sure it makes sense to have those contribute to the same counter.
OTOH, there _is_ an unpleasant problem around follow_automount() - it's
a place where we call a method with heavy stack footprint still fairly
deep in stack; in mainline it can get as bad as ~2.8K deep, with this tree
the worst call chain entirely inside fs/namei.c is
	SyS_renameat2 -> user_path_parent -> filename_lookup -> path_lookupat ->
	path_init -> link_path_walk -> walk_component -> lookup_fast ->
	follow_managed -> ->d_automount
and it costs 1.1K; slightly more shallow chains lead from linkat(2) and
do_filp_open()/do_file_open_root().  Again, mainline is more than two times
worse on those.

->d_manage() gets hit on the same depth, ->d_revalidate() at a bit less than
that.

FWIW, the sums on a fairly sane amd64 config are
[->d_manage] 1104
[->d_automount] 1104
[->d_revalidate] 1024
[->lookup] 992
[->put_link] 896
[->permission] 832
[->follow_link] 768
[->d_hash] 768
[->d_weak_revalidate] 624
[->create] 544
[->tmpfile] 480
[->atomic_open] 480
[->rename] 336
[->rename2] 336
[->link] 240
[->unlink] 224
[->rmdir] 176
[->mknod] 160
[->symlink] 144
[->mkdir] 144

And ->d_automount() is easily the heaviest of that bunch in terms of stack
footprint.  So much that I really wonder if we ought to use something like
schedule_work() and (interruptibly) wait for completion; the interesting
question here is how much of the initiator's state is needed by worker.
The real namespace isn't - actual mounting is in finish_automount().
Credentials probably are; what about netns?  Suppose a process steps on
NFS4 referral point and triggers a new NFS mount; which netns should be
used - that of the triggering process or that of the NFS mount where we'd
run into the referral?  Ditto for AFS and CIFS...

BTW, if you want to torture the whole thing wrt stack footprint, try to make
/lib/firmware a symlink that would lead you through a referral point (on
the mainline - do so at the maximal nesting depth).  The thing is,
request_firmware() will chase that one *on* *caller's* *stack*.  Seeing that
we do have an async variant anyway (request_firmware_nowait()), could we
have request_firmware() itself go through schedule_work() (and wait for
completion, of course)?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/fs/namei.c b/fs/namei.c
index 8fd8ccb..078db1d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -492,6 +492,7 @@  void path_put(const struct path *path)
 }
 EXPORT_SYMBOL(path_put);
 
+#define EMBEDDED_LEVELS 2
 struct nameidata {
 	struct path	path;
 	union {
@@ -509,9 +510,42 @@  struct nameidata {
 		struct path link;
 		void *cookie;
 		const char *name;
-	} stack[MAX_NESTED_LINKS + 1];
+	} *stack, internal[EMBEDDED_LEVELS];
 };
 
+static void set_nameidata(struct nameidata *nd)
+{
+	nd->stack = nd->internal;
+}
+
+static void restore_nameidata(struct nameidata *nd)
+{
+	if (nd->stack != nd->internal) {
+		kfree(nd->stack);
+		nd->stack = nd->internal;
+	}
+}
+
+static int __nd_alloc_stack(struct nameidata *nd)
+{
+	struct saved *p = kmalloc((MAXSYMLINKS + 1) * sizeof(struct saved),
+				  GFP_KERNEL);
+	if (unlikely(!p))
+		return -ENOMEM;
+	memcpy(p, nd->internal, sizeof(nd->internal));
+	nd->stack = p;
+	return 0;
+}
+
+static inline int nd_alloc_stack(struct nameidata *nd)
+{
+	if (likely(nd->depth != EMBEDDED_LEVELS - 1))
+		return 0;
+	if (likely(nd->stack != nd->internal))
+		return 0;
+	return __nd_alloc_stack(nd);
+}
+
 /*
  * Path walking has 2 modes, rcu-walk and ref-walk (see
  * Documentation/filesystems/path-lookup.txt).  In situations when we can't
@@ -857,7 +891,7 @@  const char *get_link(struct nameidata *nd)
 	if (nd->link.mnt == nd->path.mnt)
 		mntget(nd->link.mnt);
 
-	if (unlikely(current->total_link_count >= 40)) {
+	if (unlikely(current->total_link_count >= MAXSYMLINKS)) {
 		path_put(&nd->path);
 		path_put(&nd->link);
 		return ERR_PTR(-ELOOP);
@@ -1781,22 +1815,18 @@  Walked:
 		if (err) {
 			const char *s;
 
-			if (unlikely(current->link_count >= MAX_NESTED_LINKS)) {
-				path_put_conditional(&nd->link, nd);
-				path_put(&nd->path);
-				err = -ELOOP;
-				goto Err;
+			err = nd_alloc_stack(nd);
+			if (unlikely(err)) {
+				path_to_nameidata(&nd->link, nd);
+				break;
 			}
-			BUG_ON(nd->depth >= MAX_NESTED_LINKS);
 
 			nd->depth++;
-			current->link_count++;
 
 			s = get_link(nd);
 
 			if (unlikely(IS_ERR(s))) {
 				err = PTR_ERR(s);
-				current->link_count--;
 				nd->depth--;
 				goto Err;
 			}
@@ -1804,7 +1834,6 @@  Walked:
 			if (unlikely(!s)) {
 				/* jumped */
 				put_link(nd);
-				current->link_count--;
 				nd->depth--;
 			} else {
 				if (*s == '/') {
@@ -1834,7 +1863,6 @@  Walked:
 Err:
 	while (unlikely(nd->depth)) {
 		put_link(nd);
-		current->link_count--;
 		nd->depth--;
 	}
 	return err;
@@ -1843,7 +1871,6 @@  OK:
 		name = nd->stack[nd->depth].name;
 		err = walk_component(nd, LOOKUP_FOLLOW);
 		put_link(nd);
-		current->link_count--;
 		nd->depth--;
 		goto Walked;
 	}
@@ -2047,7 +2074,11 @@  static int path_lookupat(int dfd, const struct filename *name,
 static int filename_lookup(int dfd, struct filename *name,
 				unsigned int flags, struct nameidata *nd)
 {
-	int retval = path_lookupat(dfd, name, flags | LOOKUP_RCU, nd);
+	int retval;
+
+	set_nameidata(nd);
+	retval = path_lookupat(dfd, name, flags | LOOKUP_RCU, nd);
+
 	if (unlikely(retval == -ECHILD))
 		retval = path_lookupat(dfd, name, flags, nd);
 	if (unlikely(retval == -ESTALE))
@@ -2055,6 +2086,7 @@  static int filename_lookup(int dfd, struct filename *name,
 
 	if (likely(!retval))
 		audit_inode(name, nd->path.dentry, flags & LOOKUP_PARENT);
+	restore_nameidata(nd);
 	return retval;
 }
 
@@ -2385,6 +2417,7 @@  filename_mountpoint(int dfd, struct filename *name, struct path *path,
 	int error;
 	if (IS_ERR(name))
 		return PTR_ERR(name);
+	set_nameidata(&nd);
 	error = path_mountpoint(dfd, name, path, &nd, flags | LOOKUP_RCU);
 	if (unlikely(error == -ECHILD))
 		error = path_mountpoint(dfd, name, path, &nd, flags);
@@ -2392,6 +2425,7 @@  filename_mountpoint(int dfd, struct filename *name, struct path *path,
 		error = path_mountpoint(dfd, name, path, &nd, flags | LOOKUP_REVAL);
 	if (likely(!error))
 		audit_inode(name, path->dentry, 0);
+	restore_nameidata(&nd);
 	putname(name);
 	return error;
 }
@@ -3281,11 +3315,13 @@  struct file *do_filp_open(int dfd, struct filename *pathname,
 	int flags = op->lookup_flags;
 	struct file *filp;
 
+	set_nameidata(&nd);
 	filp = path_openat(dfd, pathname, &nd, op, flags | LOOKUP_RCU);
 	if (unlikely(filp == ERR_PTR(-ECHILD)))
 		filp = path_openat(dfd, pathname, &nd, op, flags);
 	if (unlikely(filp == ERR_PTR(-ESTALE)))
 		filp = path_openat(dfd, pathname, &nd, op, flags | LOOKUP_REVAL);
+	restore_nameidata(&nd);
 	return filp;
 }
 
@@ -3299,6 +3335,7 @@  struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 
 	nd.root.mnt = mnt;
 	nd.root.dentry = dentry;
+	set_nameidata(&nd);
 
 	if (d_is_symlink(dentry) && op->intent & LOOKUP_OPEN)
 		return ERR_PTR(-ELOOP);
@@ -3312,6 +3349,7 @@  struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 		file = path_openat(-1, filename, &nd, op, flags);
 	if (unlikely(file == ERR_PTR(-ESTALE)))
 		file = path_openat(-1, filename, &nd, op, flags | LOOKUP_REVAL);
+	restore_nameidata(&nd);
 	putname(filename);
 	return file;
 }
diff --git a/include/linux/namei.h b/include/linux/namei.h
index a5d5bed..3a6cc96 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -11,6 +11,8 @@  struct nameidata;
 
 enum { MAX_NESTED_LINKS = 8 };
 
+#define MAXSYMLINKS 40
+
 /*
  * Type of the last component on LOOKUP_PARENT
  */

[51/79] namei: remove restrictions on nesting depth

Commit Message

Comments

Patch