[v2,05/11] ovl: lookup redirect by file handle

Message ID	1493025256-27188-6-git-send-email-amir73il@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Amir Goldstein <amir73il@gmail.com> To: Miklos Szeredi <miklos@szeredi.hu> Cc: Vivek Goyal <vgoyal@redhat.com>, Al Viro <viro@zeniv.linux.org.uk>, linux-unionfs@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [PATCH v2 05/11] ovl: lookup redirect by file handle Date: Mon, 24 Apr 2017 12:14:10 +0300 Message-Id: <1493025256-27188-6-git-send-email-amir73il@gmail.com> In-Reply-To: <1493025256-27188-1-git-send-email-amir73il@gmail.com> References: <1493025256-27188-1-git-send-email-amir73il@gmail.com> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Amir Goldstein April 24, 2017, 9:14 a.m. UTC

When overlay.fh xattr is found in a directory inode, instead of lookup
of the dentry in next lower layer by name, first try to get it by calling
exportfs_decode_fh().

On failure to lookup by file handle to lower layer, fall back to lookup
by name with or without path redirect.

For now we only support following by file handle from upper if there is a
single lower layer, because fallback from lookup by file hande to lookup
by path in mid layers is not yet implemented.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/overlayfs/namei.c     | 185 +++++++++++++++++++++++++++++++++++++++++++----
 fs/overlayfs/overlayfs.h |   1 +
 fs/overlayfs/util.c      |  14 ++++
 3 files changed, 186 insertions(+), 14 deletions(-)

Amir Goldstein April 25, 2017, 8:10 a.m. UTC | #1

On Mon, Apr 24, 2017 at 12:14 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> When overlay.fh xattr is found in a directory inode, instead of lookup
> of the dentry in next lower layer by name, first try to get it by calling
> exportfs_decode_fh().
>
> On failure to lookup by file handle to lower layer, fall back to lookup
> by name with or without path redirect.
>
> For now we only support following by file handle from upper if there is a
> single lower layer, because fallback from lookup by file hande to lookup
> by path in mid layers is not yet implemented.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/overlayfs/namei.c     | 185 +++++++++++++++++++++++++++++++++++++++++++----
>  fs/overlayfs/overlayfs.h |   1 +
>  fs/overlayfs/util.c      |  14 ++++
>  3 files changed, 186 insertions(+), 14 deletions(-)
>
> diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
> index d660177..0d1cc8f 100644
> --- a/fs/overlayfs/namei.c
> +++ b/fs/overlayfs/namei.c
> @@ -9,9 +9,11 @@
>
>  #include <linux/fs.h>
>  #include <linux/cred.h>
> +#include <linux/mount.h>
>  #include <linux/namei.h>
>  #include <linux/xattr.h>
>  #include <linux/ratelimit.h>
> +#include <linux/exportfs.h>
>  #include "overlayfs.h"
>  #include "ovl_entry.h"
>
> @@ -21,7 +23,10 @@ struct ovl_lookup_data {
>         bool opaque;
>         bool stop;
>         bool last;
> -       char *redirect;
> +       bool by_path;           /* redirect by path */
> +       bool by_fh;             /* redirect by file handle */
> +       char *redirect;         /* path to follow */
> +       struct ovl_fh *fh;      /* file handle to follow */
>  };
>
>  static int ovl_check_redirect(struct dentry *dentry, struct ovl_lookup_data *d,
> @@ -81,6 +86,42 @@ static int ovl_check_redirect(struct dentry *dentry, struct ovl_lookup_data *d,
>         goto err_free;
>  }
>
> +static int ovl_check_redirect_fh(struct dentry *dentry,
> +                                struct ovl_lookup_data *d)
> +{
> +       int res;
> +       void *buf = NULL;
> +
> +       res = vfs_getxattr(dentry, OVL_XATTR_FH, NULL, 0);
> +       if (res < 0) {
> +               if (res == -ENODATA || res == -EOPNOTSUPP)
> +                       return 0;
> +               goto fail;
> +       }
> +       buf = kzalloc(res, GFP_TEMPORARY);
> +       if (!buf)
> +               return -ENOMEM;
> +
> +       if (res == 0)
> +               goto fail;
> +
> +       res = vfs_getxattr(dentry, OVL_XATTR_FH, buf, res);
> +       if (res < 0 || !ovl_redirect_fh_ok(buf, res))
> +               goto fail;
> +
> +       kfree(d->fh);
> +       d->fh = buf;
> +
> +       return 0;
> +
> +err_free:
> +       kfree(buf);
> +       return 0;
> +fail:
> +       pr_warn_ratelimited("overlayfs: failed to get file handle (%i)\n", res);
> +       goto err_free;
> +}
> +
>  static bool ovl_is_opaquedir(struct dentry *dentry)
>  {
>         int res;
> @@ -96,22 +137,81 @@ static bool ovl_is_opaquedir(struct dentry *dentry)
>         return false;
>  }
>
> +/* Check if p1 is connected with a chain of hashed dentries to p2 */
> +static bool ovl_is_lookable(struct dentry *p1, struct dentry *p2)
> +{
> +       struct dentry *p;
> +
> +       for (p = p2; !IS_ROOT(p); p = p->d_parent) {
> +               if (d_unhashed(p))
> +                       return false;
> +               if (p->d_parent == p1)
> +                       return true;
> +       }
> +       return false;
> +}
> +
> +/* Check if dentry is reachable from mnt via path lookup */
> +static int ovl_dentry_under_mnt(void *ctx, struct dentry *dentry)
> +{
> +       struct vfsmount *mnt = ctx;
> +
> +       return ovl_is_lookable(mnt->mnt_root, dentry);
> +}
> +
> +static struct dentry *ovl_lookup_fh(struct vfsmount *mnt,
> +                                   const struct ovl_fh *fh)
> +{
> +       int bytes = (fh->len - offsetof(struct ovl_fh, fid));
> +
> +       /*
> +        * When redirect_fh is disabled, 'invalid' file handles are stored
> +        * to indicate that this entry has been copied up.
> +        */
> +       if (!bytes || (int)fh->type == FILEID_INVALID)
> +               return ERR_PTR(-ESTALE);
> +
> +       /*
> +        * Several layers can be on the same fs and decoded dentry may be in
> +        * either one of those layers. We are looking for a match of dentry
> +        * and mnt to find out to which layer the decoded dentry belongs to.
> +        */
> +       return exportfs_decode_fh(mnt, (struct fid *)fh->fid,
> +                                 bytes >> 2, (int)fh->type,
> +                                 ovl_dentry_under_mnt, mnt);
> +}
> +
>  static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
>                              const char *name, unsigned int namelen,
>                              size_t prelen, const char *post,
> -                            struct dentry **ret)
> +                            struct vfsmount *mnt, struct dentry **ret)
>  {
>         struct dentry *this;
>         int err;
>
> -       this = lookup_one_len_unlocked(name, base, namelen);
> +       /*
> +        * Lookup of upper is with null d->fh.
> +        * Lookup of lower is either by_fh with non-null d->fh
> +        * or by_path with null d->fh.
> +        */
> +       if (d->fh)
> +               this = ovl_lookup_fh(mnt, d->fh);
> +       else
> +               this = lookup_one_len_unlocked(name, base, namelen);
>         if (IS_ERR(this)) {
>                 err = PTR_ERR(this);
>                 this = NULL;
>                 if (err == -ENOENT || err == -ENAMETOOLONG)
>                         goto out;
> +               if (d->fh && err == -ESTALE)
> +                       goto out;
>                 goto out_err;
>         }
> +
> +       /* If found by file handle - don't follow that handle again */
> +       kfree(d->fh);
> +       d->fh = NULL;
> +
>         if (!this->d_inode)
>                 goto put_and_out;
>
> @@ -135,9 +235,18 @@ static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
>                 d->stop = d->opaque = true;
>                 goto out;
>         }
> -       err = ovl_check_redirect(this, d, prelen, post);
> -       if (err)
> -               goto out_err;
> +       if (d->last)
> +               goto out;
> +       if (d->by_path) {
> +               err = ovl_check_redirect(this, d, prelen, post);
> +               if (err)
> +                       goto out_err;
> +       }
> +       if (d->by_fh) {
> +               err = ovl_check_redirect_fh(this, d);
> +               if (err)
> +                       goto out_err;
> +       }
>  out:
>         *ret = this;
>         return 0;
> @@ -152,6 +261,12 @@ static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
>         return err;
>  }
>
> +static int ovl_lookup_layer_fh(struct path *path, struct ovl_lookup_data *d,
> +                              struct dentry **ret)
> +{
> +       return ovl_lookup_single(path->dentry, d, "", 0, 0, "", path->mnt, ret);
> +}
> +
>  static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
>                             struct dentry **ret)
>  {
> @@ -162,7 +277,7 @@ static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
>
>         if (d->name.name[0] != '/')
>                 return ovl_lookup_single(base, d, d->name.name, d->name.len,
> -                                        0, "", ret);
> +                                        0, "", NULL, ret);
>
>         while (!IS_ERR_OR_NULL(base) && d_can_lookup(base)) {
>                 const char *s = d->name.name + d->name.len - rem;
> @@ -175,7 +290,7 @@ static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
>                         return -EIO;
>
>                 err = ovl_lookup_single(base, d, s, thislen,
> -                                       d->name.len - rem, next, &base);
> +                                       d->name.len - rem, next, NULL, &base);
>                 dput(dentry);
>                 if (err)
>                         return err;
> @@ -220,6 +335,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>         const struct cred *old_cred;
>         struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
>         struct ovl_entry *poe = dentry->d_parent->d_fsdata;
> +       struct ovl_entry *roe = dentry->d_sb->s_root->d_fsdata;
>         struct path *stack = NULL;
>         struct dentry *upperdir, *upperdentry = NULL;
>         unsigned int ctr = 0;
> @@ -235,7 +351,10 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                 .opaque = false,
>                 .stop = false,
>                 .last = !poe->numlower,
> +               .by_path = true,
>                 .redirect = NULL,
> +               .by_fh = true,
> +               .fh = NULL,
>         };
>
>         if (dentry->d_name.len > ofs->namelen)
> @@ -259,13 +378,23 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                         if (!upperredirect)
>                                 goto out_put_upper;
>                         if (d.redirect[0] == '/')
> -                               poe = dentry->d_sb->s_root->d_fsdata;
> +                               poe = roe;
>                 }
>                 if (d.opaque)
>                         type |= __OVL_PATH_OPAQUE;
>         }
>
> -       if (!d.stop && poe->numlower) {
> +       /*
> +        * For now we only support lower by fh in single layer, because
> +        * fallback from lookup by fh to lookup by path in mid layers for
> +        * merge directory is not yet implemented.
> +        */
> +       if (!ofs->redirect_fh || ofs->numlower > 1) {
> +               kfree(d.fh);
> +               d.fh = NULL;
> +       }
> +
> +       if (!d.stop && (poe->numlower || d.fh)) {
>                 err = -ENOMEM;
>                 stack = kcalloc(ofs->numlower, sizeof(struct path),
>                                 GFP_TEMPORARY);
> @@ -273,6 +402,35 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                         goto out_put_upper;
>         }
>
> +       /* Try to lookup lower layers by file handle */
> +       d.by_path = false;
> +       for (i = 0; !d.stop && d.fh && i < roe->numlower; i++) {
> +               struct path lowerpath = poe->lowerstack[i];
> +
> +               d.last = i == poe->numlower - 1;

copy&paste bug: should be s/poe/roe in 2 lines above.
it matters especially when lower files are moved into an opaque dir
I am improving xfstest overlay/017 to cover this case.

> +               err = ovl_lookup_layer_fh(&lowerpath, &d, &this);
> +               if (err)
> +                       goto out_put;
> +
> +               if (!this)
> +                       continue;
> +
> +               stack[ctr].dentry = this;
> +               stack[ctr].mnt = lowerpath.mnt;
> +               ctr++;
> +               /*
> +                * Found by fh - won't lookup by path.
> +                * TODO: set d.redirect to dentry_path(this),
> +                *       so lookup can continue by path.
> +                */
> +               d.stop = true;
> +       }
> +
> +       /* Fallback to lookup lower layers by path */
> +       d.by_path = true;
> +       d.by_fh = false;
> +       kfree(d.fh);
> +       d.fh = NULL;
>         for (i = 0; !d.stop && i < poe->numlower; i++) {
>                 struct path lowerpath = poe->lowerstack[i];
>
> @@ -291,10 +449,8 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                 if (d.stop)
>                         break;
>
> -               if (d.redirect &&
> -                   d.redirect[0] == '/' &&
> -                   poe != dentry->d_sb->s_root->d_fsdata) {
> -                       poe = dentry->d_sb->s_root->d_fsdata;
> +               if (d.redirect && d.redirect[0] == '/' && poe != roe) {
> +                       poe = roe;
>
>                         /* Find the current layer on the root dentry */
>                         for (i = 0; i < poe->numlower; i++)
> @@ -354,6 +510,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>         dput(upperdentry);
>         kfree(upperredirect);
>  out:
> +       kfree(d.fh);
>         kfree(d.redirect);
>         revert_creds(old_cred);
>         return ERR_PTR(err);
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index c3cfbc5..08002ce 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -190,6 +190,7 @@ const char *ovl_dentry_get_redirect(struct dentry *dentry);
>  void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect);
>  bool ovl_redirect_fh(struct super_block *sb);
>  void ovl_clear_redirect_fh(struct super_block *sb);
> +bool ovl_redirect_fh_ok(const char *redirect, size_t size);
>  void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry);
>  void ovl_inode_init(struct inode *inode, struct inode *realinode,
>                     bool is_upper);
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index b3bc117..dba9753 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -254,6 +254,20 @@ void ovl_clear_redirect_fh(struct super_block *sb)
>         ofs->redirect_fh = false;
>  }
>
> +bool ovl_redirect_fh_ok(const char *redirect, size_t size)
> +{
> +       struct ovl_fh *fh = (void *)redirect;
> +
> +       if (size < sizeof(struct ovl_fh) || size < fh->len)
> +               return false;
> +
> +       if (fh->version > OVL_FH_VERSION ||
> +           fh->magic != OVL_FH_MAGIC)
> +               return false;
> +
> +       return true;
> +}
> +
>  void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
>  {
>         struct ovl_entry *oe = dentry->d_fsdata;
> --
> 2.7.4
>

Miklos Szeredi April 25, 2017, 3:13 p.m. UTC | #2

On Mon, Apr 24, 2017 at 11:14 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> When overlay.fh xattr is found in a directory inode, instead of lookup
> of the dentry in next lower layer by name, first try to get it by calling
> exportfs_decode_fh().
>
> On failure to lookup by file handle to lower layer, fall back to lookup
> by name with or without path redirect.
>
> For now we only support following by file handle from upper if there is a
> single lower layer, because fallback from lookup by file hande to lookup
> by path in mid layers is not yet implemented.
>
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>  fs/overlayfs/namei.c     | 185 +++++++++++++++++++++++++++++++++++++++++++----
>  fs/overlayfs/overlayfs.h |   1 +
>  fs/overlayfs/util.c      |  14 ++++
>  3 files changed, 186 insertions(+), 14 deletions(-)
>
> diff --git a/fs/overlayfs/namei.c b/fs/overlayfs/namei.c
> index d660177..0d1cc8f 100644
> --- a/fs/overlayfs/namei.c
> +++ b/fs/overlayfs/namei.c
> @@ -9,9 +9,11 @@
>
>  #include <linux/fs.h>
>  #include <linux/cred.h>
> +#include <linux/mount.h>
>  #include <linux/namei.h>
>  #include <linux/xattr.h>
>  #include <linux/ratelimit.h>
> +#include <linux/exportfs.h>
>  #include "overlayfs.h"
>  #include "ovl_entry.h"
>
> @@ -21,7 +23,10 @@ struct ovl_lookup_data {
>         bool opaque;
>         bool stop;
>         bool last;
> -       char *redirect;
> +       bool by_path;           /* redirect by path */
> +       bool by_fh;             /* redirect by file handle */
> +       char *redirect;         /* path to follow */
> +       struct ovl_fh *fh;      /* file handle to follow */
>  };
>
>  static int ovl_check_redirect(struct dentry *dentry, struct ovl_lookup_data *d,
> @@ -81,6 +86,42 @@ static int ovl_check_redirect(struct dentry *dentry, struct ovl_lookup_data *d,
>         goto err_free;
>  }
>
> +static int ovl_check_redirect_fh(struct dentry *dentry,
> +                                struct ovl_lookup_data *d)
> +{
> +       int res;
> +       void *buf = NULL;
> +
> +       res = vfs_getxattr(dentry, OVL_XATTR_FH, NULL, 0);
> +       if (res < 0) {
> +               if (res == -ENODATA || res == -EOPNOTSUPP)
> +                       return 0;
> +               goto fail;
> +       }
> +       buf = kzalloc(res, GFP_TEMPORARY);
> +       if (!buf)
> +               return -ENOMEM;
> +
> +       if (res == 0)
> +               goto fail;
> +
> +       res = vfs_getxattr(dentry, OVL_XATTR_FH, buf, res);
> +       if (res < 0 || !ovl_redirect_fh_ok(buf, res))
> +               goto fail;
> +
> +       kfree(d->fh);
> +       d->fh = buf;
> +
> +       return 0;
> +
> +err_free:
> +       kfree(buf);
> +       return 0;
> +fail:
> +       pr_warn_ratelimited("overlayfs: failed to get file handle (%i)\n", res);
> +       goto err_free;
> +}
> +
>  static bool ovl_is_opaquedir(struct dentry *dentry)
>  {
>         int res;
> @@ -96,22 +137,81 @@ static bool ovl_is_opaquedir(struct dentry *dentry)
>         return false;
>  }
>
> +/* Check if p1 is connected with a chain of hashed dentries to p2 */
> +static bool ovl_is_lookable(struct dentry *p1, struct dentry *p2)
> +{
> +       struct dentry *p;
> +
> +       for (p = p2; !IS_ROOT(p); p = p->d_parent) {
> +               if (d_unhashed(p))
> +                       return false;
> +               if (p->d_parent == p1)
> +                       return true;
> +       }
> +       return false;
> +}

Walking the dentry tree without RCU protection is dangerous and broken.

I'm also wondering if there's a better way to find the layer (e.g.
store the layer index in the handle as well).

> +
> +/* Check if dentry is reachable from mnt via path lookup */
> +static int ovl_dentry_under_mnt(void *ctx, struct dentry *dentry)
> +{
> +       struct vfsmount *mnt = ctx;
> +
> +       return ovl_is_lookable(mnt->mnt_root, dentry);
> +}
> +
> +static struct dentry *ovl_lookup_fh(struct vfsmount *mnt,
> +                                   const struct ovl_fh *fh)
> +{
> +       int bytes = (fh->len - offsetof(struct ovl_fh, fid));
> +
> +       /*
> +        * When redirect_fh is disabled, 'invalid' file handles are stored
> +        * to indicate that this entry has been copied up.
> +        */
> +       if (!bytes || (int)fh->type == FILEID_INVALID)
> +               return ERR_PTR(-ESTALE);
> +
> +       /*
> +        * Several layers can be on the same fs and decoded dentry may be in
> +        * either one of those layers. We are looking for a match of dentry
> +        * and mnt to find out to which layer the decoded dentry belongs to.
> +        */
> +       return exportfs_decode_fh(mnt, (struct fid *)fh->fid,
> +                                 bytes >> 2, (int)fh->type,
> +                                 ovl_dentry_under_mnt, mnt);
> +}
> +
>  static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
>                              const char *name, unsigned int namelen,
>                              size_t prelen, const char *post,
> -                            struct dentry **ret)
> +                            struct vfsmount *mnt, struct dentry **ret)

I think it would be better to split this function into path and fh
variants and extract the common parts into helper(s).

>  {
>         struct dentry *this;
>         int err;
>
> -       this = lookup_one_len_unlocked(name, base, namelen);
> +       /*
> +        * Lookup of upper is with null d->fh.
> +        * Lookup of lower is either by_fh with non-null d->fh
> +        * or by_path with null d->fh.
> +        */
> +       if (d->fh)
> +               this = ovl_lookup_fh(mnt, d->fh);
> +       else
> +               this = lookup_one_len_unlocked(name, base, namelen);
>         if (IS_ERR(this)) {
>                 err = PTR_ERR(this);
>                 this = NULL;
>                 if (err == -ENOENT || err == -ENAMETOOLONG)
>                         goto out;
> +               if (d->fh && err == -ESTALE)
> +                       goto out;
>                 goto out_err;
>         }
> +
> +       /* If found by file handle - don't follow that handle again */
> +       kfree(d->fh);
> +       d->fh = NULL;
> +
>         if (!this->d_inode)
>                 goto put_and_out;
>
> @@ -135,9 +235,18 @@ static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
>                 d->stop = d->opaque = true;
>                 goto out;
>         }
> -       err = ovl_check_redirect(this, d, prelen, post);
> -       if (err)
> -               goto out_err;
> +       if (d->last)
> +               goto out;
> +       if (d->by_path) {
> +               err = ovl_check_redirect(this, d, prelen, post);
> +               if (err)
> +                       goto out_err;
> +       }
> +       if (d->by_fh) {
> +               err = ovl_check_redirect_fh(this, d);
> +               if (err)
> +                       goto out_err;
> +       }
>  out:
>         *ret = this;
>         return 0;
> @@ -152,6 +261,12 @@ static int ovl_lookup_single(struct dentry *base, struct ovl_lookup_data *d,
>         return err;
>  }
>
> +static int ovl_lookup_layer_fh(struct path *path, struct ovl_lookup_data *d,
> +                              struct dentry **ret)
> +{
> +       return ovl_lookup_single(path->dentry, d, "", 0, 0, "", path->mnt, ret);
> +}
> +
>  static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
>                             struct dentry **ret)
>  {
> @@ -162,7 +277,7 @@ static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
>
>         if (d->name.name[0] != '/')
>                 return ovl_lookup_single(base, d, d->name.name, d->name.len,
> -                                        0, "", ret);
> +                                        0, "", NULL, ret);
>
>         while (!IS_ERR_OR_NULL(base) && d_can_lookup(base)) {
>                 const char *s = d->name.name + d->name.len - rem;
> @@ -175,7 +290,7 @@ static int ovl_lookup_layer(struct dentry *base, struct ovl_lookup_data *d,
>                         return -EIO;
>
>                 err = ovl_lookup_single(base, d, s, thislen,
> -                                       d->name.len - rem, next, &base);
> +                                       d->name.len - rem, next, NULL, &base);
>                 dput(dentry);
>                 if (err)
>                         return err;
> @@ -220,6 +335,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>         const struct cred *old_cred;
>         struct ovl_fs *ofs = dentry->d_sb->s_fs_info;
>         struct ovl_entry *poe = dentry->d_parent->d_fsdata;
> +       struct ovl_entry *roe = dentry->d_sb->s_root->d_fsdata;
>         struct path *stack = NULL;
>         struct dentry *upperdir, *upperdentry = NULL;
>         unsigned int ctr = 0;
> @@ -235,7 +351,10 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                 .opaque = false,
>                 .stop = false,
>                 .last = !poe->numlower,
> +               .by_path = true,
>                 .redirect = NULL,
> +               .by_fh = true,
> +               .fh = NULL,
>         };
>
>         if (dentry->d_name.len > ofs->namelen)
> @@ -259,13 +378,23 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                         if (!upperredirect)
>                                 goto out_put_upper;
>                         if (d.redirect[0] == '/')
> -                               poe = dentry->d_sb->s_root->d_fsdata;
> +                               poe = roe;
>                 }
>                 if (d.opaque)
>                         type |= __OVL_PATH_OPAQUE;
>         }
>
> -       if (!d.stop && poe->numlower) {
> +       /*
> +        * For now we only support lower by fh in single layer, because
> +        * fallback from lookup by fh to lookup by path in mid layers for
> +        * merge directory is not yet implemented.
> +        */
> +       if (!ofs->redirect_fh || ofs->numlower > 1) {
> +               kfree(d.fh);
> +               d.fh = NULL;
> +       }
> +
> +       if (!d.stop && (poe->numlower || d.fh)) {
>                 err = -ENOMEM;
>                 stack = kcalloc(ofs->numlower, sizeof(struct path),
>                                 GFP_TEMPORARY);
> @@ -273,6 +402,35 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                         goto out_put_upper;
>         }
>
> +       /* Try to lookup lower layers by file handle */
> +       d.by_path = false;
> +       for (i = 0; !d.stop && d.fh && i < roe->numlower; i++) {
> +               struct path lowerpath = poe->lowerstack[i];
> +
> +               d.last = i == poe->numlower - 1;
> +               err = ovl_lookup_layer_fh(&lowerpath, &d, &this);
> +               if (err)
> +                       goto out_put;
> +
> +               if (!this)
> +                       continue;
> +
> +               stack[ctr].dentry = this;
> +               stack[ctr].mnt = lowerpath.mnt;
> +               ctr++;
> +               /*
> +                * Found by fh - won't lookup by path.
> +                * TODO: set d.redirect to dentry_path(this),
> +                *       so lookup can continue by path.
> +                */
> +               d.stop = true;
> +       }
> +
> +       /* Fallback to lookup lower layers by path */
> +       d.by_path = true;
> +       d.by_fh = false;
> +       kfree(d.fh);
> +       d.fh = NULL;
>         for (i = 0; !d.stop && i < poe->numlower; i++) {
>                 struct path lowerpath = poe->lowerstack[i];
>
> @@ -291,10 +449,8 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>                 if (d.stop)
>                         break;
>
> -               if (d.redirect &&
> -                   d.redirect[0] == '/' &&
> -                   poe != dentry->d_sb->s_root->d_fsdata) {
> -                       poe = dentry->d_sb->s_root->d_fsdata;
> +               if (d.redirect && d.redirect[0] == '/' && poe != roe) {
> +                       poe = roe;
>
>                         /* Find the current layer on the root dentry */
>                         for (i = 0; i < poe->numlower; i++)
> @@ -354,6 +510,7 @@ struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
>         dput(upperdentry);
>         kfree(upperredirect);
>  out:
> +       kfree(d.fh);
>         kfree(d.redirect);
>         revert_creds(old_cred);
>         return ERR_PTR(err);
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index c3cfbc5..08002ce 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -190,6 +190,7 @@ const char *ovl_dentry_get_redirect(struct dentry *dentry);
>  void ovl_dentry_set_redirect(struct dentry *dentry, const char *redirect);
>  bool ovl_redirect_fh(struct super_block *sb);
>  void ovl_clear_redirect_fh(struct super_block *sb);
> +bool ovl_redirect_fh_ok(const char *redirect, size_t size);
>  void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry);
>  void ovl_inode_init(struct inode *inode, struct inode *realinode,
>                     bool is_upper);
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index b3bc117..dba9753 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -254,6 +254,20 @@ void ovl_clear_redirect_fh(struct super_block *sb)
>         ofs->redirect_fh = false;
>  }
>
> +bool ovl_redirect_fh_ok(const char *redirect, size_t size)
> +{
> +       struct ovl_fh *fh = (void *)redirect;
> +
> +       if (size < sizeof(struct ovl_fh) || size < fh->len)
> +               return false;
> +
> +       if (fh->version > OVL_FH_VERSION ||
> +           fh->magic != OVL_FH_MAGIC)
> +               return false;
> +
> +       return true;
> +}
> +
>  void ovl_dentry_update(struct dentry *dentry, struct dentry *upperdentry)
>  {
>         struct ovl_entry *oe = dentry->d_fsdata;
> --
> 2.7.4
>

Amir Goldstein April 25, 2017, 5:41 p.m. UTC | #3

On Tue, Apr 25, 2017 at 6:13 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Mon, Apr 24, 2017 at 11:14 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> When overlay.fh xattr is found in a directory inode, instead of lookup
>> of the dentry in next lower layer by name, first try to get it by calling
>> exportfs_decode_fh().
>>
>> On failure to lookup by file handle to lower layer, fall back to lookup
>> by name with or without path redirect.
>>
>> For now we only support following by file handle from upper if there is a
>> single lower layer, because fallback from lookup by file hande to lookup
>> by path in mid layers is not yet implemented.
>>
>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>> ---
>>  fs/overlayfs/namei.c     | 185 +++++++++++++++++++++++++++++++++++++++++++----
>>  fs/overlayfs/overlayfs.h |   1 +
>>  fs/overlayfs/util.c      |  14 ++++
>>  3 files changed, 186 insertions(+), 14 deletions(-)
>>
[...]
>>
>> +/* Check if p1 is connected with a chain of hashed dentries to p2 */
>> +static bool ovl_is_lookable(struct dentry *p1, struct dentry *p2)
>> +{
>> +       struct dentry *p;
>> +
>> +       for (p = p2; !IS_ROOT(p); p = p->d_parent) {
>> +               if (d_unhashed(p))
>> +                       return false;
>> +               if (p->d_parent == p1)
>> +                       return true;
>> +       }
>> +       return false;
>> +}
>
> Walking the dentry tree without RCU protection is dangerous and broken.
>

I wonder if is_subdir() would be correct here?
Or I could just follow its lead to implement the parent walk correctly.
I did want to verify that the found dentry is not only 'connected' to
root, but also 'lookable', because I don't want to find a deleted file
when looking in lower layers.
Maybe that was too much and in any case, I could just verify that
the decoded dentry itself is hashed.

> I'm also wondering if there's a better way to find the layer

The purpose of this test is not only to find the layer, but
also to verify that the found inode is linked under the layer root.
I think that decode_fh() will always be able to create a disconnected
dentry if decoding an inode that is on the same sb as the layer where
fh was encoded. I'm pretty sure this was what I found in my initial
tests which made me write the broken ovl_is_lookable().

> (e.g. store the layer index in the handle as well).
>

But the layer index is a volatile number that can change.
I would like to be able to find by fh also when more layers are added
to the stack.

The only thing I can think of is to store sb_uuid+layer_root_fh+lower_fh.
At mount, we build a hash of the lower sb_uuid (save same_lower_uuid
for now).
At lookup, we first find lower_sb by uuid (verify same_lower_uuid for now),
then decode lower_root by root_fh, then find lower_mnt by lower_root,
then decode lower_fh with lower_mnt.

Sound reasonable?

Amir Goldstein April 25, 2017, 7:11 p.m. UTC | #4

On Tue, Apr 25, 2017 at 8:41 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Tue, Apr 25, 2017 at 6:13 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Mon, Apr 24, 2017 at 11:14 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> When overlay.fh xattr is found in a directory inode, instead of lookup
>>> of the dentry in next lower layer by name, first try to get it by calling
>>> exportfs_decode_fh().
>>>
>>> On failure to lookup by file handle to lower layer, fall back to lookup
>>> by name with or without path redirect.
>>>
>>> For now we only support following by file handle from upper if there is a
>>> single lower layer, because fallback from lookup by file hande to lookup
>>> by path in mid layers is not yet implemented.
>>>
>>> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
>>> ---
>>>  fs/overlayfs/namei.c     | 185 +++++++++++++++++++++++++++++++++++++++++++----
>>>  fs/overlayfs/overlayfs.h |   1 +
>>>  fs/overlayfs/util.c      |  14 ++++
>>>  3 files changed, 186 insertions(+), 14 deletions(-)
>>>
> [...]
>>>
>>> +/* Check if p1 is connected with a chain of hashed dentries to p2 */
>>> +static bool ovl_is_lookable(struct dentry *p1, struct dentry *p2)
>>> +{
>>> +       struct dentry *p;
>>> +
>>> +       for (p = p2; !IS_ROOT(p); p = p->d_parent) {
>>> +               if (d_unhashed(p))
>>> +                       return false;
>>> +               if (p->d_parent == p1)
>>> +                       return true;
>>> +       }
>>> +       return false;
>>> +}
>>
>> Walking the dentry tree without RCU protection is dangerous and broken.
>>
>
> I wonder if is_subdir() would be correct here?
> Or I could just follow its lead to implement the parent walk correctly.
> I did want to verify that the found dentry is not only 'connected' to
> root, but also 'lookable', because I don't want to find a deleted file
> when looking in lower layers.
> Maybe that was too much and in any case, I could just verify that
> the decoded dentry itself is hashed.
>
>> I'm also wondering if there's a better way to find the layer
>
> The purpose of this test is not only to find the layer, but
> also to verify that the found inode is linked under the layer root.
> I think that decode_fh() will always be able to create a disconnected
> dentry if decoding an inode that is on the same sb as the layer where
> fh was encoded. I'm pretty sure this was what I found in my initial
> tests which made me write the broken ovl_is_lookable().
>
>> (e.g. store the layer index in the handle as well).
>>
>
> But the layer index is a volatile number that can change.
> I would like to be able to find by fh also when more layers are added
> to the stack.
>
> The only thing I can think of is to store sb_uuid+layer_root_fh+lower_fh.
> At mount, we build a hash of the lower sb_uuid (save same_lower_uuid
> for now).
> At lookup, we first find lower_sb by uuid (verify same_lower_uuid for now),
> then decode lower_root by root_fh, then find lower_mnt by lower_root,
> then decode lower_fh with lower_mnt.
>
> Sound reasonable?

Or maybe like this:

At mount time either set or verify the xattr in upper layer root inode:
overlay.root.$i [i:=0..numlower-1] - ovl_root_id of lower layer i
ovl_root_id includes for each layer:
- sb uuid
- fh of root inode

If mount was able to set or verify that all ovl_root_id[i] match their
respective lower layer sb and root inode, then redirect_fh can be enabled,
otherwise it is disabled.

With redirect_fh enabled, it is safe to lookup by the lower layer index,
root fh and lower inode fh.
With redirect_fh enabled, it is safe to store handles on copy up along
with lower layer index and root fh.

A lower layer can be used and reused by any number of overlay mounts
at different layer index.

An upper layer can be reused in an overlay mount with either copied lower
layers or with different lower stack and will have redirect_fh disabled.

An upper layer can be rotated as lower layer, because file handles are
never followed from a lower layer. Constant inode numbers code does
not need to follow by fh from lower layers.

With this scheme, there is no need to store nor match sb_uuid a for
every single copy up and every single lookup by fh.
There is no need to 'lookup' the layer, just use the index and compare
the root_fh.

It is quite safe from following handles to wrong fs, except if user copies
parts of an upper layer (without the layer root), but doing something like
that is equivalent to a user that takes down an NFS server, brings up
a server with the same network address and exports the same share
name from a different filesystem.

Maybe the chances are more slim, but the same interesting things could
happen.

Amir.

Miklos Szeredi April 26, 2017, 9:06 a.m. UTC | #5

On Tue, Apr 25, 2017 at 9:11 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> Or maybe like this:
>
> At mount time either set or verify the xattr in upper layer root inode:
> overlay.root.$i [i:=0..numlower-1] - ovl_root_id of lower layer i
> ovl_root_id includes for each layer:
> - sb uuid
> - fh of root inode
>
> If mount was able to set or verify that all ovl_root_id[i] match their
> respective lower layer sb and root inode, then redirect_fh can be enabled,
> otherwise it is disabled.
>
> With redirect_fh enabled, it is safe to lookup by the lower layer index,
> root fh and lower inode fh.
> With redirect_fh enabled, it is safe to store handles on copy up along
> with lower layer index and root fh.
>
> A lower layer can be used and reused by any number of overlay mounts
> at different layer index.
>
> An upper layer can be reused in an overlay mount with either copied lower
> layers or with different lower stack and will have redirect_fh disabled.
>
> An upper layer can be rotated as lower layer, because file handles are
> never followed from a lower layer. Constant inode numbers code does
> not need to follow by fh from lower layers.
>
> With this scheme, there is no need to store nor match sb_uuid a for
> every single copy up and every single lookup by fh.
> There is no need to 'lookup' the layer, just use the index and compare
> the root_fh.
>
> It is quite safe from following handles to wrong fs, except if user copies
> parts of an upper layer (without the layer root), but doing something like
> that is equivalent to a user that takes down an NFS server, brings up
> a server with the same network address and exports the same share
> name from a different filesystem.
>
> Maybe the chances are more slim, but the same interesting things could
> happen.

Checking UUID would be O(1) and very fast, so I wouldn't worry about
that.  Using is_subdir() to verify the layer is O(depth) but still
very fast.  I don't think that's an issue either.

Using is_subdir() to find the layer would be O(depth*numlower).  But
we can optimize that if we really want to:  have a function that
returns the first ancestor of a dentry that is a layer root (marked
with a flag).  Then we just need to map that dentry to the layer,
which can be done with a hash table or whatever.

And anyway uncached lookup will be slow, and we are only doing this
for copied up files and directories.  So I don't think we need to
worry too much about optimizing this.

So for now lets just go with the original patch but replace
ovl_is_lookable() with is_subdir().

Thanks,
Miklos

Amir Goldstein April 26, 2017, 9:40 a.m. UTC | #6

On Wed, Apr 26, 2017 at 12:06 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Tue, Apr 25, 2017 at 9:11 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> Or maybe like this:
>>
>> At mount time either set or verify the xattr in upper layer root inode:
>> overlay.root.$i [i:=0..numlower-1] - ovl_root_id of lower layer i
>> ovl_root_id includes for each layer:
>> - sb uuid
>> - fh of root inode
>>
>> If mount was able to set or verify that all ovl_root_id[i] match their
>> respective lower layer sb and root inode, then redirect_fh can be enabled,
>> otherwise it is disabled.
>>
>> With redirect_fh enabled, it is safe to lookup by the lower layer index,
>> root fh and lower inode fh.
>> With redirect_fh enabled, it is safe to store handles on copy up along
>> with lower layer index and root fh.
>>
>> A lower layer can be used and reused by any number of overlay mounts
>> at different layer index.
>>
>> An upper layer can be reused in an overlay mount with either copied lower
>> layers or with different lower stack and will have redirect_fh disabled.
>>
>> An upper layer can be rotated as lower layer, because file handles are
>> never followed from a lower layer. Constant inode numbers code does
>> not need to follow by fh from lower layers.
>>
>> With this scheme, there is no need to store nor match sb_uuid a for
>> every single copy up and every single lookup by fh.
>> There is no need to 'lookup' the layer, just use the index and compare
>> the root_fh.
>>
>> It is quite safe from following handles to wrong fs, except if user copies
>> parts of an upper layer (without the layer root), but doing something like
>> that is equivalent to a user that takes down an NFS server, brings up
>> a server with the same network address and exports the same share
>> name from a different filesystem.
>>
>> Maybe the chances are more slim, but the same interesting things could
>> happen.
>
> Checking UUID would be O(1) and very fast, so I wouldn't worry about
> that.  Using is_subdir() to verify the layer is O(depth) but still
> very fast.  I don't think that's an issue either.
>
> Using is_subdir() to find the layer would be O(depth*numlower).  But
> we can optimize that if we really want to:  have a function that
> returns the first ancestor of a dentry that is a layer root (marked
> with a flag).  Then we just need to map that dentry to the layer,
> which can be done with a hash table or whatever.
>
> And anyway uncached lookup will be slow, and we are only doing this
> for copied up files and directories.  So I don't think we need to
> worry too much about optimizing this.
>
> So for now lets just go with the original patch but replace
> ovl_is_lookable() with is_subdir().
>

Just to see that I understand you correctly.

I am now working on storing the following:

/*
 * The tuple origin.{fh,layer,uuid} is a universal unique identifier
 * for a copy up origin, where:
 * origin.fh    - exported file handle of the lower file
 * origin.root - exported file handle of the lower layer root
 * origin.uuid  - uuid of the lower filesystem
 *
 * origin.{fh,root} are stored in format of a variable length binary blob
 * with struct ovl_fh header (total blob size up to 20 bytes).
 * uuid is stored in raw format (16 bytes) as published by sb->s_uuid.
 */

I intend to implement lookup as follows:
- compare(origin.uuid, same_lower_sb->s_uuid)
# layer root dentries cannot be DCACHE_DISCONNECTED, so
# exportfs_decode_fh ignores mnt arg and returns the cached dentry
- root = exportfs_decode_fh(lowerstack[0].mnt, origin.root)
- find layer where lowerstack[layer].dentry == root
- this = exportfs_decode_fh(lowerstack[layer].mnt, origin.fh)

is_subdir() is NOT needed for decoding the layer root
is_subdir() is optional for decoding the lower file, because
it is not needed to identify the layer

The lookup is O(numlower)+O(depth) where O(depth) is
just as precousion.

Amir.

Miklos Szeredi April 26, 2017, 9:55 a.m. UTC | #7

On Wed, Apr 26, 2017 at 11:40 AM, Amir Goldstein <amir73il@gmail.com> wrote:

> Just to see that I understand you correctly.
>
> I am now working on storing the following:
>
> /*
>  * The tuple origin.{fh,layer,uuid} is a universal unique identifier
>  * for a copy up origin, where:
>  * origin.fh    - exported file handle of the lower file
>  * origin.root - exported file handle of the lower layer root
>  * origin.uuid  - uuid of the lower filesystem

I wouldn't even store origin.root.

>  *
>  * origin.{fh,root} are stored in format of a variable length binary blob
>  * with struct ovl_fh header (total blob size up to 20 bytes).
>  * uuid is stored in raw format (16 bytes) as published by sb->s_uuid.
>  */
>
> I intend to implement lookup as follows:
> - compare(origin.uuid, same_lower_sb->s_uuid)
> # layer root dentries cannot be DCACHE_DISCONNECTED, so
> # exportfs_decode_fh ignores mnt arg and returns the cached dentry
> - root = exportfs_decode_fh(lowerstack[0].mnt, origin.root)
> - find layer where lowerstack[layer].dentry == root
> - this = exportfs_decode_fh(lowerstack[layer].mnt, origin.fh)
>
> is_subdir() is NOT needed for decoding the layer root
> is_subdir() is optional for decoding the lower file, because
> it is not needed to identify the layer

Hmm, we can just force exportfs_decode_fh() to return a connected
dentry (return false from *acceptable() if the dentry is disconnected)
before going on to iterate the layers to see which one contains it.

Thanks,
Miklos

Amir Goldstein April 26, 2017, 10:17 a.m. UTC | #8

On Wed, Apr 26, 2017 at 12:55 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Wed, Apr 26, 2017 at 11:40 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> Just to see that I understand you correctly.
>>
>> I am now working on storing the following:
>>
>> /*
>>  * The tuple origin.{fh,layer,uuid} is a universal unique identifier
>>  * for a copy up origin, where:
>>  * origin.fh    - exported file handle of the lower file
>>  * origin.root - exported file handle of the lower layer root
>>  * origin.uuid  - uuid of the lower filesystem
>
> I wouldn't even store origin.root.
>
>>  *
>>  * origin.{fh,root} are stored in format of a variable length binary blob
>>  * with struct ovl_fh header (total blob size up to 20 bytes).
>>  * uuid is stored in raw format (16 bytes) as published by sb->s_uuid.
>>  */
>>
>> I intend to implement lookup as follows:
>> - compare(origin.uuid, same_lower_sb->s_uuid)
>> # layer root dentries cannot be DCACHE_DISCONNECTED, so
>> # exportfs_decode_fh ignores mnt arg and returns the cached dentry
>> - root = exportfs_decode_fh(lowerstack[0].mnt, origin.root)
>> - find layer where lowerstack[layer].dentry == root
>> - this = exportfs_decode_fh(lowerstack[layer].mnt, origin.fh)
>>
>> is_subdir() is NOT needed for decoding the layer root
>> is_subdir() is optional for decoding the lower file, because
>> it is not needed to identify the layer
>
> Hmm, we can just force exportfs_decode_fh() to return a connected
> dentry (return false from *acceptable() if the dentry is disconnected)
> before going on to iterate the layers to see which one contains it.
>

Hmm, this might work, but to quote from exportfs_decode_fh():
"It's not a directory.  Life is a little more complicated."

IIUC, 'connected' means 'connected to sb root', and not
'connected to mnt root', so in the optimal case where
all lower dentries are cached,  exportfs_decode_fh() will return
a connected dentry for every fh we give it regardless of the
mnt argument, so we will have to use is_subdir() to find the
right layer, which brings us back to O(numlower*depth)

With the extra cost of storing the deducible information origin.root,
we will have less complex and more efficient lookup code.

Let me try and implement it and see if I am right.
We can always discard origin.root from v4 if it turns
out to be unhelpful.

Amir.

Miklos Szeredi April 26, 2017, 12:15 p.m. UTC | #9

On Wed, Apr 26, 2017 at 12:17 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Apr 26, 2017 at 12:55 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Wed, Apr 26, 2017 at 11:40 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>
>>> Just to see that I understand you correctly.
>>>
>>> I am now working on storing the following:
>>>
>>> /*
>>>  * The tuple origin.{fh,layer,uuid} is a universal unique identifier
>>>  * for a copy up origin, where:
>>>  * origin.fh    - exported file handle of the lower file
>>>  * origin.root - exported file handle of the lower layer root
>>>  * origin.uuid  - uuid of the lower filesystem
>>
>> I wouldn't even store origin.root.
>>
>>>  *
>>>  * origin.{fh,root} are stored in format of a variable length binary blob
>>>  * with struct ovl_fh header (total blob size up to 20 bytes).
>>>  * uuid is stored in raw format (16 bytes) as published by sb->s_uuid.
>>>  */
>>>
>>> I intend to implement lookup as follows:
>>> - compare(origin.uuid, same_lower_sb->s_uuid)
>>> # layer root dentries cannot be DCACHE_DISCONNECTED, so
>>> # exportfs_decode_fh ignores mnt arg and returns the cached dentry
>>> - root = exportfs_decode_fh(lowerstack[0].mnt, origin.root)
>>> - find layer where lowerstack[layer].dentry == root
>>> - this = exportfs_decode_fh(lowerstack[layer].mnt, origin.fh)
>>>
>>> is_subdir() is NOT needed for decoding the layer root
>>> is_subdir() is optional for decoding the lower file, because
>>> it is not needed to identify the layer
>>
>> Hmm, we can just force exportfs_decode_fh() to return a connected
>> dentry (return false from *acceptable() if the dentry is disconnected)
>> before going on to iterate the layers to see which one contains it.
>>
>
> Hmm, this might work, but to quote from exportfs_decode_fh():
> "It's not a directory.  Life is a little more complicated."
>
> IIUC, 'connected' means 'connected to sb root', and not
> 'connected to mnt root', so in the optimal case where
> all lower dentries are cached,  exportfs_decode_fh() will return
> a connected dentry for every fh we give it regardless of the
> mnt argument, so we will have to use is_subdir() to find the
> right layer, which brings us back to O(numlower*depth)

It just means that we might have to make up an artificial mount which
has its root at the sb root to be able to decode the handle into a
connected one.

>
> With the extra cost of storing the deducible information origin.root,
> we will have less complex and more efficient lookup code.
>
> Let me try and implement it and see if I am right.
> We can always discard origin.root from v4 if it turns
> out to be unhelpful.

I don't have good feelings about storing the root fh just because we
don't special case the layer root anywhere yet, and I wouldn't want to
do that unless there's a good reason.

Thanks,
Miklos

Amir Goldstein April 26, 2017, 2:51 p.m. UTC | #10

On Wed, Apr 26, 2017 at 3:15 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Wed, Apr 26, 2017 at 12:17 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Wed, Apr 26, 2017 at 12:55 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>> On Wed, Apr 26, 2017 at 11:40 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>
>>>> Just to see that I understand you correctly.
>>>>
>>>> I am now working on storing the following:
>>>>
>>>> /*
>>>>  * The tuple origin.{fh,layer,uuid} is a universal unique identifier
>>>>  * for a copy up origin, where:
>>>>  * origin.fh    - exported file handle of the lower file
>>>>  * origin.root - exported file handle of the lower layer root
>>>>  * origin.uuid  - uuid of the lower filesystem
>>>
>>> I wouldn't even store origin.root.
>>>
>>>>  *
>>>>  * origin.{fh,root} are stored in format of a variable length binary blob
>>>>  * with struct ovl_fh header (total blob size up to 20 bytes).
>>>>  * uuid is stored in raw format (16 bytes) as published by sb->s_uuid.
>>>>  */
>>>>
>>>> I intend to implement lookup as follows:
>>>> - compare(origin.uuid, same_lower_sb->s_uuid)
>>>> # layer root dentries cannot be DCACHE_DISCONNECTED, so
>>>> # exportfs_decode_fh ignores mnt arg and returns the cached dentry
>>>> - root = exportfs_decode_fh(lowerstack[0].mnt, origin.root)
>>>> - find layer where lowerstack[layer].dentry == root
>>>> - this = exportfs_decode_fh(lowerstack[layer].mnt, origin.fh)
>>>>
>>>> is_subdir() is NOT needed for decoding the layer root
>>>> is_subdir() is optional for decoding the lower file, because
>>>> it is not needed to identify the layer
>>>
>>> Hmm, we can just force exportfs_decode_fh() to return a connected
>>> dentry (return false from *acceptable() if the dentry is disconnected)
>>> before going on to iterate the layers to see which one contains it.
>>>
>>
>> Hmm, this might work, but to quote from exportfs_decode_fh():
>> "It's not a directory.  Life is a little more complicated."
>>
>> IIUC, 'connected' means 'connected to sb root', and not
>> 'connected to mnt root', so in the optimal case where
>> all lower dentries are cached,  exportfs_decode_fh() will return
>> a connected dentry for every fh we give it regardless of the
>> mnt argument, so we will have to use is_subdir() to find the
>> right layer, which brings us back to O(numlower*depth)
>
> It just means that we might have to make up an artificial mount which
> has its root at the sb root to be able to decode the handle into a
> connected one.
>

I'm not sure I understand what this artificial mount buys us.

>>
>> With the extra cost of storing the deducible information origin.root,
>> we will have less complex and more efficient lookup code.
>>
>> Let me try and implement it and see if I am right.
>> We can always discard origin.root from v4 if it turns
>> out to be unhelpful.
>
> I don't have good feelings about storing the root fh just because we
> don't special case the layer root anywhere yet, and I wouldn't want to
> do that unless there's a good reason.
>

There are a few reasons for origin.root, not sure if they are good:
1. lookup is O(numlower+depth) instead of O(numlower*depth)
2. origin.uuid validates that we are still on the same sb
    origin.root validates that we are still using the same lower dirs
    and that files from old lower were not moved around to find themselves
    inside a different lower dir
3. hardlinks between layers (!!!) will still get to the right layer

I personally think that reason #1 is the important one, but I think we
disagree on the technical details of exportfs_decode_fh() and we
need to sort this out.

Here is my untested implementation of find layer by uuid/rootfh
with the relevant comments. Maybe it helps you point out what
I am missing or what you are missing:

/* Find lower layer index by layer root file handle and uuid */
static int ovl_find_layer_by_fh(struct dentry *dentry, struct
ovl_lookup_data *d)
{
        struct ovl_entry *roe = dentry->d_sb->s_root->d_fsdata;
        struct super_block *lower_sb = ovl_same_lower_sb(dentry->d_sb);
        struct dentry *this;
        int i;

        /*
         * For now, we only support lookup by fh for all lower layers on the
         * same sb.  Not all filesystems set sb->s_uuid.  For those who don't
         * this code will compare zeros, which at least ensures us that the
         * file handles are not crossing from filesystem with sb->s_uuid to
         * a filesystem without sb->s_uuid and vice versa.
         */
        if (!lower_sb || memcmp(lower_sb->s_uuid, &d->uuid, sizeof(d->uuid)))
                return -1;

        /*
         * Layer root dentries are pinned, there are no aliases for dirs, and
         * all lower layers are on the same sb.  If rootfh is correct,
         * exportfs_decode_fh() will find it in dcache and return the only
         * instance, regardless of the mnt argument and we can compare the
         * returned pointer with the pointers in lowerstack.
         */
        this = ovl_decode_fh(roe->lowerstack[0].mnt, d->rootfh, ovl_is_dir);
        if (IS_ERR(this))
                return -1;

        for (i = 0; i < roe->numlower; i++) {
                if (this == roe->lowerstack[i].dentry)
                        break;
        }

        dput(this);
        return i < roe->numlower ? i : -1;
}

Amir.

Amir Goldstein April 27, 2017, 6:27 a.m. UTC | #11

On Wed, Apr 26, 2017 at 5:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Apr 26, 2017 at 3:15 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Wed, Apr 26, 2017 at 12:17 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> On Wed, Apr 26, 2017 at 12:55 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>>>> On Wed, Apr 26, 2017 at 11:40 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>>>>
>>>>> Just to see that I understand you correctly.
>>>>>
>>>>> I am now working on storing the following:
>>>>>
>>>>> /*
>>>>>  * The tuple origin.{fh,layer,uuid} is a universal unique identifier
>>>>>  * for a copy up origin, where:
>>>>>  * origin.fh    - exported file handle of the lower file
>>>>>  * origin.root - exported file handle of the lower layer root
>>>>>  * origin.uuid  - uuid of the lower filesystem
>>>>
>>>> I wouldn't even store origin.root.
>>>>
>>>>>  *
>>>>>  * origin.{fh,root} are stored in format of a variable length binary blob
>>>>>  * with struct ovl_fh header (total blob size up to 20 bytes).
>>>>>  * uuid is stored in raw format (16 bytes) as published by sb->s_uuid.
>>>>>  */
>>>>>
>>>>> I intend to implement lookup as follows:
>>>>> - compare(origin.uuid, same_lower_sb->s_uuid)
>>>>> # layer root dentries cannot be DCACHE_DISCONNECTED, so
>>>>> # exportfs_decode_fh ignores mnt arg and returns the cached dentry
>>>>> - root = exportfs_decode_fh(lowerstack[0].mnt, origin.root)
>>>>> - find layer where lowerstack[layer].dentry == root
>>>>> - this = exportfs_decode_fh(lowerstack[layer].mnt, origin.fh)
>>>>>
>>>>> is_subdir() is NOT needed for decoding the layer root
>>>>> is_subdir() is optional for decoding the lower file, because
>>>>> it is not needed to identify the layer
>>>>
>>>> Hmm, we can just force exportfs_decode_fh() to return a connected
>>>> dentry (return false from *acceptable() if the dentry is disconnected)
>>>> before going on to iterate the layers to see which one contains it.
>>>>
>>>
>>> Hmm, this might work, but to quote from exportfs_decode_fh():
>>> "It's not a directory.  Life is a little more complicated."
>>>
>>> IIUC, 'connected' means 'connected to sb root', and not
>>> 'connected to mnt root', so in the optimal case where
>>> all lower dentries are cached,  exportfs_decode_fh() will return
>>> a connected dentry for every fh we give it regardless of the
>>> mnt argument, so we will have to use is_subdir() to find the
>>> right layer, which brings us back to O(numlower*depth)
>>
>> It just means that we might have to make up an artificial mount which
>> has its root at the sb root to be able to decode the handle into a
>> connected one.
>>
>
> I'm not sure I understand what this artificial mount buys us.

Let me try to explain the problem with a worse case, but not
improbable example:

Suppose I have an overlay with deep file at /a/b/c/.../z
Suppose the layers are at /old/{lower,upper} I copy them
over to /new/{lower,upper} and mount the overlay at new path.

Suppose that dcache is fully populated under /new and fully
evicted under /old.

When trying to decode the file handle for z, exportfs_decode_fh()
will call the file system to actually read all directories a..z from disk
in order to reconnect the dentry of old z all the way up to /old
and it will do that *before* calling the acceptable() callback.

Alternatively, if we first try to decode the file handle for /old/lower,
decoding will be very fast (most likely already in cache) and we will
not have to continue to decoding z and reading all directories a..z
from disk.

This is why and how I implemented lookup by origin.{root+fh}
in v3 patch set.

>
>>>
>>> With the extra cost of storing the deducible information origin.root,
>>> we will have less complex and more efficient lookup code.
>>>
>>> Let me try and implement it and see if I am right.
>>> We can always discard origin.root from v4 if it turns
>>> out to be unhelpful.
>>
>> I don't have good feelings about storing the root fh just because we
>> don't special case the layer root anywhere yet, and I wouldn't want to
>> do that unless there's a good reason.
>>

Wait, what do you mean by "we don't special case the layer root?"
Do you mean that we could mount an overlay at a subdir path?
i.e. in the example below, we could mount an overlay with
upperdir=/new/upper/a/b/c,lowerdir=/new/lower/a/b/c?

If this is what you mean then it is not true that we don't special case
layer root. We do it with path redirect relative to layer root.
If anything, we should be storing origin.root along with overlay.redirect
in order to verify that we are not redirecting into the wrong relative
path.

>
> There are a few reasons for origin.root, not sure if they are good:
> 1. lookup is O(numlower+depth) instead of O(numlower*depth)
> 2. origin.uuid validates that we are still on the same sb
>     origin.root validates that we are still using the same lower dirs
>     and that files from old lower were not moved around to find themselves
>     inside a different lower dir
> 3. hardlinks between layers (!!!) will still get to the right layer
>
> I personally think that reason #1 is the important one, but I think we
> disagree on the technical details of exportfs_decode_fh() and we
> need to sort this out.
>
> Here is my untested implementation of find layer by uuid/rootfh
> with the relevant comments. Maybe it helps you point out what
> I am missing or what you are missing:
>
> /* Find lower layer index by layer root file handle and uuid */
> static int ovl_find_layer_by_fh(struct dentry *dentry, struct
> ovl_lookup_data *d)
> {
>         struct ovl_entry *roe = dentry->d_sb->s_root->d_fsdata;
>         struct super_block *lower_sb = ovl_same_lower_sb(dentry->d_sb);
>         struct dentry *this;
>         int i;
>
>         /*
>          * For now, we only support lookup by fh for all lower layers on the
>          * same sb.  Not all filesystems set sb->s_uuid.  For those who don't
>          * this code will compare zeros, which at least ensures us that the
>          * file handles are not crossing from filesystem with sb->s_uuid to
>          * a filesystem without sb->s_uuid and vice versa.
>          */
>         if (!lower_sb || memcmp(lower_sb->s_uuid, &d->uuid, sizeof(d->uuid)))
>                 return -1;
>
>         /*
>          * Layer root dentries are pinned, there are no aliases for dirs, and
>          * all lower layers are on the same sb.  If rootfh is correct,
>          * exportfs_decode_fh() will find it in dcache and return the only
>          * instance, regardless of the mnt argument and we can compare the
>          * returned pointer with the pointers in lowerstack.
>          */
>         this = ovl_decode_fh(roe->lowerstack[0].mnt, d->rootfh, ovl_is_dir);
>         if (IS_ERR(this))
>                 return -1;
>
>         for (i = 0; i < roe->numlower; i++) {
>                 if (this == roe->lowerstack[i].dentry)
>                         break;
>         }
>
>         dput(this);
>         return i < roe->numlower ? i : -1;
> }
>
> Amir.

Miklos Szeredi April 27, 2017, 7:40 a.m. UTC | #12

On Wed, Apr 26, 2017 at 4:51 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Wed, Apr 26, 2017 at 3:15 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:

>> I don't have good feelings about storing the root fh just because we
>> don't special case the layer root anywhere yet, and I wouldn't want to
>> do that unless there's a good reason.
>>
>
> There are a few reasons for origin.root, not sure if they are good:
> 1. lookup is O(numlower+depth) instead of O(numlower*depth)

We can optimize to O(numlower+depth) even without origin.root.

> 2. origin.uuid validates that we are still on the same sb
>     origin.root validates that we are still using the same lower dirs
>     and that files from old lower were not moved around to find themselves
>     inside a different lower dir

Parent is encoded in the fh, so that makes it resistant to moving. See
the exportfs_get_name() trickery to get a non-dir connected.  It's
needed whether we have origin.root or not.  And yes, it's pretty
heavyweight.   Wondering if it's worth the trouble, since we are not
actually going to use the lower inode for anything else than getting
the inode number.  And then we could just store the inode number
instead of the fh, and be rid of this mess.

If file is moved to another layer by moving an ancestor directory then
we won't detect that.  Question is: do we care?  It's definitely in
the "you messed with lower dirs, you keep the pieces" territory.

> 3. hardlinks between layers (!!!) will still get to the right layer

Even without origin.root it should get the right layer, since we are
encoding the parent in the fh.

> I personally think that reason #1 is the important one, but I think we
> disagree on the technical details of exportfs_decode_fh() and we
> need to sort this out.
>
> Here is my untested implementation of find layer by uuid/rootfh
> with the relevant comments. Maybe it helps you point out what
> I am missing or what you are missing:

Yeah, it simplifies the implementation.  But implementation is
secondary to interface...

Thanks,
Miklos

Miklos Szeredi April 27, 2017, 7:48 a.m. UTC | #13

On Thu, Apr 27, 2017 at 8:27 AM, Amir Goldstein <amir73il@gmail.com> wrote:

> Let me try to explain the problem with a worse case, but not
> improbable example:
>
> Suppose I have an overlay with deep file at /a/b/c/.../z
> Suppose the layers are at /old/{lower,upper} I copy them
> over to /new/{lower,upper} and mount the overlay at new path.
>
> Suppose that dcache is fully populated under /new and fully
> evicted under /old.
>
> When trying to decode the file handle for z, exportfs_decode_fh()
> will call the file system to actually read all directories a..z from disk
> in order to reconnect the dentry of old z all the way up to /old
> and it will do that *before* calling the acceptable() callback.
>
> Alternatively, if we first try to decode the file handle for /old/lower,
> decoding will be very fast (most likely already in cache) and we will
> not have to continue to decoding z and reading all directories a..z
> from disk.

To answer my own question in the prev mail: we need to decode the fh
and not just blindly use the inum to prevent issues with
copied/mutilited/etc lower layers.

And yes, in the copied case decoding origin.root first would be a good
optimization that couldn't be done without it.

> Wait, what do you mean by "we don't special case the layer root?"
> Do you mean that we could mount an overlay at a subdir path?
> i.e. in the example below, we could mount an overlay with
> upperdir=/new/upper/a/b/c,lowerdir=/new/lower/a/b/c?
>
> If this is what you mean then it is not true that we don't special case
> layer root. We do it with path redirect relative to layer root.
> If anything, we should be storing origin.root along with overlay.redirect
> in order to verify that we are not redirecting into the wrong relative
> path.

Yeah, you're right, we are special casing layer root.

Amir Goldstein April 27, 2017, 9:22 a.m. UTC | #14

On Thu, Apr 27, 2017 at 10:48 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Apr 27, 2017 at 8:27 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> Let me try to explain the problem with a worse case, but not
>> improbable example:
>>
>> Suppose I have an overlay with deep file at /a/b/c/.../z
>> Suppose the layers are at /old/{lower,upper} I copy them
>> over to /new/{lower,upper} and mount the overlay at new path.
>>
>> Suppose that dcache is fully populated under /new and fully
>> evicted under /old.
>>
>> When trying to decode the file handle for z, exportfs_decode_fh()
>> will call the file system to actually read all directories a..z from disk
>> in order to reconnect the dentry of old z all the way up to /old
>> and it will do that *before* calling the acceptable() callback.
>>
>> Alternatively, if we first try to decode the file handle for /old/lower,
>> decoding will be very fast (most likely already in cache) and we will
>> not have to continue to decoding z and reading all directories a..z
>> from disk.
>
> To answer my own question in the prev mail: we need to decode the fh
> and not just blindly use the inum to prevent issues with
> copied/mutilited/etc lower layers.
>

I was going to refer you to this example when reading you question
in prev email. That's what we get for no read/write barriers in emails ;-)

> And yes, in the copied case decoding origin.root first would be a good
> optimization that couldn't be done without it.
>

Good, so we seem to have an agreement w.r.t. the lookup fh patch.

I've already applied a change to disable redirect_fh if lower s_uuid is
zeros and I verified that it works as expected by running the hard-link
constant inode test that relies on redirect_fh over xfs mounted with
-o nouuid.

I will be posting the enhanced xfstest for constant inodes later today.

Let me know when are are done reviewing the series, so I can rework it
with the binary blob change you requested.

Thanks,
Amir.

Miklos Szeredi April 27, 2017, 9:26 a.m. UTC | #15

On Thu, Apr 27, 2017 at 9:48 AM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Apr 27, 2017 at 8:27 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>
>> Let me try to explain the problem with a worse case, but not
>> improbable example:
>>
>> Suppose I have an overlay with deep file at /a/b/c/.../z
>> Suppose the layers are at /old/{lower,upper} I copy them
>> over to /new/{lower,upper} and mount the overlay at new path.
>>
>> Suppose that dcache is fully populated under /new and fully
>> evicted under /old.
>>
>> When trying to decode the file handle for z, exportfs_decode_fh()
>> will call the file system to actually read all directories a..z from disk
>> in order to reconnect the dentry of old z all the way up to /old
>> and it will do that *before* calling the acceptable() callback.
>>
>> Alternatively, if we first try to decode the file handle for /old/lower,
>> decoding will be very fast (most likely already in cache) and we will
>> not have to continue to decoding z and reading all directories a..z
>> from disk.
>
> To answer my own question in the prev mail: we need to decode the fh
> and not just blindly use the inum to prevent issues with
> copied/mutilited/etc lower layers.

Hmm, this is absurd.  Why are we going to all this trouble to find the
origin inode though decoding the file handle when this thing was meant
to be an *optimization*?  Without redirect, we can look up origin just
like we do for merge dirs.  Way faster than decoding a connected
dentry, which is going to result in a readdir of the parent directory
and whatnot.  The only thing we need is a bool "was this copied" flag.

For moved files, decoding the fh might be an optimization over walking
the redirect, but that depends on a various factors, and it might also
be a lot slower...  But it's needed for the snapshot case, right?

Am I missing something?

Thanks,
Miklos

Amir Goldstein April 27, 2017, 1:53 p.m. UTC | #16

[Adding back CC list after I unintentionally dropped it]

On Thu, Apr 27, 2017 at 4:11 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Apr 27, 2017 at 11:53 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Thu, Apr 27, 2017 at 12:26 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>>> Hmm, this is absurd.  Why are we going to all this trouble to find the
>>> origin inode though decoding the file handle when this thing was meant
>>> to be an *optimization*?  Without redirect, we can look up origin just
>>> like we do for merge dirs.  Way faster than decoding a connected
>>> dentry, which is going to result in a readdir of the parent directory
>>> and whatnot.  The only thing we need is a bool "was this copied" flag.
>>>
>>
>> Yes, that is what happens in non-same-lower-fs case.
>> (overlay.fh is a fancy 4 bytes binary boolean blob)
>>
>>> For moved files, decoding the fh might be an optimization over walking
>>> the redirect, but that depends on a various factors, and it might also
>>> be a lot slower...
>>
>> I think we can safely disable by_fh for non-dirs if redirect is not set.
>> redirect_fh will be a bit more efficient with large numlower, but maybe
>> that is less interesting.
>>
>>> But it's needed for the snapshot case, right?
>>
>> It sure is! but the redirect_fh infrastructure is also needed for the next
>> part of the work of NFS export and preserved hardlinks, although in
>> this case it may be used to follow forward and not backwards.
>>
>> It also makes overlayfs more robust to lower layer changes, so less
>> chance for bugs related to mangled lower fs.
>>
>>>
>>> Am I missing something?
>>>
>>
>> Not really. Without the context of this being the first part of a 2 parts
>> work, the amount of hustle that redirect_fh brings to this series should
>> have raised your eyebrows as it did now.
>>
>> I started with the 'stable inode' series for same-fs only case which
>> provides a lot more than just constant inodes.
>>
>> You commented that we should try harder to do constant inodes
>> for non-samefs case, so I yanked out the parts relevant only to
>> constant inodes and relaxed the samefs constrains.
>>
>> Redirect_fh was left in, because:
>> 1. It was always the basis the large work, so was easier to leave it
>>     like this, then to have patches that add it later.
>> 2. It has some merit for optimization and for some constant inode
>>     cases with hardlinks
>> 3. It serves the full work for this code to be reviewed and tested
>>     soon, which is what is happening now.
>>
>> That said, if you feel strongly about it, I could either:
>> - Yank it out and re-introduce in the next part
>> - Leave it in with redirect_fh disabled by default and enabled
>>   by mount option, so at least people could test it and see how
>>   it performs compared to redirect by path in different workloads.
>>
>> Thoughts?
>
> Maybe you misunderstood.  I wasn't saying we don't need
> overlay.origin. I'm saying that it doesn't make sense to use
> overlay.origin for lookup.
>
> Summing up what we know:
>
>  - for constant inode we need a bool flag indicating that this is a
> copied up file or directory,  with that flag, together with redirect
> for regular files, we can do constant inode for samefs and non-samefs
> case.  That bool flag can be the existence of the overlay.origin xattr
>

Agree.

>  - hardlinks: need to set overlay.redirect  when hardlink is created
> from a copied up file; similar to what we do on rename
>

Partly agree.
1. This is not atomic, because hardlink is not created in workdir.
2. Reverse mapping will take care of this anyway. Remember?
    there is an extra hardlink in workdir/inodes with has overlay.redirect
    set on first alias copy up

>  - hardlink un-breaking: need reverse mapping (from lower to overlay);
> not in the scope of this patchset
>

Agree.

>  - NFS export: need reverse mapping; not in the scope of this patchset
>

Agree.

>  - for the snapshotfs case we need a way to keep the overlay in sync
> with a changing lower layer.  It's impossible to atomically update
> overlay.redirect together with the location of the lower file;
> overlay.origin can fix that
>

Correct.

>  - for non-redirect case looking up by overlay.origin is almost surely
> a pessimization
>

Not sure about that.
For directories by fh and by name are probably on par -
At most one lookup of ".." compared to one lookup of d.name.

For non-dir there is a better way that is better than both (see below).

>  - lookup by overlay.origin can work if overlay.redirect would be too
> long to fit in the max xattr len
>

Sure.

>  - for the redirect case looking up by overlay.origin may be faster,
> but may be much slower; hard to determine which to use
>

Here I disagree.
I claim that is always faster to find the lower (disconneccted) dentry
by fh, much faster in fact. See explanation at the bottom.

>  - overlay.origin can be used to verify if the lower path looked up
> with overlay.redirect is indeed the same file/directory that was
> originally copied up.  Not sure if this is useful for anything else
> than the snapshotfs case.
>

Its true. Comparing file handles would be quite easy.

> So my conclusion is that unless we must (snapshotfs, overflow in
> overlay.redirect) we should not be using overlay.origin for looking up
> the lower file.
>

I can live with that. Comparing the lookup result to origin.fh can be
optional based on some 'strict' mount option and resort to lookup by
fh when compare fails.
But we need to reconsider in light of my new suggestion, because
I do believe that finding the lower inode of non-dir is always a win
by fh.

> Even for snapshotfs it might make sense to start with plain lookup
> (with out without overlay.redirect) using overlay.origin to verify and
> falling back to lookup by overlay.origin only on mismatch (and
> updating overlay.redirect).
>
> Do you disagree?
>

So I managed to confuse myself about the technical facts of decoding,
because I was used to dealing only with decoding of dir handles
(for snapshots) and just now added decoding of non-dir.

When decoding a directory handle, it is always being connected up to
root. It sounds harsh, but in fact I think it will always be faster or on par
with regular lookup by path, because:
When looking up backwards until the first connected ancestor, filesystem
always has to read the ".." entry.
When looking up forward by path, then filesystem need to read entries
from connected ancestor by name and that is most likely indexed only
worse then the ".." entry of the backwards lookup.

When decoding directories you also want to get a connected dentry and
verifying is_subdir() makes sense.

HOWEVER, and this is big thing that I missed, when decoding a non-dir
we DON'T need to get a connected dentry.
It's perfectly fine to get a disconnected alias and getting a disconnected
alias is always O(1) for the filesystem.
The only thing we really need from this alias is to know its inode number
(and to know that it is still valid).

So if we encode non-connectable fh for non-dir (like knfsd does by default)
then:
1. decoding them will always be faster then any other lookup method
2. we cannot verify is_subdir() - so what?

What's the worse thing that can happen if the decoded entry is not under
the layer anymore? We only use its inode number, and the only thing we
need to know is that it is unique within the lower layers inode namespace
and we don't need is_subdir() for that.

But I just realized something very very bad about non-samefs case.
We must use made up st_dev for lower layers, we can certainly no
longer use the real lower st_dev.
If we do, then we will have 2 files in 2 different overlay mounts,
who have the same lower inode but 2 different upper inodes with
different content and those 2 overlay files will have the same
st_dev/st_ino.

I just found that in my debian based  kvm-xfstests machine, diff
reports 2 broken hardlinks with different content as equal, because
they have the same st_dev/st_ino.

So in conclusion:
1. Encode non-connectable file handles to non-dir
2. Always try to lookup non-dir by fh first - it's O(1)
3. Non-samefs needs fake st_dev before reporting constant inodes
4. Broken hardlinks should NOT report same inode

Urgh this was long!

Do you agree with my analysis of decoding complexity and the conclusions?

Amir.

Miklos Szeredi April 27, 2017, 2:46 p.m. UTC | #17

On Thu, Apr 27, 2017 at 3:53 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Apr 27, 2017 at 4:11 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:

>>  - hardlinks: need to set overlay.redirect  when hardlink is created
>> from a copied up file; similar to what we do on rename
>>
>
> Partly agree.
> 1. This is not atomic, because hardlink is not created in workdir.

Doesn't matter; we can set overlay.redirect before we do the hardlink.
If the hardlink fails, we are left with the redirect, but that's not a
problem.

> 2. Reverse mapping will take care of this anyway. Remember?
>     there is an extra hardlink in workdir/inodes with has overlay.redirect
>     set on first alias copy up

Yeah, but I don't really see why we'd need to set overlay.redirect on
copy-up.  When reverse mapping (i.e. trying to reconstruct the overlay
dentry from the lower fh)  we'll just do the same thing as for normal
lookup: use the path from the upper root to the dentry and look up
each component in the lower layers (taking into account any
overlay.redirect encountered).

>>  - for non-redirect case looking up by overlay.origin is almost surely
>> a pessimization
>>
>
> Not sure about that.
> For directories by fh and by name are probably on par -
> At most one lookup of ".." compared to one lookup of d.name.
>
> For non-dir there is a better way that is better than both (see below).

Not that simple (see below).

> So I managed to confuse myself about the technical facts of decoding,
> because I was used to dealing only with decoding of dir handles
> (for snapshots) and just now added decoding of non-dir.
>
> When decoding a directory handle, it is always being connected up to
> root. It sounds harsh, but in fact I think it will always be faster or on par
> with regular lookup by path, because:
> When looking up backwards until the first connected ancestor, filesystem
> always has to read the ".." entry.
> When looking up forward by path, then filesystem need to read entries
> from connected ancestor by name and that is most likely indexed only
> worse then the ".." entry of the backwards lookup.

Problem is there's more going on than just lookup of "..".  In fact it
*must* entail the lookup of "name" as well, because that's the way the
dentry gets connected.  There's an even bigger snag: where do we get
the name?  There's a ->get_name() export op, but most fs don't define
it, and the default action is to iterate the parent dir and find the
one matching our inum.  There goes the performance...

That's why I'm saying it's almost certainly will be slower.  Exception
might be the cached case, but even there lookup by inum might be
slower than the super optimized cached path lookup of a few filenames.
Since we are looking up the overlay dentry, which isn't cached at this
point, so why would the lower ones be?

> When decoding directories you also want to get a connected dentry and
> verifying is_subdir() makes sense.
>
> HOWEVER, and this is big thing that I missed, when decoding a non-dir
> we DON'T need to get a connected dentry.
> It's perfectly fine to get a disconnected alias and getting a disconnected
> alias is always O(1) for the filesystem.

It's O(1) but so is a single component lookup (case without
overlay.redirect).  In the cold cache cache both will be slow, since
most of the time will be spent on getting the inode from disk.  In the
hot cache case, odds are the name lookup will win, since it's the more
optimized codepath...

> The only thing we really need from this alias is to know its inode number
> (and to know that it is still valid).
>
> So if we encode non-connectable fh for non-dir (like knfsd does by default)
> then:
> 1. decoding them will always be faster then any other lookup method
> 2. we cannot verify is_subdir() - so what?
>
> What's the worse thing that can happen if the decoded entry is not under
> the layer anymore? We only use its inode number, and the only thing we
> need to know is that it is unique within the lower layers inode namespace
> and we don't need is_subdir() for that.

Okay.

>
> But I just realized something very very bad about non-samefs case.
> We must use made up st_dev for lower layers, we can certainly no
> longer use the real lower st_dev.
> If we do, then we will have 2 files in 2 different overlay mounts,
> who have the same lower inode but 2 different upper inodes with
> different content and those 2 overlay files will have the same
> st_dev/st_ino.

Yeah, I told you about this issue a couple of mails back.

>
> I just found that in my debian based  kvm-xfstests machine, diff
> reports 2 broken hardlinks with different content as equal, because
> they have the same st_dev/st_ino.
>
> So in conclusion:
> 1. Encode non-connectable file handles to non-dir

Fine by me.

> 2. Always try to lookup non-dir by fh first - it's O(1)

Well... for simplicity sake, okay.  Probably not a big loss.

> 3. Non-samefs needs fake st_dev before reporting constant inodes

Yes.

> 4. Broken hardlinks should NOT report same inode

Yes.

Thanks,
Miklos

Amir Goldstein April 27, 2017, 4:08 p.m. UTC | #18

On Thu, Apr 27, 2017 at 5:46 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Thu, Apr 27, 2017 at 3:53 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Thu, Apr 27, 2017 at 4:11 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>>>  - hardlinks: need to set overlay.redirect  when hardlink is created
>>> from a copied up file; similar to what we do on rename
>>>
>>
>> Partly agree.
>> 1. This is not atomic, because hardlink is not created in workdir.
>
> Doesn't matter; we can set overlay.redirect before we do the hardlink.
> If the hardlink fails, we are left with the redirect, but that's not a
> problem.
>

Right...

>> 2. Reverse mapping will take care of this anyway. Remember?
>>     there is an extra hardlink in workdir/inodes with has overlay.redirect
>>     set on first alias copy up
>
> Yeah, but I don't really see why we'd need to set overlay.redirect on
> copy-up.  When reverse mapping (i.e. trying to reconstruct the overlay
> dentry from the lower fh)  we'll just do the same thing as for normal
> lookup: use the path from the upper root to the dentry and look up
> each component in the lower layers (taking into account any
> overlay.redirect encountered).

I was thinking we need overlay.redirect to find a lower alias by path
from any upper alias, so I knew we should have overlay.redirect in the upper
inode, but it's true that we can set it on the first ovl_link() and not at first
copy up time.


>
>>>  - for non-redirect case looking up by overlay.origin is almost surely
>>> a pessimization
>>>
>>
>> Not sure about that.
>> For directories by fh and by name are probably on par -
>> At most one lookup of ".." compared to one lookup of d.name.
>>
>> For non-dir there is a better way that is better than both (see below).
>
> Not that simple (see below).

Not that simple referring to directories (and I agree), but
not referring to non connected non-dir.


>
> Problem is there's more going on than just lookup of "..".  In fact it
> *must* entail the lookup of "name" as well, because that's the way the
> dentry gets connected.  There's an even bigger snag: where do we get
> the name?  There's a ->get_name() export op, but most fs don't define
> it, and the default action is to iterate the parent dir and find the
> one matching our inum.  There goes the performance...
>
> That's why I'm saying it's almost certainly will be slower.  Exception
> might be the cached case, but even there lookup by inum might be
> slower than the super optimized cached path lookup of a few filenames.
> Since we are looking up the overlay dentry, which isn't cached at this
> point, so why would the lower ones be?

Agree for directories, so we should not be looking directory by fh.
It's anyway hard to do for numlayers > 1.
I will see about comparing origin.fh with dir found by path - it may
be the way to go for snapshots.

>
>> When decoding directories you also want to get a connected dentry and
>> verifying is_subdir() makes sense.
>>
>> HOWEVER, and this is big thing that I missed, when decoding a non-dir
>> we DON'T need to get a connected dentry.
>> It's perfectly fine to get a disconnected alias and getting a disconnected
>> alias is always O(1) for the filesystem.
>
> It's O(1) but so is a single component lookup (case without
> overlay.redirect).  In the cold cache cache both will be slow, since
> most of the time will be spent on getting the inode from disk.  In the
> hot cache case, odds are the name lookup will win, since it's the more
> optimized codepath...
>

Correct.

To complete the picture, here is how better lookup by inode than
lookup by redirect path.

Lookup of inode should be always quite fast for filesystem, even with
cold indoe/dentry cache, inodes are easy to find by index, so worse case
is O(1 inode block read from disk at a a known location).

Compared to redirect by path of N elements this is much much better.
O(N synchronic reads of inode blocks and N directory blocks from disk)


>
>>
>> But I just realized something very very bad about non-samefs case.
>> We must use made up st_dev for lower layers, we can certainly no
>> longer use the real lower st_dev.
>> If we do, then we will have 2 files in 2 different overlay mounts,
>> who have the same lower inode but 2 different upper inodes with
>> different content and those 2 overlay files will have the same
>> st_dev/st_ino.
>
> Yeah, I told you about this issue a couple of mails back.
>

Yes you did. I guess I thought you were referring to not all lower
on same sb. Then need a unique st_dev per lower layer because
they don't share the same inode namespace.

I though that 'all lower on same fs' was ok to use the same_lower_sb
as st_dev, as is the case now for lower type entries, but it is not ok.


>>
>> I just found that in my debian based  kvm-xfstests machine, diff
>> reports 2 broken hardlinks with different content as equal, because
>> they have the same st_dev/st_ino.
>>
>> So in conclusion:
>> 1. Encode non-connectable file handles to non-dir
>
> Fine by me.
>
>> 2. Always try to lookup non-dir by fh first - it's O(1)
>
> Well... for simplicity sake, okay.  Probably not a big loss.
>
>> 3. Non-samefs needs fake st_dev before reporting constant inodes
>
> Yes.
>
>> 4. Broken hardlinks should NOT report same inode
>
> Yes.
>

So to list everything for v4 in one place:

5. store uuid together with lower fh inside struct ovl_fh (in overlay.origin)
There does not seem to be a reason to store root fh though for non-dir
and its not relevant for for lookup of dir for snapshot case either (single
lower layer case)

Thanks,
Amir.

[v2,05/11] ovl: lookup redirect by file handle

Commit Message

Comments

Patch