[v2] vfs: fix link vs. rename race

Message ID	20220221082002.508392-1-mszeredi@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Miklos Szeredi <mszeredi@redhat.com> To: Al Viro <viro@ZenIV.linux.org.uk> Cc: Xavier Roche <xavier.roche@algolia.com>, linux-fsdevel@vger.kernel.org Subject: [PATCH v2] vfs: fix link vs. rename race Date: Mon, 21 Feb 2022 09:20:02 +0100 Message-Id: <20220221082002.508392-1-mszeredi@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v2] vfs: fix link vs. rename race \| expand [v2] vfs: fix link vs. rename race

Miklos Szeredi Feb. 21, 2022, 8:20 a.m. UTC

There has been a longstanding race condition between rename(2) and link(2),
when those operations are done in parallel:

1. Moving a file to an existing target file (eg. mv file target)
2. Creating a link from the target file to a third file (eg. ln target
   link)

By the time vfs_link() locks the target inode, it might already be unlinked
by rename.  This results in vfs_link() returning -ENOENT in order to
prevent linking to already unlinked files.  This check was introduced in
v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for
deleted file").

This breaks apparent atomicity of rename(2), which is described in
standards and the man page:

    "If newpath already exists, it will be atomically replaced, so that
     there is no point at which another process attempting to access
     newpath will find it missing."

The simplest fix is to exclude renames for the complete link operation.

This patch introduces a global rw_semaphore that is locked for read in
rename and for write in link.  To prevent excessive contention, do not take
the lock in link on the first try.  If the source of the link was found to
be unlinked, then retry with the lock held.

Reuse the lock_rename()/unlock_rename() helpers for the rename part.  This
however needs special treatment for stacking fs (overlayfs, ecryptfs) to
prevent possible deadlocks.  Introduce [un]lock_rename_stacked() for this
purpose.

Reproducer can be found at:

  https://lore.kernel.org/all/20220216131814.GA2463301@xavier-xps/

Reported-by: Xavier Roche <xavier.roche@algolia.com>
Link: https://lore.kernel.org/all/20220214210708.GA2167841@xavier-xps/
Fixes: aae8a97d3ec3 ("fs: Don't allow to create hardlink for deleted file")
Tested-by: Xavier Roche <xavier.roche@algolia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
---
 fs/ecryptfs/inode.c    |  4 ++--
 fs/namei.c             | 35 ++++++++++++++++++++++++++++++++---
 fs/overlayfs/copy_up.c |  4 ++--
 fs/overlayfs/dir.c     | 12 ++++++------
 fs/overlayfs/util.c    |  4 ++--
 include/linux/namei.h  |  4 ++++
 6 files changed, 48 insertions(+), 15 deletions(-)

Xavier Roche Feb. 21, 2022, 8:56 a.m. UTC | #1

On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote:
> This patch introduces a global rw_semaphore that is locked for read in
> rename and for write in link.  To prevent excessive contention, do not take
> the lock in link on the first try.  If the source of the link was found to
> be unlinked, then retry with the lock held.

Tested-by: Xavier Roche <xavier.roche@algolia.com>

Revalidated patch (on a 5.16.5) using previous torture test;
- on ext4 filesystem
- on a cryptfs volume

Al Viro Sept. 13, 2022, 2:04 a.m. UTC | #2

On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote:

[digging through the old piles of mail]

Eyes-watering control flow in do_linkat() aside (it's bound to rot; too
much of it won't get any regression testing and it's convoluted enough
to break easily), the main problem I have with that is the DoS potential.

You have a system-wide lock, and if it's stuck you'll get every damn
rename(2) stuck as well.  Sure, having it taken only upon the race
with rename() (or unlink(), for that matter) make it harder to get
stuck with lock held, but that'll make the problem harder to reproduce
and debug...

Al Viro Sept. 13, 2022, 4:29 a.m. UTC | #3

On Tue, Sep 13, 2022 at 03:04:17AM +0100, Al Viro wrote:
> On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote:
> 
> [digging through the old piles of mail]
> 
> Eyes-watering control flow in do_linkat() aside (it's bound to rot; too
> much of it won't get any regression testing and it's convoluted enough
> to break easily), the main problem I have with that is the DoS potential.
> 
> You have a system-wide lock, and if it's stuck you'll get every damn
> rename(2) stuck as well.  Sure, having it taken only upon the race
> with rename() (or unlink(), for that matter) make it harder to get
> stuck with lock held, but that'll make the problem harder to reproduce
> and debug...

FWIW, how much trouble would we have if link(2) would do the following?

	find the parent of source
	lock it
	look the child up
	verify it's a non-directory
	bump child's i_nlink
		all failure exits past that point decrement child's i_nlink
	unlock the parent
	find the parent of destination
	lock it
	look the destination up
	call vfs_link
	decrement child's i_nlink - vfs_link has bumped it
	unlock the parent of destination

I do realize it can lead to leaked link count on a crash, obviously...

NeilBrown Sept. 13, 2022, 4:41 a.m. UTC | #4

On Mon, 21 Feb 2022, Miklos Szeredi wrote:
> There has been a longstanding race condition between rename(2) and link(2),
> when those operations are done in parallel:
> 
> 1. Moving a file to an existing target file (eg. mv file target)
> 2. Creating a link from the target file to a third file (eg. ln target
>    link)
> 
> By the time vfs_link() locks the target inode, it might already be unlinked
> by rename.  This results in vfs_link() returning -ENOENT in order to
> prevent linking to already unlinked files.  This check was introduced in
> v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for
> deleted file").
> 
> This breaks apparent atomicity of rename(2), which is described in
> standards and the man page:
> 
>     "If newpath already exists, it will be atomically replaced, so that
>      there is no point at which another process attempting to access
>      newpath will find it missing."
> 
> The simplest fix is to exclude renames for the complete link operation.

Alternately, lock the "from" directory as well as the "to" directory.
That would mean using lock_rename() and generally copying the structure
of do_renameat2() into do_linkat()

I wonder if you could get a similar race trying to create a file in
(empty directory) /tmp/foo while /tmp/bar was being renamed over it.

NeilBrown


> 
> This patch introduces a global rw_semaphore that is locked for read in
> rename and for write in link.  To prevent excessive contention, do not take
> the lock in link on the first try.  If the source of the link was found to
> be unlinked, then retry with the lock held.
> 
> Reuse the lock_rename()/unlock_rename() helpers for the rename part.  This
> however needs special treatment for stacking fs (overlayfs, ecryptfs) to
> prevent possible deadlocks.  Introduce [un]lock_rename_stacked() for this
> purpose.
> 
> Reproducer can be found at:
> 
>   https://lore.kernel.org/all/20220216131814.GA2463301@xavier-xps/
> 
> Reported-by: Xavier Roche <xavier.roche@algolia.com>
> Link: https://lore.kernel.org/all/20220214210708.GA2167841@xavier-xps/
> Fixes: aae8a97d3ec3 ("fs: Don't allow to create hardlink for deleted file")
> Tested-by: Xavier Roche <xavier.roche@algolia.com>
> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
> ---
>  fs/ecryptfs/inode.c    |  4 ++--
>  fs/namei.c             | 35 ++++++++++++++++++++++++++++++++---
>  fs/overlayfs/copy_up.c |  4 ++--
>  fs/overlayfs/dir.c     | 12 ++++++------
>  fs/overlayfs/util.c    |  4 ++--
>  include/linux/namei.h  |  4 ++++
>  6 files changed, 48 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
> index 16d50dface59..f5c37599bd40 100644
> --- a/fs/ecryptfs/inode.c
> +++ b/fs/ecryptfs/inode.c
> @@ -596,7 +596,7 @@ ecryptfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
>  
>  	target_inode = d_inode(new_dentry);
>  
> -	trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> +	trap = lock_rename_stacked(lower_old_dir_dentry, lower_new_dir_dentry);
>  	dget(lower_new_dentry);
>  	rc = -EINVAL;
>  	if (lower_old_dentry->d_parent != lower_old_dir_dentry)
> @@ -631,7 +631,7 @@ ecryptfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
>  		fsstack_copy_attr_all(old_dir, d_inode(lower_old_dir_dentry));
>  out_lock:
>  	dput(lower_new_dentry);
> -	unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry);
> +	unlock_rename_stacked(lower_old_dir_dentry, lower_new_dir_dentry);
>  	return rc;
>  }
>  
> diff --git a/fs/namei.c b/fs/namei.c
> index 3f1829b3ab5b..27e671c354a6 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -122,6 +122,8 @@
>   * PATH_MAX includes the nul terminator --RR.
>   */
>  
> +static DECLARE_RWSEM(link_rwsem);
> +
>  #define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
>  
>  struct filename *
> @@ -2954,10 +2956,11 @@ static inline int may_create(struct user_namespace *mnt_userns,
>  	return inode_permission(mnt_userns, dir, MAY_WRITE | MAY_EXEC);
>  }
>  
> +
>  /*
>   * p1 and p2 should be directories on the same fs.
>   */
> -struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
> +struct dentry *lock_rename_stacked(struct dentry *p1, struct dentry *p2)
>  {
>  	struct dentry *p;
>  
> @@ -2986,9 +2989,16 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
>  	inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2);
>  	return NULL;
>  }
> +EXPORT_SYMBOL(lock_rename_stacked);
> +
> +struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
> +{
> +	down_read(&link_rwsem);
> +	return lock_rename_stacked(p1, p2);
> +}
>  EXPORT_SYMBOL(lock_rename);
>  
> -void unlock_rename(struct dentry *p1, struct dentry *p2)
> +void unlock_rename_stacked(struct dentry *p1, struct dentry *p2)
>  {
>  	inode_unlock(p1->d_inode);
>  	if (p1 != p2) {
> @@ -2996,6 +3006,13 @@ void unlock_rename(struct dentry *p1, struct dentry *p2)
>  		mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
>  	}
>  }
> +EXPORT_SYMBOL(unlock_rename_stacked);
> +
> +void unlock_rename(struct dentry *p1, struct dentry *p2)
> +{
> +	unlock_rename_stacked(p1, p2);
> +	up_read(&link_rwsem);
> +}
>  EXPORT_SYMBOL(unlock_rename);
>  
>  /**
> @@ -4456,6 +4473,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
>  	struct path old_path, new_path;
>  	struct inode *delegated_inode = NULL;
>  	int how = 0;
> +	bool lock = false;
>  	int error;
>  
>  	if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) {
> @@ -4474,10 +4492,13 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
>  
>  	if (flags & AT_SYMLINK_FOLLOW)
>  		how |= LOOKUP_FOLLOW;
> +retry_lock:
> +	if (lock)
> +		down_write(&link_rwsem);
>  retry:
>  	error = filename_lookup(olddfd, old, how, &old_path, NULL);
>  	if (error)
> -		goto out_putnames;
> +		goto out_unlock_link;
>  
>  	new_dentry = filename_create(newdfd, new, &new_path,
>  					(how & LOOKUP_REVAL));
> @@ -4511,8 +4532,16 @@ int do_linkat(int olddfd, struct filename *old, int newdfd,
>  		how |= LOOKUP_REVAL;
>  		goto retry;
>  	}
> +	if (!lock && error == -ENOENT) {
> +		path_put(&old_path);
> +		lock = true;
> +		goto retry_lock;
> +	}
>  out_putpath:
>  	path_put(&old_path);
> +out_unlock_link:
> +	if (lock)
> +		up_write(&link_rwsem);
>  out_putnames:
>  	putname(old);
>  	putname(new);
> diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
> index e040970408d4..911c3cec43c2 100644
> --- a/fs/overlayfs/copy_up.c
> +++ b/fs/overlayfs/copy_up.c
> @@ -670,7 +670,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
>  
>  	/* workdir and destdir could be the same when copying up to indexdir */
>  	err = -EIO;
> -	if (lock_rename(c->workdir, c->destdir) != NULL)
> +	if (lock_rename_stacked(c->workdir, c->destdir) != NULL)
>  		goto unlock;
>  
>  	err = ovl_prep_cu_creds(c->dentry, &cc);
> @@ -711,7 +711,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c)
>  	if (S_ISDIR(inode->i_mode))
>  		ovl_set_flag(OVL_WHITEOUTS, inode);
>  unlock:
> -	unlock_rename(c->workdir, c->destdir);
> +	unlock_rename_stacked(c->workdir, c->destdir);
>  
>  	return err;
>  
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index f18490813170..fea397666174 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -416,7 +416,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
>  
>  	ovl_cleanup_whiteouts(upper, list);
>  	ovl_cleanup(wdir, upper);
> -	unlock_rename(workdir, upperdir);
> +	unlock_rename_stacked(workdir, upperdir);
>  
>  	/* dentry's upper doesn't match now, get rid of it */
>  	d_drop(dentry);
> @@ -427,7 +427,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry,
>  	ovl_cleanup(wdir, opaquedir);
>  	dput(opaquedir);
>  out_unlock:
> -	unlock_rename(workdir, upperdir);
> +	unlock_rename_stacked(workdir, upperdir);
>  out:
>  	return ERR_PTR(err);
>  }
> @@ -551,7 +551,7 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode,
>  out_dput:
>  	dput(upper);
>  out_unlock:
> -	unlock_rename(workdir, upperdir);
> +	unlock_rename_stacked(workdir, upperdir);
>  out:
>  	if (!hardlink) {
>  		posix_acl_release(acl);
> @@ -790,7 +790,7 @@ static int ovl_remove_and_whiteout(struct dentry *dentry,
>  out_dput_upper:
>  	dput(upper);
>  out_unlock:
> -	unlock_rename(workdir, upperdir);
> +	unlock_rename_stacked(workdir, upperdir);
>  out_dput:
>  	dput(opaquedir);
>  out:
> @@ -1187,7 +1187,7 @@ static int ovl_rename(struct user_namespace *mnt_userns, struct inode *olddir,
>  		}
>  	}
>  
> -	trap = lock_rename(new_upperdir, old_upperdir);
> +	trap = lock_rename_stacked(new_upperdir, old_upperdir);
>  
>  	olddentry = lookup_one_len(old->d_name.name, old_upperdir,
>  				   old->d_name.len);
> @@ -1281,7 +1281,7 @@ static int ovl_rename(struct user_namespace *mnt_userns, struct inode *olddir,
>  out_dput_old:
>  	dput(olddentry);
>  out_unlock:
> -	unlock_rename(new_upperdir, old_upperdir);
> +	unlock_rename_stacked(new_upperdir, old_upperdir);
>  out_revert_creds:
>  	revert_creds(old_cred);
>  	if (update_nlink)
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index f48284a2a896..9358282278b1 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -930,13 +930,13 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
>  		goto err;
>  
>  	/* Workdir should not be subdir of upperdir and vice versa */
> -	if (lock_rename(workdir, upperdir) != NULL)
> +	if (lock_rename_stacked(workdir, upperdir) != NULL)
>  		goto err_unlock;
>  
>  	return 0;
>  
>  err_unlock:
> -	unlock_rename(workdir, upperdir);
> +	unlock_rename_stacked(workdir, upperdir);
>  err:
>  	pr_err("failed to lock workdir+upperdir\n");
>  	return -EIO;
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index e89329bb3134..0a87bb0d56ce 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -77,6 +77,10 @@ extern int follow_up(struct path *);
>  extern struct dentry *lock_rename(struct dentry *, struct dentry *);
>  extern void unlock_rename(struct dentry *, struct dentry *);
>  
> +/* Special version of the above for stacking filesystems */
> +extern struct dentry *lock_rename_stacked(struct dentry *, struct dentry *);
> +extern void unlock_rename_stacked(struct dentry *, struct dentry *);
> +
>  extern int __must_check nd_jump_link(struct path *path);
>  
>  static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
> -- 
> 2.34.1
> 
>

Al Viro Sept. 13, 2022, 5:20 a.m. UTC | #5

On Tue, Sep 13, 2022 at 02:41:51PM +1000, NeilBrown wrote:
> On Mon, 21 Feb 2022, Miklos Szeredi wrote:
> > There has been a longstanding race condition between rename(2) and link(2),
> > when those operations are done in parallel:
> > 
> > 1. Moving a file to an existing target file (eg. mv file target)
> > 2. Creating a link from the target file to a third file (eg. ln target
> >    link)
> > 
> > By the time vfs_link() locks the target inode, it might already be unlinked
> > by rename.  This results in vfs_link() returning -ENOENT in order to
> > prevent linking to already unlinked files.  This check was introduced in
> > v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for
> > deleted file").
> > 
> > This breaks apparent atomicity of rename(2), which is described in
> > standards and the man page:
> > 
> >     "If newpath already exists, it will be atomically replaced, so that
> >      there is no point at which another process attempting to access
> >      newpath will find it missing."
> > 
> > The simplest fix is to exclude renames for the complete link operation.
> 
> Alternately, lock the "from" directory as well as the "to" directory.
> That would mean using lock_rename() and generally copying the structure
> of do_renameat2() into do_linkat()

Ever done cp -al?  Cross-directory renames are relatively rare; cross-directory
links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex
held a _lot_.

> I wonder if you could get a similar race trying to create a file in
> (empty directory) /tmp/foo while /tmp/bar was being renamed over it.

	Neil, no offense, but... if you have plans regarding changes in
directory locking, you might consider reading through the file called
Documentation/filesystems/directory-locking.rst

	Occasionally documentation is where one could expect to find it...

Al Viro Sept. 13, 2022, 5:40 a.m. UTC | #6

On Tue, Sep 13, 2022 at 06:20:10AM +0100, Al Viro wrote:

> > Alternately, lock the "from" directory as well as the "to" directory.
> > That would mean using lock_rename() and generally copying the structure
> > of do_renameat2() into do_linkat()
> 
> Ever done cp -al?  Cross-directory renames are relatively rare; cross-directory
> links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex
> held a _lot_.
> 
> > I wonder if you could get a similar race trying to create a file in
> > (empty directory) /tmp/foo while /tmp/bar was being renamed over it.
> 
> 	Neil, no offense, but... if you have plans regarding changes in
> directory locking, you might consider reading through the file called
> Documentation/filesystems/directory-locking.rst
> 
> 	Occasionally documentation is where one could expect to find it...

... and that "..." above should've been ";-)" - it was not intended as
a dig, especially since locking in that area has subtle and badly
underdocumented parts (in particular, anything related to fh_to_dentry(),
rules regarding the stability of ->d_name and ->d_parent, mount vs. dentry
invalidation and too many other things), but the basic stuff like that
is actually covered.

FWIW, the answer to your question is that the victim of overwriting
rename is held locked by caller of ->rename(); combined with the lock
held on directory by anyone who modifies it that prevents the race
you are asking about.

See
        if (!is_dir || (flags & RENAME_EXCHANGE))
                lock_two_nondirectories(source, target);
        else if (target)
                inode_lock(target);
in vfs_rename().

Amir Goldstein Sept. 13, 2022, 8:02 a.m. UTC | #7

On Tue, Sep 13, 2022 at 7:38 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Sep 13, 2022 at 03:04:17AM +0100, Al Viro wrote:
> > On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote:
> >
> > [digging through the old piles of mail]
> >
> > Eyes-watering control flow in do_linkat() aside (it's bound to rot; too
> > much of it won't get any regression testing and it's convoluted enough
> > to break easily), the main problem I have with that is the DoS potential.
> >
> > You have a system-wide lock, and if it's stuck you'll get every damn
> > rename(2) stuck as well.  Sure, having it taken only upon the race
> > with rename() (or unlink(), for that matter) make it harder to get
> > stuck with lock held, but that'll make the problem harder to reproduce
> > and debug...
>
> FWIW, how much trouble would we have if link(2) would do the following?
>
>         find the parent of source

Well, only if source is not AT_EMPTY_PATH

>         lock it
>         look the child up
>         verify it's a non-directory
>         bump child's i_nlink
>                 all failure exits past that point decrement child's i_nlink

No need to bump i_nlink.
Sufficient to set I_LINKABLE.
and clean it up on failure if i_nlink > 0.

Or we can move cleanup of I_LINKABLE to drop_nlink()/clean_nlink()

>         unlock the parent
>         find the parent of destination
>         lock it
>         look the destination up
>         call vfs_link
>         decrement child's i_nlink - vfs_link has bumped it
>         unlock the parent of destination
>
> I do realize it can lead to leaked link count on a crash, obviously...

No such problem with I_LINKABLE.

Thanks,
Amir.

Miklos Szeredi Sept. 13, 2022, 10:03 a.m. UTC | #8

On Tue, 13 Sept 2022 at 10:02, Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Sep 13, 2022 at 7:38 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Tue, Sep 13, 2022 at 03:04:17AM +0100, Al Viro wrote:
> > > On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote:
> > >
> > > [digging through the old piles of mail]
> > >
> > > Eyes-watering control flow in do_linkat() aside (it's bound to rot; too
> > > much of it won't get any regression testing and it's convoluted enough
> > > to break easily), the main problem I have with that is the DoS potential.
> > >
> > > You have a system-wide lock, and if it's stuck you'll get every damn
> > > rename(2) stuck as well.  Sure, having it taken only upon the race
> > > with rename() (or unlink(), for that matter) make it harder to get
> > > stuck with lock held, but that'll make the problem harder to reproduce
> > > and debug...
> >
> > FWIW, how much trouble would we have if link(2) would do the following?
> >
> >         find the parent of source
>
> Well, only if source is not AT_EMPTY_PATH

The problem is not AT_EMPTY_PATH, but a disconnected dentry gotten
through magic symlink.    Theoretically link(2) should work for those,
but I didn't try.

>
> >         lock it
> >         look the child up
> >         verify it's a non-directory
> >         bump child's i_nlink
> >                 all failure exits past that point decrement child's i_nlink
>
> No need to bump i_nlink.
> Sufficient to set I_LINKABLE.
> and clean it up on failure if i_nlink > 0.

Bumping i_nlink seems hackish enough, currently all i_nlink
modification is done by filesystem code.

Setting I_LINKABLE seems safe for filesystems that have tmpfile, not otherwise.

Thanks,
Miklos

NeilBrown Sept. 13, 2022, 11:52 p.m. UTC | #9

On Tue, 13 Sep 2022, Al Viro wrote:
> On Tue, Sep 13, 2022 at 02:41:51PM +1000, NeilBrown wrote:
> > On Mon, 21 Feb 2022, Miklos Szeredi wrote:
> > > There has been a longstanding race condition between rename(2) and link(2),
> > > when those operations are done in parallel:
> > > 
> > > 1. Moving a file to an existing target file (eg. mv file target)
> > > 2. Creating a link from the target file to a third file (eg. ln target
> > >    link)
> > > 
> > > By the time vfs_link() locks the target inode, it might already be unlinked
> > > by rename.  This results in vfs_link() returning -ENOENT in order to
> > > prevent linking to already unlinked files.  This check was introduced in
> > > v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for
> > > deleted file").
> > > 
> > > This breaks apparent atomicity of rename(2), which is described in
> > > standards and the man page:
> > > 
> > >     "If newpath already exists, it will be atomically replaced, so that
> > >      there is no point at which another process attempting to access
> > >      newpath will find it missing."
> > > 
> > > The simplest fix is to exclude renames for the complete link operation.
> > 
> > Alternately, lock the "from" directory as well as the "to" directory.
> > That would mean using lock_rename() and generally copying the structure
> > of do_renameat2() into do_linkat()
> 
> Ever done cp -al?  Cross-directory renames are relatively rare; cross-directory
> links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex
> held a _lot_.

As long as the approach is correct, it might be a good starting point
for optimisation.

We only need s_vfs_rename_mutex for link() so that we can reliably
determine any parent/child relationship to get correct lock ordering to
avoid deadlocks.  So if we use trylock on the second lock and succeed
then we don't need the mutex at all.
e.g.
 - lookup the parents of both paths.
 - lock the "to" directory.
 - if the "from" directory is the same, or if a trylock of the from
   directory succeeds, then we have the locks and can perform the
   last component lookup and perform the link without racing with
   rename. 
 - if the trylock fails, we drop the lock on "to" and use lock_rename().
   We drop the s_vfs_rename_mutex immediately after lock_rename()
   so after the vfs_link() we just unlock both parent directories.

That should avoid the mutex is most cases including "cp -al"

Holding the "from" parent locked means that NFS could safely access the
parent and basename, and send the lookup to the server to reduce the
risk of a rename on one client racing with a link on another client.
NFS doesn't guarantee that would still work (ops in a compound are not
atomic) but it could still help.  And the NFS server is *prevented* from
making the lookup and link atomic.

NeilBrown

Al Viro Sept. 14, 2022, 12:13 a.m. UTC | #10

On Wed, Sep 14, 2022 at 09:52:37AM +1000, NeilBrown wrote:
>  - lookup the parents of both paths.
>  - lock the "to" directory.
>  - if the "from" directory is the same, or if a trylock of the from
>    directory succeeds, then we have the locks and can perform the
>    last component lookup and perform the link without racing with
>    rename. 
>  - if the trylock fails, we drop the lock on "to" and use lock_rename().
>    We drop the s_vfs_rename_mutex immediately after lock_rename()
>    so after the vfs_link() we just unlock both parent directories.

Umm...  Care to put together an update of deadlock avoidance proof?
The one in D/f/directory-locking.rst, that is.  I'm not saying it's
not a usable approach - it might very well work...

BTW, one testcase worth profiling would be something like
for i in `seq 100`; do
	cp -al linux.git linux.git-$i &
done

NeilBrown Sept. 14, 2022, 12:14 a.m. UTC | #11

On Tue, 13 Sep 2022, Al Viro wrote:
> On Tue, Sep 13, 2022 at 06:20:10AM +0100, Al Viro wrote:
> 
> > > Alternately, lock the "from" directory as well as the "to" directory.
> > > That would mean using lock_rename() and generally copying the structure
> > > of do_renameat2() into do_linkat()
> > 
> > Ever done cp -al?  Cross-directory renames are relatively rare; cross-directory
> > links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex
> > held a _lot_.
> > 
> > > I wonder if you could get a similar race trying to create a file in
> > > (empty directory) /tmp/foo while /tmp/bar was being renamed over it.
> > 
> > 	Neil, no offense, but... if you have plans regarding changes in
> > directory locking, you might consider reading through the file called
> > Documentation/filesystems/directory-locking.rst
> > 
> > 	Occasionally documentation is where one could expect to find it...
> 
> ... and that "..." above should've been ";-)" - it was not intended as
> a dig, especially since locking in that area has subtle and badly
> underdocumented parts (in particular, anything related to fh_to_dentry(),
> rules regarding the stability of ->d_name and ->d_parent, mount vs. dentry
> invalidation and too many other things), but the basic stuff like that
> is actually covered.
> 
> FWIW, the answer to your question is that the victim of overwriting
> rename is held locked by caller of ->rename(); combined with the lock
> held on directory by anyone who modifies it that prevents the race
> you are asking about.
> 
> See
>         if (!is_dir || (flags & RENAME_EXCHANGE))
>                 lock_two_nondirectories(source, target);
>         else if (target)
>                 inode_lock(target);
> in vfs_rename().
> 

I don't think those locks address the race I was thinking of.

Suppose a /tmp/shared-dir is an empty directory and one thread runs
  rename("/tmp/tmp-dir", "/tmp/shared-dir")
while another thread runs
  open("/tmp/shared-dir/Afile", O_CREAT | O_WRONLY)

If the first wins the file is created in what was tmp-dir.
If the second wins, the rename fails because shared-dir isn't empty.

But they can race.  The open completes the lookup of /tmp/share-dir and
holds the dentry, but the rename succeeds with inode_lock(target) in the
code fragment you provided above before the open() can get the lock.
By the time open() does get the lock, the dentry it holds will be marked
S_DEAD and the __lookup_hash() will fail.
So the open() returns -ENOENT - unexpected.

Test code below, based on the test code for the link race.

I wonder if S_DEAD should result in -ESTALE rather than -ENOENT.
That would cause the various retry_estale() loops to retry the whole
operation.  I suspect we'd actually need more subtlety than just that
simple change, but it might be worth exploring.

NeilBrown


/* Linux rename vs. create race condition.
 * Rationale: both (1) moving a directory to an empty target and
 *   (2) creating a file in the target can race.
 * The create should always succeed, but there should always be
 * directory there, either old or new.
 * The rename might fail if the target isn't empty.
 * Sample file courtesy of Xavier Grand at Algolia
 * g++ -pthread createbug.c -o createbug
 */

#include <thread>
#include <unistd.h>
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <iostream>
#include <string.h>

static const char* producedDir = "/tmp/shared-dir";
static const char* producedTmpDir = "/tmp/new-dir";
static const char* producedThreadFile = "/tmp/shared-dir/Afile";

bool createFile(const char* path)
{
	const int fdOut = open(path,
			       O_WRONLY | O_CREAT | O_TRUNC | O_EXCL | O_CLOEXEC,
			       S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH);
	if (fdOut < 0)
		return false;
	assert(write(fdOut, "Foo", 4) == 4);
	assert(close(fdOut) == 0);
	return true;
}

void func()
{
	int nbSuccess = 0;
	// Loop producedThread create file in dir
	while (true) {
		if (createFile(producedThreadFile) != true) {
		std::cout << "Failed after " << nbSuccess << " with " << strerror(errno) << std::endl;
			exit(EXIT_FAILURE);
		} else {
			nbSuccess++;
		}
		unlink(producedThreadFile);
	}
}

int main()
{
	// Setup env
	rmdir(producedTmpDir);
	unlink(producedThreadFile);
	rmdir(producedDir);
	mkdir(producedDir, 0777);

	// Async thread doing a hardlink and moving it
	std::thread t(func);
	// Loop creating a .tmp and moving it
	while (true) {
		assert(mkdir(producedTmpDir, 0777) == 0);
		while (rename(producedTmpDir, producedDir) != 0)
			unlink(producedThreadFile);
	}
	return 0;
}

Al Viro Sept. 14, 2022, 1:30 a.m. UTC | #12

On Wed, Sep 14, 2022 at 10:14:44AM +1000, NeilBrown wrote:

> But they can race.  The open completes the lookup of /tmp/share-dir and
> holds the dentry, but the rename succeeds with inode_lock(target) in the
> code fragment you provided above before the open() can get the lock.
> By the time open() does get the lock, the dentry it holds will be marked
> S_DEAD and the __lookup_hash() will fail.
> So the open() returns -ENOENT - unexpected.
> 
> Test code below, based on the test code for the link race.
> 
> I wonder if S_DEAD should result in -ESTALE rather than -ENOENT.
> That would cause the various retry_estale() loops to retry the whole
> operation.  I suspect we'd actually need more subtlety than just that
> simple change, but it might be worth exploring.

	I don't think that wording in POSIX prohibits that ENOENT.
It does not (and no UNIX I've ever seen provides) atomicity of all
link traversals involved in pathname resolution.

root@sonny:/tmp# mkdir fubar
root@sonny:/tmp# cd fubar/
root@sonny:/tmp/fubar# rmdir ../fubar
root@sonny:/tmp/fubar# touch ./foo
touch: cannot touch './foo': No such file or directory

Do you agree that this is a reasonable behaviour?  That, BTW, is why
"retry on S_DEAD" is wrong - it can be a persistent state.

Now, replace rmdir ../fubar with rename("/tmp/barf", "/tmp/fubar") and
you'll get pretty much your testcase.  You are asking not just for
atomicity of rename vs. traversal of "fubar" in /tmp - that we already
have.  You are asking for the atomicity of the entire pathname resolution
of open() argument wrt rename().  And that is a much scarier can of worms.

Basically, pathname resolution happens on per-component basis and it's
*NOT* guaranteed that filesystem won't be changed between those or that
we won't observe such modifications.  If you check POSIX, you'll see
that it avoids any such promises - all it says (for directories; non-directory
case has similar verbiage) is this:

	If the directory named by the new argument exists, it shall be removed
	and old renamed to new. In this case, a link named new shall exist
	throughout the renaming operation and shall refer either to the directory
	referred to by new or old before the operation began.

It carefully stays the hell out of any pathname resolution atomicity promises -
all it's talking about is resolution of a single link.  It's enough to
guarantee that rename("old", "new") won't race with mkdir("new") allowing the
latter to succeed, with similar for creat(2), etc. - "new" is promised to
refer to an existing object at all times.  And that's what rename() atomicity
is normally for.  Operation on the link in question will observe either old
or new state; operation that passed through that link on the way to whatever
it will act upon has *also* observed either of those states - and the next
pathname component had been looked up in either old or new directory, but
there is no promise that nothing has happened to that directory since we got
there.

[v2] vfs: fix link vs. rename race

Commit Message

Comments

Patch