Message ID | 20220221082002.508392-1-mszeredi@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] vfs: fix link vs. rename race | expand |
On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote: > This patch introduces a global rw_semaphore that is locked for read in > rename and for write in link. To prevent excessive contention, do not take > the lock in link on the first try. If the source of the link was found to > be unlinked, then retry with the lock held. Tested-by: Xavier Roche <xavier.roche@algolia.com> Revalidated patch (on a 5.16.5) using previous torture test; - on ext4 filesystem - on a cryptfs volume
On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote: [digging through the old piles of mail] Eyes-watering control flow in do_linkat() aside (it's bound to rot; too much of it won't get any regression testing and it's convoluted enough to break easily), the main problem I have with that is the DoS potential. You have a system-wide lock, and if it's stuck you'll get every damn rename(2) stuck as well. Sure, having it taken only upon the race with rename() (or unlink(), for that matter) make it harder to get stuck with lock held, but that'll make the problem harder to reproduce and debug...
On Tue, Sep 13, 2022 at 03:04:17AM +0100, Al Viro wrote: > On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote: > > [digging through the old piles of mail] > > Eyes-watering control flow in do_linkat() aside (it's bound to rot; too > much of it won't get any regression testing and it's convoluted enough > to break easily), the main problem I have with that is the DoS potential. > > You have a system-wide lock, and if it's stuck you'll get every damn > rename(2) stuck as well. Sure, having it taken only upon the race > with rename() (or unlink(), for that matter) make it harder to get > stuck with lock held, but that'll make the problem harder to reproduce > and debug... FWIW, how much trouble would we have if link(2) would do the following? find the parent of source lock it look the child up verify it's a non-directory bump child's i_nlink all failure exits past that point decrement child's i_nlink unlock the parent find the parent of destination lock it look the destination up call vfs_link decrement child's i_nlink - vfs_link has bumped it unlock the parent of destination I do realize it can lead to leaked link count on a crash, obviously...
On Mon, 21 Feb 2022, Miklos Szeredi wrote: > There has been a longstanding race condition between rename(2) and link(2), > when those operations are done in parallel: > > 1. Moving a file to an existing target file (eg. mv file target) > 2. Creating a link from the target file to a third file (eg. ln target > link) > > By the time vfs_link() locks the target inode, it might already be unlinked > by rename. This results in vfs_link() returning -ENOENT in order to > prevent linking to already unlinked files. This check was introduced in > v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for > deleted file"). > > This breaks apparent atomicity of rename(2), which is described in > standards and the man page: > > "If newpath already exists, it will be atomically replaced, so that > there is no point at which another process attempting to access > newpath will find it missing." > > The simplest fix is to exclude renames for the complete link operation. Alternately, lock the "from" directory as well as the "to" directory. That would mean using lock_rename() and generally copying the structure of do_renameat2() into do_linkat() I wonder if you could get a similar race trying to create a file in (empty directory) /tmp/foo while /tmp/bar was being renamed over it. NeilBrown > > This patch introduces a global rw_semaphore that is locked for read in > rename and for write in link. To prevent excessive contention, do not take > the lock in link on the first try. If the source of the link was found to > be unlinked, then retry with the lock held. > > Reuse the lock_rename()/unlock_rename() helpers for the rename part. This > however needs special treatment for stacking fs (overlayfs, ecryptfs) to > prevent possible deadlocks. Introduce [un]lock_rename_stacked() for this > purpose. > > Reproducer can be found at: > > https://lore.kernel.org/all/20220216131814.GA2463301@xavier-xps/ > > Reported-by: Xavier Roche <xavier.roche@algolia.com> > Link: https://lore.kernel.org/all/20220214210708.GA2167841@xavier-xps/ > Fixes: aae8a97d3ec3 ("fs: Don't allow to create hardlink for deleted file") > Tested-by: Xavier Roche <xavier.roche@algolia.com> > Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> > --- > fs/ecryptfs/inode.c | 4 ++-- > fs/namei.c | 35 ++++++++++++++++++++++++++++++++--- > fs/overlayfs/copy_up.c | 4 ++-- > fs/overlayfs/dir.c | 12 ++++++------ > fs/overlayfs/util.c | 4 ++-- > include/linux/namei.h | 4 ++++ > 6 files changed, 48 insertions(+), 15 deletions(-) > > diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c > index 16d50dface59..f5c37599bd40 100644 > --- a/fs/ecryptfs/inode.c > +++ b/fs/ecryptfs/inode.c > @@ -596,7 +596,7 @@ ecryptfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir, > > target_inode = d_inode(new_dentry); > > - trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry); > + trap = lock_rename_stacked(lower_old_dir_dentry, lower_new_dir_dentry); > dget(lower_new_dentry); > rc = -EINVAL; > if (lower_old_dentry->d_parent != lower_old_dir_dentry) > @@ -631,7 +631,7 @@ ecryptfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir, > fsstack_copy_attr_all(old_dir, d_inode(lower_old_dir_dentry)); > out_lock: > dput(lower_new_dentry); > - unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry); > + unlock_rename_stacked(lower_old_dir_dentry, lower_new_dir_dentry); > return rc; > } > > diff --git a/fs/namei.c b/fs/namei.c > index 3f1829b3ab5b..27e671c354a6 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -122,6 +122,8 @@ > * PATH_MAX includes the nul terminator --RR. > */ > > +static DECLARE_RWSEM(link_rwsem); > + > #define EMBEDDED_NAME_MAX (PATH_MAX - offsetof(struct filename, iname)) > > struct filename * > @@ -2954,10 +2956,11 @@ static inline int may_create(struct user_namespace *mnt_userns, > return inode_permission(mnt_userns, dir, MAY_WRITE | MAY_EXEC); > } > > + > /* > * p1 and p2 should be directories on the same fs. > */ > -struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) > +struct dentry *lock_rename_stacked(struct dentry *p1, struct dentry *p2) > { > struct dentry *p; > > @@ -2986,9 +2989,16 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) > inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2); > return NULL; > } > +EXPORT_SYMBOL(lock_rename_stacked); > + > +struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) > +{ > + down_read(&link_rwsem); > + return lock_rename_stacked(p1, p2); > +} > EXPORT_SYMBOL(lock_rename); > > -void unlock_rename(struct dentry *p1, struct dentry *p2) > +void unlock_rename_stacked(struct dentry *p1, struct dentry *p2) > { > inode_unlock(p1->d_inode); > if (p1 != p2) { > @@ -2996,6 +3006,13 @@ void unlock_rename(struct dentry *p1, struct dentry *p2) > mutex_unlock(&p1->d_sb->s_vfs_rename_mutex); > } > } > +EXPORT_SYMBOL(unlock_rename_stacked); > + > +void unlock_rename(struct dentry *p1, struct dentry *p2) > +{ > + unlock_rename_stacked(p1, p2); > + up_read(&link_rwsem); > +} > EXPORT_SYMBOL(unlock_rename); > > /** > @@ -4456,6 +4473,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, > struct path old_path, new_path; > struct inode *delegated_inode = NULL; > int how = 0; > + bool lock = false; > int error; > > if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) { > @@ -4474,10 +4492,13 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, > > if (flags & AT_SYMLINK_FOLLOW) > how |= LOOKUP_FOLLOW; > +retry_lock: > + if (lock) > + down_write(&link_rwsem); > retry: > error = filename_lookup(olddfd, old, how, &old_path, NULL); > if (error) > - goto out_putnames; > + goto out_unlock_link; > > new_dentry = filename_create(newdfd, new, &new_path, > (how & LOOKUP_REVAL)); > @@ -4511,8 +4532,16 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, > how |= LOOKUP_REVAL; > goto retry; > } > + if (!lock && error == -ENOENT) { > + path_put(&old_path); > + lock = true; > + goto retry_lock; > + } > out_putpath: > path_put(&old_path); > +out_unlock_link: > + if (lock) > + up_write(&link_rwsem); > out_putnames: > putname(old); > putname(new); > diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c > index e040970408d4..911c3cec43c2 100644 > --- a/fs/overlayfs/copy_up.c > +++ b/fs/overlayfs/copy_up.c > @@ -670,7 +670,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c) > > /* workdir and destdir could be the same when copying up to indexdir */ > err = -EIO; > - if (lock_rename(c->workdir, c->destdir) != NULL) > + if (lock_rename_stacked(c->workdir, c->destdir) != NULL) > goto unlock; > > err = ovl_prep_cu_creds(c->dentry, &cc); > @@ -711,7 +711,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c) > if (S_ISDIR(inode->i_mode)) > ovl_set_flag(OVL_WHITEOUTS, inode); > unlock: > - unlock_rename(c->workdir, c->destdir); > + unlock_rename_stacked(c->workdir, c->destdir); > > return err; > > diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c > index f18490813170..fea397666174 100644 > --- a/fs/overlayfs/dir.c > +++ b/fs/overlayfs/dir.c > @@ -416,7 +416,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry, > > ovl_cleanup_whiteouts(upper, list); > ovl_cleanup(wdir, upper); > - unlock_rename(workdir, upperdir); > + unlock_rename_stacked(workdir, upperdir); > > /* dentry's upper doesn't match now, get rid of it */ > d_drop(dentry); > @@ -427,7 +427,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry, > ovl_cleanup(wdir, opaquedir); > dput(opaquedir); > out_unlock: > - unlock_rename(workdir, upperdir); > + unlock_rename_stacked(workdir, upperdir); > out: > return ERR_PTR(err); > } > @@ -551,7 +551,7 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode, > out_dput: > dput(upper); > out_unlock: > - unlock_rename(workdir, upperdir); > + unlock_rename_stacked(workdir, upperdir); > out: > if (!hardlink) { > posix_acl_release(acl); > @@ -790,7 +790,7 @@ static int ovl_remove_and_whiteout(struct dentry *dentry, > out_dput_upper: > dput(upper); > out_unlock: > - unlock_rename(workdir, upperdir); > + unlock_rename_stacked(workdir, upperdir); > out_dput: > dput(opaquedir); > out: > @@ -1187,7 +1187,7 @@ static int ovl_rename(struct user_namespace *mnt_userns, struct inode *olddir, > } > } > > - trap = lock_rename(new_upperdir, old_upperdir); > + trap = lock_rename_stacked(new_upperdir, old_upperdir); > > olddentry = lookup_one_len(old->d_name.name, old_upperdir, > old->d_name.len); > @@ -1281,7 +1281,7 @@ static int ovl_rename(struct user_namespace *mnt_userns, struct inode *olddir, > out_dput_old: > dput(olddentry); > out_unlock: > - unlock_rename(new_upperdir, old_upperdir); > + unlock_rename_stacked(new_upperdir, old_upperdir); > out_revert_creds: > revert_creds(old_cred); > if (update_nlink) > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c > index f48284a2a896..9358282278b1 100644 > --- a/fs/overlayfs/util.c > +++ b/fs/overlayfs/util.c > @@ -930,13 +930,13 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir) > goto err; > > /* Workdir should not be subdir of upperdir and vice versa */ > - if (lock_rename(workdir, upperdir) != NULL) > + if (lock_rename_stacked(workdir, upperdir) != NULL) > goto err_unlock; > > return 0; > > err_unlock: > - unlock_rename(workdir, upperdir); > + unlock_rename_stacked(workdir, upperdir); > err: > pr_err("failed to lock workdir+upperdir\n"); > return -EIO; > diff --git a/include/linux/namei.h b/include/linux/namei.h > index e89329bb3134..0a87bb0d56ce 100644 > --- a/include/linux/namei.h > +++ b/include/linux/namei.h > @@ -77,6 +77,10 @@ extern int follow_up(struct path *); > extern struct dentry *lock_rename(struct dentry *, struct dentry *); > extern void unlock_rename(struct dentry *, struct dentry *); > > +/* Special version of the above for stacking filesystems */ > +extern struct dentry *lock_rename_stacked(struct dentry *, struct dentry *); > +extern void unlock_rename_stacked(struct dentry *, struct dentry *); > + > extern int __must_check nd_jump_link(struct path *path); > > static inline void nd_terminate_link(void *name, size_t len, size_t maxlen) > -- > 2.34.1 > >
On Tue, Sep 13, 2022 at 02:41:51PM +1000, NeilBrown wrote: > On Mon, 21 Feb 2022, Miklos Szeredi wrote: > > There has been a longstanding race condition between rename(2) and link(2), > > when those operations are done in parallel: > > > > 1. Moving a file to an existing target file (eg. mv file target) > > 2. Creating a link from the target file to a third file (eg. ln target > > link) > > > > By the time vfs_link() locks the target inode, it might already be unlinked > > by rename. This results in vfs_link() returning -ENOENT in order to > > prevent linking to already unlinked files. This check was introduced in > > v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for > > deleted file"). > > > > This breaks apparent atomicity of rename(2), which is described in > > standards and the man page: > > > > "If newpath already exists, it will be atomically replaced, so that > > there is no point at which another process attempting to access > > newpath will find it missing." > > > > The simplest fix is to exclude renames for the complete link operation. > > Alternately, lock the "from" directory as well as the "to" directory. > That would mean using lock_rename() and generally copying the structure > of do_renameat2() into do_linkat() Ever done cp -al? Cross-directory renames are relatively rare; cross-directory links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex held a _lot_. > I wonder if you could get a similar race trying to create a file in > (empty directory) /tmp/foo while /tmp/bar was being renamed over it. Neil, no offense, but... if you have plans regarding changes in directory locking, you might consider reading through the file called Documentation/filesystems/directory-locking.rst Occasionally documentation is where one could expect to find it...
On Tue, Sep 13, 2022 at 06:20:10AM +0100, Al Viro wrote: > > Alternately, lock the "from" directory as well as the "to" directory. > > That would mean using lock_rename() and generally copying the structure > > of do_renameat2() into do_linkat() > > Ever done cp -al? Cross-directory renames are relatively rare; cross-directory > links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex > held a _lot_. > > > I wonder if you could get a similar race trying to create a file in > > (empty directory) /tmp/foo while /tmp/bar was being renamed over it. > > Neil, no offense, but... if you have plans regarding changes in > directory locking, you might consider reading through the file called > Documentation/filesystems/directory-locking.rst > > Occasionally documentation is where one could expect to find it... ... and that "..." above should've been ";-)" - it was not intended as a dig, especially since locking in that area has subtle and badly underdocumented parts (in particular, anything related to fh_to_dentry(), rules regarding the stability of ->d_name and ->d_parent, mount vs. dentry invalidation and too many other things), but the basic stuff like that is actually covered. FWIW, the answer to your question is that the victim of overwriting rename is held locked by caller of ->rename(); combined with the lock held on directory by anyone who modifies it that prevents the race you are asking about. See if (!is_dir || (flags & RENAME_EXCHANGE)) lock_two_nondirectories(source, target); else if (target) inode_lock(target); in vfs_rename().
On Tue, Sep 13, 2022 at 7:38 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > On Tue, Sep 13, 2022 at 03:04:17AM +0100, Al Viro wrote: > > On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote: > > > > [digging through the old piles of mail] > > > > Eyes-watering control flow in do_linkat() aside (it's bound to rot; too > > much of it won't get any regression testing and it's convoluted enough > > to break easily), the main problem I have with that is the DoS potential. > > > > You have a system-wide lock, and if it's stuck you'll get every damn > > rename(2) stuck as well. Sure, having it taken only upon the race > > with rename() (or unlink(), for that matter) make it harder to get > > stuck with lock held, but that'll make the problem harder to reproduce > > and debug... > > FWIW, how much trouble would we have if link(2) would do the following? > > find the parent of source Well, only if source is not AT_EMPTY_PATH > lock it > look the child up > verify it's a non-directory > bump child's i_nlink > all failure exits past that point decrement child's i_nlink No need to bump i_nlink. Sufficient to set I_LINKABLE. and clean it up on failure if i_nlink > 0. Or we can move cleanup of I_LINKABLE to drop_nlink()/clean_nlink() > unlock the parent > find the parent of destination > lock it > look the destination up > call vfs_link > decrement child's i_nlink - vfs_link has bumped it > unlock the parent of destination > > I do realize it can lead to leaked link count on a crash, obviously... No such problem with I_LINKABLE. Thanks, Amir.
On Tue, 13 Sept 2022 at 10:02, Amir Goldstein <amir73il@gmail.com> wrote: > > On Tue, Sep 13, 2022 at 7:38 AM Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > On Tue, Sep 13, 2022 at 03:04:17AM +0100, Al Viro wrote: > > > On Mon, Feb 21, 2022 at 09:20:02AM +0100, Miklos Szeredi wrote: > > > > > > [digging through the old piles of mail] > > > > > > Eyes-watering control flow in do_linkat() aside (it's bound to rot; too > > > much of it won't get any regression testing and it's convoluted enough > > > to break easily), the main problem I have with that is the DoS potential. > > > > > > You have a system-wide lock, and if it's stuck you'll get every damn > > > rename(2) stuck as well. Sure, having it taken only upon the race > > > with rename() (or unlink(), for that matter) make it harder to get > > > stuck with lock held, but that'll make the problem harder to reproduce > > > and debug... > > > > FWIW, how much trouble would we have if link(2) would do the following? > > > > find the parent of source > > Well, only if source is not AT_EMPTY_PATH The problem is not AT_EMPTY_PATH, but a disconnected dentry gotten through magic symlink. Theoretically link(2) should work for those, but I didn't try. > > > lock it > > look the child up > > verify it's a non-directory > > bump child's i_nlink > > all failure exits past that point decrement child's i_nlink > > No need to bump i_nlink. > Sufficient to set I_LINKABLE. > and clean it up on failure if i_nlink > 0. Bumping i_nlink seems hackish enough, currently all i_nlink modification is done by filesystem code. Setting I_LINKABLE seems safe for filesystems that have tmpfile, not otherwise. Thanks, Miklos
On Tue, 13 Sep 2022, Al Viro wrote: > On Tue, Sep 13, 2022 at 02:41:51PM +1000, NeilBrown wrote: > > On Mon, 21 Feb 2022, Miklos Szeredi wrote: > > > There has been a longstanding race condition between rename(2) and link(2), > > > when those operations are done in parallel: > > > > > > 1. Moving a file to an existing target file (eg. mv file target) > > > 2. Creating a link from the target file to a third file (eg. ln target > > > link) > > > > > > By the time vfs_link() locks the target inode, it might already be unlinked > > > by rename. This results in vfs_link() returning -ENOENT in order to > > > prevent linking to already unlinked files. This check was introduced in > > > v2.6.39 by commit aae8a97d3ec3 ("fs: Don't allow to create hardlink for > > > deleted file"). > > > > > > This breaks apparent atomicity of rename(2), which is described in > > > standards and the man page: > > > > > > "If newpath already exists, it will be atomically replaced, so that > > > there is no point at which another process attempting to access > > > newpath will find it missing." > > > > > > The simplest fix is to exclude renames for the complete link operation. > > > > Alternately, lock the "from" directory as well as the "to" directory. > > That would mean using lock_rename() and generally copying the structure > > of do_renameat2() into do_linkat() > > Ever done cp -al? Cross-directory renames are relatively rare; cross-directory > links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex > held a _lot_. As long as the approach is correct, it might be a good starting point for optimisation. We only need s_vfs_rename_mutex for link() so that we can reliably determine any parent/child relationship to get correct lock ordering to avoid deadlocks. So if we use trylock on the second lock and succeed then we don't need the mutex at all. e.g. - lookup the parents of both paths. - lock the "to" directory. - if the "from" directory is the same, or if a trylock of the from directory succeeds, then we have the locks and can perform the last component lookup and perform the link without racing with rename. - if the trylock fails, we drop the lock on "to" and use lock_rename(). We drop the s_vfs_rename_mutex immediately after lock_rename() so after the vfs_link() we just unlock both parent directories. That should avoid the mutex is most cases including "cp -al" Holding the "from" parent locked means that NFS could safely access the parent and basename, and send the lookup to the server to reduce the risk of a rename on one client racing with a link on another client. NFS doesn't guarantee that would still work (ops in a compound are not atomic) but it could still help. And the NFS server is *prevented* from making the lookup and link atomic. NeilBrown
On Wed, Sep 14, 2022 at 09:52:37AM +1000, NeilBrown wrote: > - lookup the parents of both paths. > - lock the "to" directory. > - if the "from" directory is the same, or if a trylock of the from > directory succeeds, then we have the locks and can perform the > last component lookup and perform the link without racing with > rename. > - if the trylock fails, we drop the lock on "to" and use lock_rename(). > We drop the s_vfs_rename_mutex immediately after lock_rename() > so after the vfs_link() we just unlock both parent directories. Umm... Care to put together an update of deadlock avoidance proof? The one in D/f/directory-locking.rst, that is. I'm not saying it's not a usable approach - it might very well work... BTW, one testcase worth profiling would be something like for i in `seq 100`; do cp -al linux.git linux.git-$i & done
On Tue, 13 Sep 2022, Al Viro wrote: > On Tue, Sep 13, 2022 at 06:20:10AM +0100, Al Viro wrote: > > > > Alternately, lock the "from" directory as well as the "to" directory. > > > That would mean using lock_rename() and generally copying the structure > > > of do_renameat2() into do_linkat() > > > > Ever done cp -al? Cross-directory renames are relatively rare; cross-directory > > links can be fairly heavy on some payloads, and you'll get ->s_vfs_rename_mutex > > held a _lot_. > > > > > I wonder if you could get a similar race trying to create a file in > > > (empty directory) /tmp/foo while /tmp/bar was being renamed over it. > > > > Neil, no offense, but... if you have plans regarding changes in > > directory locking, you might consider reading through the file called > > Documentation/filesystems/directory-locking.rst > > > > Occasionally documentation is where one could expect to find it... > > ... and that "..." above should've been ";-)" - it was not intended as > a dig, especially since locking in that area has subtle and badly > underdocumented parts (in particular, anything related to fh_to_dentry(), > rules regarding the stability of ->d_name and ->d_parent, mount vs. dentry > invalidation and too many other things), but the basic stuff like that > is actually covered. > > FWIW, the answer to your question is that the victim of overwriting > rename is held locked by caller of ->rename(); combined with the lock > held on directory by anyone who modifies it that prevents the race > you are asking about. > > See > if (!is_dir || (flags & RENAME_EXCHANGE)) > lock_two_nondirectories(source, target); > else if (target) > inode_lock(target); > in vfs_rename(). > I don't think those locks address the race I was thinking of. Suppose a /tmp/shared-dir is an empty directory and one thread runs rename("/tmp/tmp-dir", "/tmp/shared-dir") while another thread runs open("/tmp/shared-dir/Afile", O_CREAT | O_WRONLY) If the first wins the file is created in what was tmp-dir. If the second wins, the rename fails because shared-dir isn't empty. But they can race. The open completes the lookup of /tmp/share-dir and holds the dentry, but the rename succeeds with inode_lock(target) in the code fragment you provided above before the open() can get the lock. By the time open() does get the lock, the dentry it holds will be marked S_DEAD and the __lookup_hash() will fail. So the open() returns -ENOENT - unexpected. Test code below, based on the test code for the link race. I wonder if S_DEAD should result in -ESTALE rather than -ENOENT. That would cause the various retry_estale() loops to retry the whole operation. I suspect we'd actually need more subtlety than just that simple change, but it might be worth exploring. NeilBrown /* Linux rename vs. create race condition. * Rationale: both (1) moving a directory to an empty target and * (2) creating a file in the target can race. * The create should always succeed, but there should always be * directory there, either old or new. * The rename might fail if the target isn't empty. * Sample file courtesy of Xavier Grand at Algolia * g++ -pthread createbug.c -o createbug */ #include <thread> #include <unistd.h> #include <assert.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <iostream> #include <string.h> static const char* producedDir = "/tmp/shared-dir"; static const char* producedTmpDir = "/tmp/new-dir"; static const char* producedThreadFile = "/tmp/shared-dir/Afile"; bool createFile(const char* path) { const int fdOut = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_EXCL | O_CLOEXEC, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH); if (fdOut < 0) return false; assert(write(fdOut, "Foo", 4) == 4); assert(close(fdOut) == 0); return true; } void func() { int nbSuccess = 0; // Loop producedThread create file in dir while (true) { if (createFile(producedThreadFile) != true) { std::cout << "Failed after " << nbSuccess << " with " << strerror(errno) << std::endl; exit(EXIT_FAILURE); } else { nbSuccess++; } unlink(producedThreadFile); } } int main() { // Setup env rmdir(producedTmpDir); unlink(producedThreadFile); rmdir(producedDir); mkdir(producedDir, 0777); // Async thread doing a hardlink and moving it std::thread t(func); // Loop creating a .tmp and moving it while (true) { assert(mkdir(producedTmpDir, 0777) == 0); while (rename(producedTmpDir, producedDir) != 0) unlink(producedThreadFile); } return 0; }
On Wed, Sep 14, 2022 at 10:14:44AM +1000, NeilBrown wrote: > But they can race. The open completes the lookup of /tmp/share-dir and > holds the dentry, but the rename succeeds with inode_lock(target) in the > code fragment you provided above before the open() can get the lock. > By the time open() does get the lock, the dentry it holds will be marked > S_DEAD and the __lookup_hash() will fail. > So the open() returns -ENOENT - unexpected. > > Test code below, based on the test code for the link race. > > I wonder if S_DEAD should result in -ESTALE rather than -ENOENT. > That would cause the various retry_estale() loops to retry the whole > operation. I suspect we'd actually need more subtlety than just that > simple change, but it might be worth exploring. I don't think that wording in POSIX prohibits that ENOENT. It does not (and no UNIX I've ever seen provides) atomicity of all link traversals involved in pathname resolution. root@sonny:/tmp# mkdir fubar root@sonny:/tmp# cd fubar/ root@sonny:/tmp/fubar# rmdir ../fubar root@sonny:/tmp/fubar# touch ./foo touch: cannot touch './foo': No such file or directory Do you agree that this is a reasonable behaviour? That, BTW, is why "retry on S_DEAD" is wrong - it can be a persistent state. Now, replace rmdir ../fubar with rename("/tmp/barf", "/tmp/fubar") and you'll get pretty much your testcase. You are asking not just for atomicity of rename vs. traversal of "fubar" in /tmp - that we already have. You are asking for the atomicity of the entire pathname resolution of open() argument wrt rename(). And that is a much scarier can of worms. Basically, pathname resolution happens on per-component basis and it's *NOT* guaranteed that filesystem won't be changed between those or that we won't observe such modifications. If you check POSIX, you'll see that it avoids any such promises - all it says (for directories; non-directory case has similar verbiage) is this: If the directory named by the new argument exists, it shall be removed and old renamed to new. In this case, a link named new shall exist throughout the renaming operation and shall refer either to the directory referred to by new or old before the operation began. It carefully stays the hell out of any pathname resolution atomicity promises - all it's talking about is resolution of a single link. It's enough to guarantee that rename("old", "new") won't race with mkdir("new") allowing the latter to succeed, with similar for creat(2), etc. - "new" is promised to refer to an existing object at all times. And that's what rename() atomicity is normally for. Operation on the link in question will observe either old or new state; operation that passed through that link on the way to whatever it will act upon has *also* observed either of those states - and the next pathname component had been looked up in either old or new directory, but there is no promise that nothing has happened to that directory since we got there.
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index 16d50dface59..f5c37599bd40 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -596,7 +596,7 @@ ecryptfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir, target_inode = d_inode(new_dentry); - trap = lock_rename(lower_old_dir_dentry, lower_new_dir_dentry); + trap = lock_rename_stacked(lower_old_dir_dentry, lower_new_dir_dentry); dget(lower_new_dentry); rc = -EINVAL; if (lower_old_dentry->d_parent != lower_old_dir_dentry) @@ -631,7 +631,7 @@ ecryptfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir, fsstack_copy_attr_all(old_dir, d_inode(lower_old_dir_dentry)); out_lock: dput(lower_new_dentry); - unlock_rename(lower_old_dir_dentry, lower_new_dir_dentry); + unlock_rename_stacked(lower_old_dir_dentry, lower_new_dir_dentry); return rc; } diff --git a/fs/namei.c b/fs/namei.c index 3f1829b3ab5b..27e671c354a6 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -122,6 +122,8 @@ * PATH_MAX includes the nul terminator --RR. */ +static DECLARE_RWSEM(link_rwsem); + #define EMBEDDED_NAME_MAX (PATH_MAX - offsetof(struct filename, iname)) struct filename * @@ -2954,10 +2956,11 @@ static inline int may_create(struct user_namespace *mnt_userns, return inode_permission(mnt_userns, dir, MAY_WRITE | MAY_EXEC); } + /* * p1 and p2 should be directories on the same fs. */ -struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) +struct dentry *lock_rename_stacked(struct dentry *p1, struct dentry *p2) { struct dentry *p; @@ -2986,9 +2989,16 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2); return NULL; } +EXPORT_SYMBOL(lock_rename_stacked); + +struct dentry *lock_rename(struct dentry *p1, struct dentry *p2) +{ + down_read(&link_rwsem); + return lock_rename_stacked(p1, p2); +} EXPORT_SYMBOL(lock_rename); -void unlock_rename(struct dentry *p1, struct dentry *p2) +void unlock_rename_stacked(struct dentry *p1, struct dentry *p2) { inode_unlock(p1->d_inode); if (p1 != p2) { @@ -2996,6 +3006,13 @@ void unlock_rename(struct dentry *p1, struct dentry *p2) mutex_unlock(&p1->d_sb->s_vfs_rename_mutex); } } +EXPORT_SYMBOL(unlock_rename_stacked); + +void unlock_rename(struct dentry *p1, struct dentry *p2) +{ + unlock_rename_stacked(p1, p2); + up_read(&link_rwsem); +} EXPORT_SYMBOL(unlock_rename); /** @@ -4456,6 +4473,7 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, struct path old_path, new_path; struct inode *delegated_inode = NULL; int how = 0; + bool lock = false; int error; if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) { @@ -4474,10 +4492,13 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, if (flags & AT_SYMLINK_FOLLOW) how |= LOOKUP_FOLLOW; +retry_lock: + if (lock) + down_write(&link_rwsem); retry: error = filename_lookup(olddfd, old, how, &old_path, NULL); if (error) - goto out_putnames; + goto out_unlock_link; new_dentry = filename_create(newdfd, new, &new_path, (how & LOOKUP_REVAL)); @@ -4511,8 +4532,16 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, how |= LOOKUP_REVAL; goto retry; } + if (!lock && error == -ENOENT) { + path_put(&old_path); + lock = true; + goto retry_lock; + } out_putpath: path_put(&old_path); +out_unlock_link: + if (lock) + up_write(&link_rwsem); out_putnames: putname(old); putname(new); diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c index e040970408d4..911c3cec43c2 100644 --- a/fs/overlayfs/copy_up.c +++ b/fs/overlayfs/copy_up.c @@ -670,7 +670,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c) /* workdir and destdir could be the same when copying up to indexdir */ err = -EIO; - if (lock_rename(c->workdir, c->destdir) != NULL) + if (lock_rename_stacked(c->workdir, c->destdir) != NULL) goto unlock; err = ovl_prep_cu_creds(c->dentry, &cc); @@ -711,7 +711,7 @@ static int ovl_copy_up_workdir(struct ovl_copy_up_ctx *c) if (S_ISDIR(inode->i_mode)) ovl_set_flag(OVL_WHITEOUTS, inode); unlock: - unlock_rename(c->workdir, c->destdir); + unlock_rename_stacked(c->workdir, c->destdir); return err; diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c index f18490813170..fea397666174 100644 --- a/fs/overlayfs/dir.c +++ b/fs/overlayfs/dir.c @@ -416,7 +416,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry, ovl_cleanup_whiteouts(upper, list); ovl_cleanup(wdir, upper); - unlock_rename(workdir, upperdir); + unlock_rename_stacked(workdir, upperdir); /* dentry's upper doesn't match now, get rid of it */ d_drop(dentry); @@ -427,7 +427,7 @@ static struct dentry *ovl_clear_empty(struct dentry *dentry, ovl_cleanup(wdir, opaquedir); dput(opaquedir); out_unlock: - unlock_rename(workdir, upperdir); + unlock_rename_stacked(workdir, upperdir); out: return ERR_PTR(err); } @@ -551,7 +551,7 @@ static int ovl_create_over_whiteout(struct dentry *dentry, struct inode *inode, out_dput: dput(upper); out_unlock: - unlock_rename(workdir, upperdir); + unlock_rename_stacked(workdir, upperdir); out: if (!hardlink) { posix_acl_release(acl); @@ -790,7 +790,7 @@ static int ovl_remove_and_whiteout(struct dentry *dentry, out_dput_upper: dput(upper); out_unlock: - unlock_rename(workdir, upperdir); + unlock_rename_stacked(workdir, upperdir); out_dput: dput(opaquedir); out: @@ -1187,7 +1187,7 @@ static int ovl_rename(struct user_namespace *mnt_userns, struct inode *olddir, } } - trap = lock_rename(new_upperdir, old_upperdir); + trap = lock_rename_stacked(new_upperdir, old_upperdir); olddentry = lookup_one_len(old->d_name.name, old_upperdir, old->d_name.len); @@ -1281,7 +1281,7 @@ static int ovl_rename(struct user_namespace *mnt_userns, struct inode *olddir, out_dput_old: dput(olddentry); out_unlock: - unlock_rename(new_upperdir, old_upperdir); + unlock_rename_stacked(new_upperdir, old_upperdir); out_revert_creds: revert_creds(old_cred); if (update_nlink) diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c index f48284a2a896..9358282278b1 100644 --- a/fs/overlayfs/util.c +++ b/fs/overlayfs/util.c @@ -930,13 +930,13 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir) goto err; /* Workdir should not be subdir of upperdir and vice versa */ - if (lock_rename(workdir, upperdir) != NULL) + if (lock_rename_stacked(workdir, upperdir) != NULL) goto err_unlock; return 0; err_unlock: - unlock_rename(workdir, upperdir); + unlock_rename_stacked(workdir, upperdir); err: pr_err("failed to lock workdir+upperdir\n"); return -EIO; diff --git a/include/linux/namei.h b/include/linux/namei.h index e89329bb3134..0a87bb0d56ce 100644 --- a/include/linux/namei.h +++ b/include/linux/namei.h @@ -77,6 +77,10 @@ extern int follow_up(struct path *); extern struct dentry *lock_rename(struct dentry *, struct dentry *); extern void unlock_rename(struct dentry *, struct dentry *); +/* Special version of the above for stacking filesystems */ +extern struct dentry *lock_rename_stacked(struct dentry *, struct dentry *); +extern void unlock_rename_stacked(struct dentry *, struct dentry *); + extern int __must_check nd_jump_link(struct path *path); static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)