Message ID | 166330881189.15759.13499931397891560275@noble.neil.brown.name (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] VFS: lock source directory for link to avoid rename race. | expand |
On Fri, 16 Sept 2022 at 08:13, NeilBrown <neilb@suse.de> wrote: > @@ -4554,44 +4590,83 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, > if (flags & AT_SYMLINK_FOLLOW) > how |= LOOKUP_FOLLOW; > retry: > - error = filename_lookup(olddfd, old, how, &old_path, NULL); > + err2 = 0; > + error = filename_parentat(olddfd, old, how, &old_path, > + &old_last, &old_type); > if (error) > goto out_putnames; > + error = -EISDIR; > + if (old_type != LAST_NORM && !(flags & AT_EMPTY_PATH)) > + goto out_putnames; > + error = filename_parentat(newdfd, new, (how & LOOKUP_REVAL), &new_path, > + &new_last, &new_type); > + if (error) > + goto out_putoldpath; > > - new_dentry = filename_create(newdfd, new, &new_path, > - (how & LOOKUP_REVAL)); > - error = PTR_ERR(new_dentry); > - if (IS_ERR(new_dentry)) > - goto out_putpath; > + err2 = mnt_want_write(new_path.mnt); > > error = -EXDEV; > if (old_path.mnt != new_path.mnt) > - goto out_dput; > + goto out_putnewpath; > + lock_link(new_path.dentry, old_path.dentry, flags); > + > + new_dentry = __lookup_hash(&new_last, new_path.dentry, how & LOOKUP_REVAL); > + error = PTR_ERR(new_dentry); > + if (IS_ERR(new_dentry)) > + goto out_unlock; > + error = -EEXIST; > + if (d_is_positive(new_dentry)) > + goto out_dput_new; > + if (new_type != LAST_NORM) > + goto out_dput_new; > + > + error = err2; > + if (error) > + goto out_dput_new; > + > + if (flags & AT_EMPTY_PATH) > + old_dentry = dget(old_path.dentry); > + else > + old_dentry = __lookup_hash(&old_last, old_path.dentry, how); This will break AT_SYMLINK_FOLLOW. And yes, we can add all the lookup logic to do_linkat() at which point it will about 10x more complex than it was. Thanks, Miklos
On Fri, 16 Sep 2022, Miklos Szeredi wrote: > > This will break AT_SYMLINK_FOLLOW. > > And yes, we can add all the lookup logic to do_linkat() at which point > it will about 10x more complex than it was. Excellent point. I'll give that some thought. Thanks. NeilBrown
On Fri, Sep 16, 2022 at 08:28:06AM +0200, Miklos Szeredi wrote: > This will break AT_SYMLINK_FOLLOW. Right you are. > And yes, we can add all the lookup logic to do_linkat() at which point > it will about 10x more complex than it was. Especially since you can't reject an apparent cross-fs link until you'v looked the fucker up, since it just might be a symlink to be followed. Which means it would have to be something like find parents again: if on different mounts if !follow fuck off lock old parent look the last component up if not an existing symlink fuck off unlock the parent and try to follow that symlink goto again lock parents look the last components up if symlink to be followed unlock parents try to follow symlink goto again proceed Not exactly fatal, but...
On Fri, Sep 16, 2022 at 9:26 AM NeilBrown <neilb@suse.de> wrote: > > > rename(2) is documented as > > If newpath already exists, it will be atomically replaced, so > that there is no point at which another process attempting to > access newpath will find it missing. > > However link(2) from a given path can race with rename renaming to that > path so that link gets -ENOENT because the path has already been unlinked > by rename, and creating a link to an unlinked file is not permitted. > I have to ask. Is this a real problem or just a matter of respecting the laws of this man page? If we manage to return EBUSY in that case to link(2) will everyone be happy and we can avoid trying to make link(2) atomic w.r.t. rename(2)? Thanks, Amir.
On Fri, Sep 16, 2022 at 05:32:45PM +0300, Amir Goldstein wrote: > On Fri, Sep 16, 2022 at 9:26 AM NeilBrown <neilb@suse.de> wrote: > > > > > > rename(2) is documented as > > > > If newpath already exists, it will be atomically replaced, so > > that there is no point at which another process attempting to > > access newpath will find it missing. > > > > However link(2) from a given path can race with rename renaming to that > > path so that link gets -ENOENT because the path has already been unlinked > > by rename, and creating a link to an unlinked file is not permitted. > > > > I have to ask. Is this a real problem or just a matter of respecting > the laws of this man page? I have to say that I have the same reaction. The commit message doesn't really explain where the current behavior becomes an issue and whether there are any users seeing issues with this. And the patch makes do_linkat() way more complex than it was before.
On Mon, 19 Sep 2022, Christian Brauner wrote: > On Fri, Sep 16, 2022 at 05:32:45PM +0300, Amir Goldstein wrote: > > On Fri, Sep 16, 2022 at 9:26 AM NeilBrown <neilb@suse.de> wrote: > > > > > > > > > rename(2) is documented as > > > > > > If newpath already exists, it will be atomically replaced, so > > > that there is no point at which another process attempting to > > > access newpath will find it missing. > > > > > > However link(2) from a given path can race with rename renaming to that > > > path so that link gets -ENOENT because the path has already been unlinked > > > by rename, and creating a link to an unlinked file is not permitted. > > > > > > > I have to ask. Is this a real problem or just a matter of respecting > > the laws of this man page? > > I have to say that I have the same reaction. The commit message doesn't > really explain where the current behavior becomes an issue and whether > there are any users seeing issues with this. And the patch makes > do_linkat() way more complex than it was before. > A bug is a bug .... and in this case it is an intriguing puzzle too. Yes, the commit message could say a bit more about context. The patch also isn't correct, so the complexity is not relevant in this case. Some complexity will likely been needed (I do have a really simple patch that just retries the whole op, but I don't think that is safe), and we do need to balance the complexity against the value. Ideally we could end up making the code simpler ... I'm not sure I can manage that though :-) Thanks, NeilBrown
Greeting, FYI, we noticed the following commit (built with gcc-11): commit: 3fb4ec6faac286d97e27f48715f2cda56a701cd3 ("[PATCH RFC] VFS: lock source directory for link to avoid rename race.") url: https://github.com/intel-lab-lkp/linux/commits/NeilBrown/VFS-lock-source-directory-for-link-to-avoid-rename-race/20220916-141546 base: https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git for-next patch link: https://lore.kernel.org/linux-fsdevel/166330881189.15759.13499931397891560275@noble.neil.brown.name in testcase: ltp version: ltp-x86_64-14c1f76-1_20220829 with following parameters: disk: 1HDD fs: ext4 test: syscalls-01 test-description: The LTP testsuite contains a collection of tools for testing the Linux kernel and related features. test-url: http://linux-test-project.github.io/ on test machine: 4 threads Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (Skylake) with 32G memory caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag | Reported-by: kernel test robot <oliver.sang@intel.com> | Link: https://lore.kernel.org/r/202209231042.4f771ffe-oliver.sang@intel.com <<<test_start>>> tag=linkat02 stime=1663772091 cmdline="linkat02" contacts="" analysis=exit <<<test_output>>> mke2fs 1.46.5 (30-Dec-2021) linkat02 0 TINFO : Found free device 0 '/dev/loop0' linkat02 0 TINFO : Formatting /dev/loop0 with ext2 opts='' extra opts='' linkat02 0 TINFO : the maximum number of hard links to emlink_dir/testfile0 is hit: 65000 linkat02 1 TPASS : linkat failed as expected: TEST_ERRNO=ENAMETOOLONG(36): File name too long linkat02 2 TPASS : linkat failed as expected: TEST_ERRNO=ENAMETOOLONG(36): File name too long linkat02 3 TPASS : linkat failed as expected: TEST_ERRNO=EEXIST(17): File exists linkat02 4 TFAIL : linkat02.c:132: linkat failed unexpectedly; expected: 40 - Too many levels of symbolic links: TEST_ERRNO=EEXIST(17): File exists linkat02 5 TPASS : linkat failed as expected: TEST_ERRNO=EACCES(13): Permission denied linkat02 6 TFAIL : linkat02.c:132: linkat failed unexpectedly; expected: 30 - Read-only file system: TEST_ERRNO=EXDEV(18): Invalid cross-device link linkat02 7 TPASS : linkat failed as expected: TEST_ERRNO=EMLINK(31): Too many links To reproduce: git clone https://github.com/intel/lkp-tests.git cd lkp-tests sudo bin/lkp install job.yaml # job file is attached in this email bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run sudo bin/lkp run generated-yaml-file # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
diff --git a/Documentation/filesystems/directory-locking.rst b/Documentation/filesystems/directory-locking.rst index 504ba940c36c..da6fa5eff81d 100644 --- a/Documentation/filesystems/directory-locking.rst +++ b/Documentation/filesystems/directory-locking.rst @@ -11,7 +11,7 @@ When taking the i_rwsem on multiple non-directory objects, we always acquire the locks in order by increasing address. We'll call that "inode pointer" order in the following. -For our purposes all operations fall in 5 classes: +For our purposes all operations fall in 7 classes: 1) read access. Locking rules: caller locks directory we are accessing. The lock is taken shared. @@ -31,9 +31,9 @@ Then call the method. All locks are exclusive. NB: we might get away with locking the source (and target in exchange case) shared. -5) link creation. Locking rules: +5) link creation - source and target in name directory. Locking rules: - * lock parent + * lock parent before looking up base names * check that source is not a directory * lock source * call the method. @@ -58,6 +58,22 @@ rules: All ->i_rwsem are taken exclusive. Again, we might get away with locking the source (and target in exchange case) shared. +7) cross-directory link. This requires the source directory to be locked +so the source cannot be the target of a rename and so be unlinked before +the link happens (creating a link to an unlinked file is illegal). + +Same rules as cross-directory rename can be used (with different errors). +Locking the filesystem is expensive and often unnecessary so we have a +fast path that avoids it. Locking rules: + + * lock target parent + * trylock source parent. If this fails we unlock target parent + and fall back to full rename locking, then unlock filesystem once + directory locks are held. + * lookup base names + +Lock on source may be shared. + The rules above obviously guarantee that all directories that are going to be read, modified or removed by method will be locked by caller. @@ -101,6 +117,9 @@ non-directory objects are not included in the set of contended locks. Thus link creation can't be a part of deadlock - it can't be blocked on source and it means that it doesn't hold any locks. +The fast-path in link create cannot deadlock as it never blocks while +holding a lock. + Any contended object is either held by cross-directory rename or has a child that is also contended. Indeed, suppose that it is held by operation other than cross-directory rename. Then the lock this operation diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 4bb2627026ec..3190bb18f1c2 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -92,7 +92,7 @@ ops i_rwsem(inode) ============= ============================================= lookup: shared create: exclusive -link: exclusive (both) +link: exclusive (both) possibly shared on source dir mknod: exclusive symlink: exclusive mkdir: exclusive @@ -117,7 +117,11 @@ fileattr_set: exclusive Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem exclusive on victim. + ->rename() has ->i_rwsem on target if it exists, and also on + source if it is a non-directory. cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. + ->link() may have shared ->i_rwsem on source directory if it is + different from target directory. See Documentation/filesystems/directory-locking.rst for more detailed discussion of the locking scheme for directory operations. diff --git a/fs/namei.c b/fs/namei.c index 53b4bc094db2..877cac4e2e63 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -4518,6 +4518,39 @@ int vfs_link(struct dentry *old_dentry, struct user_namespace *mnt_userns, } EXPORT_SYMBOL(vfs_link); +static void lock_link(struct dentry *dest, struct dentry *source, int flags) +{ + inode_lock_nested(dest->d_inode, I_MUTEX_PARENT); + if (dest == source || (flags & AT_EMPTY_PATH)) + return; + if (inode_trylock_shared(source->d_inode)) + return; + + /* Need rename mutex */ + inode_unlock(dest->d_inode); + + mutex_lock(&dest->d_sb->s_vfs_rename_mutex); + + if (d_ancestor(dest, source)) { + inode_lock_nested(dest->d_inode, I_MUTEX_PARENT); + inode_lock_shared_nested(source->d_inode, I_MUTEX_CHILD); + } else if (d_ancestor(source, dest)) { + inode_lock_shared_nested(source->d_inode, I_MUTEX_PARENT); + inode_lock_nested(dest->d_inode, I_MUTEX_CHILD); + } else { + inode_lock_nested(dest->d_inode, I_MUTEX_PARENT); + inode_lock_shared_nested(source->d_inode, I_MUTEX_PARENT2); + } + mutex_unlock(&dest->d_sb->s_vfs_rename_mutex); +} + +static void unlock_link(struct dentry *dest, struct dentry *source, int flags) +{ + if (source != dest && !(flags & AT_EMPTY_PATH)) + inode_unlock_shared(source->d_inode); + inode_unlock(dest->d_inode); +} + /* * Hardlinks are often used in delicate situations. We avoid * security-related surprises by not following symlinks on the @@ -4531,11 +4564,14 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, struct filename *new, int flags) { struct user_namespace *mnt_userns; - struct dentry *new_dentry; - struct path old_path, new_path; + struct dentry *old_dentry, *new_dentry; + struct path old_path, new_path, link_path; + struct qstr old_last, new_last; + int old_type, new_type; struct inode *delegated_inode = NULL; int how = 0; int error; + int err2; if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0) { error = -EINVAL; @@ -4554,44 +4590,83 @@ int do_linkat(int olddfd, struct filename *old, int newdfd, if (flags & AT_SYMLINK_FOLLOW) how |= LOOKUP_FOLLOW; retry: - error = filename_lookup(olddfd, old, how, &old_path, NULL); + err2 = 0; + error = filename_parentat(olddfd, old, how, &old_path, + &old_last, &old_type); if (error) goto out_putnames; + error = -EISDIR; + if (old_type != LAST_NORM && !(flags & AT_EMPTY_PATH)) + goto out_putnames; + error = filename_parentat(newdfd, new, (how & LOOKUP_REVAL), &new_path, + &new_last, &new_type); + if (error) + goto out_putoldpath; - new_dentry = filename_create(newdfd, new, &new_path, - (how & LOOKUP_REVAL)); - error = PTR_ERR(new_dentry); - if (IS_ERR(new_dentry)) - goto out_putpath; + err2 = mnt_want_write(new_path.mnt); error = -EXDEV; if (old_path.mnt != new_path.mnt) - goto out_dput; + goto out_putnewpath; + lock_link(new_path.dentry, old_path.dentry, flags); + + new_dentry = __lookup_hash(&new_last, new_path.dentry, how & LOOKUP_REVAL); + error = PTR_ERR(new_dentry); + if (IS_ERR(new_dentry)) + goto out_unlock; + error = -EEXIST; + if (d_is_positive(new_dentry)) + goto out_dput_new; + if (new_type != LAST_NORM) + goto out_dput_new; + + error = err2; + if (error) + goto out_dput_new; + + if (flags & AT_EMPTY_PATH) + old_dentry = dget(old_path.dentry); + else + old_dentry = __lookup_hash(&old_last, old_path.dentry, how); + error = PTR_ERR(old_dentry); + if (IS_ERR(old_dentry)) + goto out_dput_new; + error = -ENOENT; + if (d_is_negative(old_dentry)) + goto out_dput_old; + mnt_userns = mnt_user_ns(new_path.mnt); - error = may_linkat(mnt_userns, &old_path); + link_path.mnt = old_path.mnt; + link_path.dentry = old_dentry; + error = may_linkat(mnt_userns, &link_path); if (unlikely(error)) - goto out_dput; - error = security_path_link(old_path.dentry, &new_path, new_dentry); + goto out_dput_old; + error = security_path_link(old_dentry, &new_path, new_dentry); if (error) - goto out_dput; - error = vfs_link(old_path.dentry, mnt_userns, new_path.dentry->d_inode, + goto out_dput_old; + error = vfs_link(old_dentry, mnt_userns, new_path.dentry->d_inode, new_dentry, &delegated_inode); -out_dput: - done_path_create(&new_path, new_dentry); +out_dput_old: + dput(old_dentry); +out_dput_new: + dput(new_dentry); +out_unlock: + unlock_link(new_path.dentry, old_path.dentry, flags); +out_putnewpath: + if (!err2) + mnt_drop_write(new_path.mnt); + path_put(&new_path); +out_putoldpath: + path_put(&old_path); if (delegated_inode) { error = break_deleg_wait(&delegated_inode); - if (!error) { - path_put(&old_path); + if (!error) goto retry; - } } if (retry_estale(error, how)) { - path_put(&old_path); how |= LOOKUP_REVAL; goto retry; } -out_putpath: - path_put(&old_path); out_putnames: putname(old); putname(new);
rename(2) is documented as If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing. However link(2) from a given path can race with rename renaming to that path so that link gets -ENOENT because the path has already been unlinked by rename, and creating a link to an unlinked file is not permitted. This can be fixed by locking the source directory before performing the lookup of the final component. The lock blocks rename from changing that component. We already lock the target directory, so when they are different we need to be careful about deadlocks. In the worst case we can use the same strategy as lock_rename() however the cost of s_vfs_rename_mutex is not always needed and is best avoided. Firstly we lock the target and if the source is different we try-lock that. This cannot deadlock as we never block while holding a lock. If the trylock fails, we drop the first lock, take s_vfs_rename_mutex, and follow the same pattern as rename_lock(). We only take a shared lock on the source directory. ->link() functions cannot expect the source directory to be locked as some callers of vfs_link() already have a dentry and do not perform the lookup that can race with rename. nfsd is a clear example - if there is a race it can happen on the client or between clients, and NFS as specified cannot avoid that. The handling of AT_EMPTY_PATH is a little inelegant. Reported-by: Xavier Roche <xavier.roche@algolia.com> Link: https://lore.kernel.org/all/20220214210708.GA2167841@xavier-xps/ Fixes: aae8a97d3ec3 ("fs: Don't allow to create hardlink for deleted file") Reported-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> --- .../filesystems/directory-locking.rst | 25 +++- Documentation/filesystems/locking.rst | 6 +- fs/namei.c | 119 ++++++++++++++---- 3 files changed, 124 insertions(+), 26 deletions(-)