Message ID | 20230601105830.13168-4-jack@suse.cz (mailing list archive) |
---|---|
State | Deferred, archived |
Headers | show |
Series | fs: Fix directory corruption when moving directories | expand |
On Thu, Jun 01, 2023 at 12:58:24PM +0200, Jan Kara wrote: > Currently the locking order of inode locks for directories that are not > in ancestor relationship is not defined because all operations that > needed to lock two directories like this were serialized by > sb->s_vfs_rename_mutex. However some filesystems need to lock two > subdirectories for RENAME_EXCHANGE operations and for this we need the > locking order established even for two tree-unrelated directories. > Provide a helper function lock_two_inodes() that establishes lock > ordering for any two inodes and use it in lock_two_directories(). > > CC: stable@vger.kernel.org > Signed-off-by: Jan Kara <jack@suse.cz> > --- > fs/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++++ > fs/internal.h | 2 ++ > fs/namei.c | 4 ++-- > 3 files changed, 46 insertions(+), 2 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index 577799b7855f..4000ab08bbc0 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1103,6 +1103,48 @@ void discard_new_inode(struct inode *inode) > } > EXPORT_SYMBOL(discard_new_inode); > > +/** > + * lock_two_inodes - lock two inodes (may be regular files but also dirs) > + * > + * Lock any non-NULL argument. The caller must make sure that if he is passing > + * in two directories, one is not ancestor of the other. Zero, one or two > + * objects may be locked by this function. > + * > + * @inode1: first inode to lock > + * @inode2: second inode to lock > + * @subclass1: inode lock subclass for the first lock obtained > + * @subclass2: inode lock subclass for the second lock obtained > + */ > +void lock_two_inodes(struct inode *inode1, struct inode *inode2, > + unsigned subclass1, unsigned subclass2) > +{ > + if (!inode1 || !inode2) I think you forgot the opening bracket... I can just fix this up for you though. > + /* > + * Make sure @subclass1 will be used for the acquired lock. > + * This is not strictly necessary (no current caller cares) but > + * let's keep things consistent. > + */ > + if (!inode1) > + swap(inode1, inode2); > + goto lock; > + }
On Thu 01-06-23 15:58:58, Christian Brauner wrote: > On Thu, Jun 01, 2023 at 12:58:24PM +0200, Jan Kara wrote: > > Currently the locking order of inode locks for directories that are not > > in ancestor relationship is not defined because all operations that > > needed to lock two directories like this were serialized by > > sb->s_vfs_rename_mutex. However some filesystems need to lock two > > subdirectories for RENAME_EXCHANGE operations and for this we need the > > locking order established even for two tree-unrelated directories. > > Provide a helper function lock_two_inodes() that establishes lock > > ordering for any two inodes and use it in lock_two_directories(). > > > > CC: stable@vger.kernel.org > > Signed-off-by: Jan Kara <jack@suse.cz> > > --- > > fs/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++++ > > fs/internal.h | 2 ++ > > fs/namei.c | 4 ++-- > > 3 files changed, 46 insertions(+), 2 deletions(-) > > > > diff --git a/fs/inode.c b/fs/inode.c > > index 577799b7855f..4000ab08bbc0 100644 > > --- a/fs/inode.c > > +++ b/fs/inode.c > > @@ -1103,6 +1103,48 @@ void discard_new_inode(struct inode *inode) > > } > > EXPORT_SYMBOL(discard_new_inode); > > > > +/** > > + * lock_two_inodes - lock two inodes (may be regular files but also dirs) > > + * > > + * Lock any non-NULL argument. The caller must make sure that if he is passing > > + * in two directories, one is not ancestor of the other. Zero, one or two > > + * objects may be locked by this function. > > + * > > + * @inode1: first inode to lock > > + * @inode2: second inode to lock > > + * @subclass1: inode lock subclass for the first lock obtained > > + * @subclass2: inode lock subclass for the second lock obtained > > + */ > > +void lock_two_inodes(struct inode *inode1, struct inode *inode2, > > + unsigned subclass1, unsigned subclass2) > > +{ > > + if (!inode1 || !inode2) > > I think you forgot the opening bracket... > I can just fix this up for you though. Oh, yes. Apparently I forgot to rerun git-format-patch after fixing up this bit. I'm sorry for that. The patch series has survived full ext4 fstests run on my end. Honza
... > > > + * Lock any non-NULL argument. The caller must make sure that if he is passing > > > + * in two directories, one is not ancestor of the other Not directly relevant to this change but is the 'not an ancestor' check actually robust? I found a condition in which the kernel 'pwd' code (which follows the inode chain) failed to stop at the base of a chroot. I suspect that the ancestor check would fail the same way. IIRC the problematic code used unshare() to 'escape' from a network natespace. If it was inside a chroot (that wasn't on a mount point) there ware two copies of the 'chroot /' inode and the match failed. I might be able to find the test case. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Thu, Jun 01, 2023 at 05:24:49PM +0200, Jan Kara wrote: > On Thu 01-06-23 15:58:58, Christian Brauner wrote: > > On Thu, Jun 01, 2023 at 12:58:24PM +0200, Jan Kara wrote: > > > Currently the locking order of inode locks for directories that are not > > > in ancestor relationship is not defined because all operations that > > > needed to lock two directories like this were serialized by > > > sb->s_vfs_rename_mutex. However some filesystems need to lock two > > > subdirectories for RENAME_EXCHANGE operations and for this we need the > > > locking order established even for two tree-unrelated directories. > > > Provide a helper function lock_two_inodes() that establishes lock > > > ordering for any two inodes and use it in lock_two_directories(). > > > > > > CC: stable@vger.kernel.org > > > Signed-off-by: Jan Kara <jack@suse.cz> > > > --- > > > fs/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++++ > > > fs/internal.h | 2 ++ > > > fs/namei.c | 4 ++-- > > > 3 files changed, 46 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > index 577799b7855f..4000ab08bbc0 100644 > > > --- a/fs/inode.c > > > +++ b/fs/inode.c > > > @@ -1103,6 +1103,48 @@ void discard_new_inode(struct inode *inode) > > > } > > > EXPORT_SYMBOL(discard_new_inode); > > > > > > +/** > > > + * lock_two_inodes - lock two inodes (may be regular files but also dirs) > > > + * > > > + * Lock any non-NULL argument. The caller must make sure that if he is passing > > > + * in two directories, one is not ancestor of the other. Zero, one or two > > > + * objects may be locked by this function. > > > + * > > > + * @inode1: first inode to lock > > > + * @inode2: second inode to lock > > > + * @subclass1: inode lock subclass for the first lock obtained > > > + * @subclass2: inode lock subclass for the second lock obtained > > > + */ > > > +void lock_two_inodes(struct inode *inode1, struct inode *inode2, > > > + unsigned subclass1, unsigned subclass2) > > > +{ > > > + if (!inode1 || !inode2) > > > > I think you forgot the opening bracket... > > I can just fix this up for you though. > > Oh, yes. Apparently I forgot to rerun git-format-patch after fixing up this > bit. I'm sorry for that. The patch series has survived full ext4 fstests No problem at all!
On Thu 01-06-23 15:37:32, David Laight wrote: > ... > > > > + * Lock any non-NULL argument. The caller must make sure that if he is passing > > > > + * in two directories, one is not ancestor of the other > > Not directly relevant to this change but is the 'not an ancestor' > check actually robust? > > I found a condition in which the kernel 'pwd' code (which follows > the inode chain) failed to stop at the base of a chroot. > > I suspect that the ancestor check would fail the same way. Honestly, I'm not sure how this could be the case but I'm not a dcache expert. d_ancestor() works on dentries and the whole dcache code pretty much relies on the fact that there always is at most one dentry for any directory. Also in case we call d_ancestor() from this code, we have the whole filesystem locked from any other directory moves so the ancestor relationship of two dirs cannot change (which is different from pwd code AFAIK). So IMHO no failure is possible in our case. Honza > > IIRC the problematic code used unshare() to 'escape' from > a network natespace. > If it was inside a chroot (that wasn't on a mount point) there > ware two copies of the 'chroot /' inode and the match failed. > > I might be able to find the test case. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) >
On Thu, Jun 01, 2023 at 06:13:53PM +0200, Jan Kara wrote: > On Thu 01-06-23 15:37:32, David Laight wrote: > > ... > > > > > + * Lock any non-NULL argument. The caller must make sure that if he is passing > > > > > + * in two directories, one is not ancestor of the other > > > > Not directly relevant to this change but is the 'not an ancestor' > > check actually robust? > > > > I found a condition in which the kernel 'pwd' code (which follows > > the inode chain) failed to stop at the base of a chroot. > > > > I suspect that the ancestor check would fail the same way. > > Honestly, I'm not sure how this could be the case but I'm not a dcache > expert. d_ancestor() works on dentries and the whole dcache code pretty > much relies on the fact that there always is at most one dentry for any > directory. Also in case we call d_ancestor() from this code, we have the > whole filesystem locked from any other directory moves so the ancestor > relationship of two dirs cannot change (which is different from pwd code > AFAIK). So IMHO no failure is possible in our case. Yes, this is a red herring. What matters is that the tree topology can't change which is up to the caller to guarantee. And where it's called we're under s_vfs_rename_mutex. It's also literally mentioned in the directory locking documentation.
From: Jan Kara <jack@suse.cz> > Sent: 01 June 2023 17:14 > > On Thu 01-06-23 15:37:32, David Laight wrote: > > ... > > > > > + * Lock any non-NULL argument. The caller must make sure that if he is passing > > > > > + * in two directories, one is not ancestor of the other > > > > Not directly relevant to this change but is the 'not an ancestor' > > check actually robust? > > > > I found a condition in which the kernel 'pwd' code (which follows > > the inode chain) failed to stop at the base of a chroot. > > > > I suspect that the ancestor check would fail the same way. > > Honestly, I'm not sure how this could be the case but I'm not a dcache > expert. d_ancestor() works on dentries and the whole dcache code pretty > much relies on the fact that there always is at most one dentry for any > directory. Also in case we call d_ancestor() from this code, we have the > whole filesystem locked from any other directory moves so the ancestor > relationship of two dirs cannot change (which is different from pwd code > AFAIK). So IMHO no failure is possible in our case. I've found the test program. This uses readlinkat() to get the full path /proc/self/fd/0. It should be inside the chroot, but the comparison done to detect the 'root' fails. Now maybe any rename that would hit this is invalid for other reasons. But something is awry somewhere. David The program below reproduces this when run with stdin redirected to a file in the current directory. This sequence is used by 'ip netns exec' so isn't actually that unusual. David #define _GNU_SOURCE #include <unistd.h> #include <stdio.h> #include <fcntl.h> #include <sched.h> static void print_link(const char *where, int fd) { char buf[256]; printf("%s: %.*s\n", where, (int)readlinkat(fd, "", buf, sizeof buf), buf); } int main(int argc, char **argv) { int link_fd = open("/proc/self/fd/0", O_PATH | O_NOFOLLOW); print_link("initial", link_fd); if (chroot(".")) return 1; print_link("after chroot", link_fd); if (unshare(CLONE_NEWNS)) return 2; print_link("after unshare", link_fd); return 0; } - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Hi Jan, kernel test robot noticed the following build errors: [auto build test ERROR on tytso-ext4/dev] [also build test ERROR on jaegeuk-f2fs/dev-test jaegeuk-f2fs/dev linus/master v6.4-rc4 next-20230601] [cannot apply to vfs-idmapping/for-next] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Jan-Kara/ext4-Remove-ext4-locking-of-moved-directory/20230601-225100 base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev patch link: https://lore.kernel.org/r/20230601105830.13168-4-jack%40suse.cz patch subject: [PATCH v2 4/6] fs: Establish locking order for unrelated directories config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20230602/202306020948.TBmCxtVw-lkp@intel.com/config) compiler: gcc-12 (Debian 12.2.0-14) 12.2.0 reproduce (this is a W=1 build): # https://github.com/intel-lab-lkp/linux/commit/234d970a1de0d79e372cc04d6a8112d2aec56c44 git remote add linux-review https://github.com/intel-lab-lkp/linux git fetch --no-tags linux-review Jan-Kara/ext4-Remove-ext4-locking-of-moved-directory/20230601-225100 git checkout 234d970a1de0d79e372cc04d6a8112d2aec56c44 # save the config file mkdir build_dir && cp config build_dir/.config make W=1 O=build_dir ARCH=um SUBARCH=i386 olddefconfig make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash If you fix the issue, kindly add following tag where applicable | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202306020948.TBmCxtVw-lkp@intel.com/ All error/warnings (new ones prefixed by >>): fs/inode.c: In function 'lock_two_inodes': >> fs/inode.c:1121:9: warning: this 'if' clause does not guard... [-Wmisleading-indentation] 1121 | if (!inode1 || !inode2) | ^~ fs/inode.c:1129:17: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'if' 1129 | goto lock; | ^~~~ >> fs/inode.c:1129:17: error: label 'lock' used but not defined fs/inode.c: At top level: >> fs/inode.c:1136:9: error: expected identifier or '(' before 'if' 1136 | if (S_ISDIR(inode2->i_mode) == S_ISDIR(inode1->i_mode)) { | ^~ >> fs/inode.c:1139:11: error: expected identifier or '(' before 'else' 1139 | } else if (!S_ISDIR(inode1->i_mode)) | ^~~~ In file included from include/linux/kernel.h:27, from include/linux/cpumask.h:10, from include/linux/smp.h:13, from include/linux/lockdep.h:14, from include/linux/spinlock.h:63, from include/linux/wait.h:9, from include/linux/wait_bit.h:8, from include/linux/fs.h:6, from fs/inode.c:7: >> include/linux/minmax.h:167:63: error: expected identifier or '(' before 'while' 167 | do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0) | ^~~~~ fs/inode.c:1140:17: note: in expansion of macro 'swap' 1140 | swap(inode1, inode2); | ^~~~ >> fs/inode.c:1141:5: error: expected '=', ',', ';', 'asm' or '__attribute__' before ':' token 1141 | lock: | ^ fs/inode.c:1144:9: error: expected identifier or '(' before 'if' 1144 | if (inode2 && inode2 != inode1) | ^~ >> fs/inode.c:1146:1: error: expected identifier or '(' before '}' token 1146 | } | ^ vim +/lock +1129 fs/inode.c 1105 1106 /** 1107 * lock_two_inodes - lock two inodes (may be regular files but also dirs) 1108 * 1109 * Lock any non-NULL argument. The caller must make sure that if he is passing 1110 * in two directories, one is not ancestor of the other. Zero, one or two 1111 * objects may be locked by this function. 1112 * 1113 * @inode1: first inode to lock 1114 * @inode2: second inode to lock 1115 * @subclass1: inode lock subclass for the first lock obtained 1116 * @subclass2: inode lock subclass for the second lock obtained 1117 */ 1118 void lock_two_inodes(struct inode *inode1, struct inode *inode2, 1119 unsigned subclass1, unsigned subclass2) 1120 { > 1121 if (!inode1 || !inode2) 1122 /* 1123 * Make sure @subclass1 will be used for the acquired lock. 1124 * This is not strictly necessary (no current caller cares) but 1125 * let's keep things consistent. 1126 */ 1127 if (!inode1) 1128 swap(inode1, inode2); > 1129 goto lock; 1130 } 1131 1132 /* 1133 * If one object is directory and the other is not, we must make sure 1134 * to lock directory first as the other object may be its child. 1135 */ > 1136 if (S_ISDIR(inode2->i_mode) == S_ISDIR(inode1->i_mode)) { 1137 if (inode1 > inode2) 1138 swap(inode1, inode2); > 1139 } else if (!S_ISDIR(inode1->i_mode)) 1140 swap(inode1, inode2); > 1141 lock: 1142 if (inode1) 1143 inode_lock_nested(inode1, subclass1); 1144 if (inode2 && inode2 != inode1) 1145 inode_lock_nested(inode2, subclass2); > 1146 } 1147
On Thu, Jun 01, 2023 at 04:33:58PM +0000, David Laight wrote: > From: Jan Kara <jack@suse.cz> > > Sent: 01 June 2023 17:14 > > > > On Thu 01-06-23 15:37:32, David Laight wrote: > > > ... > > > > > > + * Lock any non-NULL argument. The caller must make sure that if he is passing > > > > > > + * in two directories, one is not ancestor of the other > > > > > > Not directly relevant to this change but is the 'not an ancestor' > > > check actually robust? > > > > > > I found a condition in which the kernel 'pwd' code (which follows > > > the inode chain) failed to stop at the base of a chroot. > > > > > > I suspect that the ancestor check would fail the same way. > > > > Honestly, I'm not sure how this could be the case but I'm not a dcache > > expert. d_ancestor() works on dentries and the whole dcache code pretty > > much relies on the fact that there always is at most one dentry for any > > directory. Also in case we call d_ancestor() from this code, we have the > > whole filesystem locked from any other directory moves so the ancestor > > relationship of two dirs cannot change (which is different from pwd code > > AFAIK). So IMHO no failure is possible in our case. > > I've found the test program. > This uses readlinkat() to get the full path /proc/self/fd/0. > It should be inside the chroot, but the comparison done > to detect the 'root' fails. That's intentional and relied-upon behavior. In glibc alone for tty validation it wants the full link path returned. So any change in this is an immediate widespread userspace regression. > > Now maybe any rename that would hit this is invalid > for other reasons. > But something is awry somewhere. It really isn't. > > David > > The program below reproduces this when run with stdin > redirected to a file in the current directory. > > This sequence is used by 'ip netns exec' so isn't actually Fwiw, it doesn't use chroot() at all. > that unusual. > > David > > #define _GNU_SOURCE > #include <unistd.h> > #include <stdio.h> > #include <fcntl.h> > #include <sched.h> > > static void print_link(const char *where, int fd) > { > char buf[256]; > > printf("%s: %.*s\n", where, (int)readlinkat(fd, "", buf, sizeof buf), buf); > } > > int main(int argc, char **argv) > { > int link_fd = open("/proc/self/fd/0", O_PATH | O_NOFOLLOW); > > print_link("initial", link_fd); > if (chroot(".")) chroot(2): "This call changes an ingredient in the pathname resolution process and does nothing else. In particular, it is not intended [...] to restrict filesystem system calls. [...] This call does not close open file descriptors, and such file descriptors may allow access to files outside the chroot tree." > return 1; > print_link("after chroot", link_fd); > if (unshare(CLONE_NEWNS)) > return 2; > print_link("after unshare", link_fd); > return 0; > } But anyway, the code sample you provided is using O_PATH | O_NOFOLLOW to open magic link, i.e., /proc/<pid>/fd/<nr>. That means whatever the magic link refers to isn't really reopened. You can create these magic link references trivially for every path: int fd = open("/tmp", 0); // create fd referencing magic link sprintf(buf, "/proc/self/fd/%d", fd); int link_fd = open(buf, O_PATH | O_NOFOLLOW); In fact, you don't even need magic links for that. You can get the same behavior with any symlink: ln -sf /usr /BLUB linkt_fd = open("/BLUB", O_PATH | O_NOFOLLOW); If you pass such a fd to readlinkat() then d_path() will give you the full path whether it's accessible in your namespace/chroot/pivot_root() or not. Look at __prepend_path() current->fs->root is only used to terminate the walk for fds that are scopable _beneath_ your chroot: mkdir -p /A/B/C touch /A/B/C/D chroot("/A/B/C"); int fd = open("/A/B/C/D", 0); sprintf(buf, "/proc/self/fd/%d", fd); int link_fd = open(buf, O_PATH | O_NOFOLLOW); In this case, you'll see that after chroot("/A/B/C") it'll print: /D And this actually makes a lot of sense. The fd for /A/B/C/D is scoped beneath your chroot(). But an fd pointing outside of your chroot is not scoped by the chroot because you can also very well do: fchdir(fd-outside-chroot) And btw, orderings such as: chroot() unshare(CLONE_NEWNS) aren't intuitive. It seems that you're under the impression that the unshare(CLONE_NEWNS) doesn't have any effect on the chroot() but it does. Going back to the previous example: mkdir -p /A/B/C touch /A/B/C/D chroot("/A/B/C"); int fd = open("/A/B/C/D", 0); sprintf(buf, "/proc/self/fd/%d", fd); int link_fd = open(buf, O_PATH | O_NOFOLLOW); Compare what gets printed after the chroot() and after unshare(CLONE_NEWNS). You'll see /D after the chroot() but again the full path /A/B/C/D after the unshare(). Why? The reason is that if the mount that you're currently chroot()ed into is copied as part of the unshare(CLONE_NEWNS) then current->fs->root will be updated to refer to the copy. But since this is a copy it means that __prepend_path() doesn't terminate the walk at /D. That's seemingly counterintuitive but makes sense if you consider that you were moved into a new mount namespace. The mount the fd refers to is now inaccessible from your mount namespace and so the full path is returned again. Yes, that's not straightforward but heavily relied upon so even if we could change it to be less surprising it would break the hell out of everyone. And most of this doesn't have anything to do with ancestor relationships per se since this code is able to detect concurrent tree modifications through rename_lock seqlock iirc. That's a related but different problem. The effects you're seeing are caused by mount semantics more than anything else. And btw about /proc/self/fd/0 specifically... Not verifying an fd pointing to a pty device in any type of sandbox in the age of containers is ripe for confusion. Quoting from work I did on glibc years ago: "It's a common practice among container managers to allocate a PTY master/slave pair in the host's mount namespace (the slave having a path like "/dev/pty/$X"), bind mount the slave to "/dev/console" in the container's mount namespace, and send the slave FD to a process in the container. Inside of the container, the slave-end isn't available at its original path ("/dev/pts/$X"), since the container mount namespace has a separate devpts instance from the host (that path may or may not exist in the container; if it does exist, it's not the same PTY slave device)."
diff --git a/fs/inode.c b/fs/inode.c index 577799b7855f..4000ab08bbc0 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1103,6 +1103,48 @@ void discard_new_inode(struct inode *inode) } EXPORT_SYMBOL(discard_new_inode); +/** + * lock_two_inodes - lock two inodes (may be regular files but also dirs) + * + * Lock any non-NULL argument. The caller must make sure that if he is passing + * in two directories, one is not ancestor of the other. Zero, one or two + * objects may be locked by this function. + * + * @inode1: first inode to lock + * @inode2: second inode to lock + * @subclass1: inode lock subclass for the first lock obtained + * @subclass2: inode lock subclass for the second lock obtained + */ +void lock_two_inodes(struct inode *inode1, struct inode *inode2, + unsigned subclass1, unsigned subclass2) +{ + if (!inode1 || !inode2) + /* + * Make sure @subclass1 will be used for the acquired lock. + * This is not strictly necessary (no current caller cares) but + * let's keep things consistent. + */ + if (!inode1) + swap(inode1, inode2); + goto lock; + } + + /* + * If one object is directory and the other is not, we must make sure + * to lock directory first as the other object may be its child. + */ + if (S_ISDIR(inode2->i_mode) == S_ISDIR(inode1->i_mode)) { + if (inode1 > inode2) + swap(inode1, inode2); + } else if (!S_ISDIR(inode1->i_mode)) + swap(inode1, inode2); +lock: + if (inode1) + inode_lock_nested(inode1, subclass1); + if (inode2 && inode2 != inode1) + inode_lock_nested(inode2, subclass2); +} + /** * lock_two_nondirectories - take two i_mutexes on non-directory objects * diff --git a/fs/internal.h b/fs/internal.h index bd3b2810a36b..377030a50aca 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -152,6 +152,8 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry); bool in_group_or_capable(struct mnt_idmap *idmap, const struct inode *inode, vfsgid_t vfsgid); +void lock_two_inodes(struct inode *inode1, struct inode *inode2, + unsigned subclass1, unsigned subclass2); /* * fs-writeback.c diff --git a/fs/namei.c b/fs/namei.c index e4fe0879ae55..148570aabe74 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3028,8 +3028,8 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2) return p; } - inode_lock_nested(p1->d_inode, I_MUTEX_PARENT); - inode_lock_nested(p2->d_inode, I_MUTEX_PARENT2); + lock_two_inodes(p1->d_inode, p2->d_inode, + I_MUTEX_PARENT, I_MUTEX_PARENT2); return NULL; }
Currently the locking order of inode locks for directories that are not in ancestor relationship is not defined because all operations that needed to lock two directories like this were serialized by sb->s_vfs_rename_mutex. However some filesystems need to lock two subdirectories for RENAME_EXCHANGE operations and for this we need the locking order established even for two tree-unrelated directories. Provide a helper function lock_two_inodes() that establishes lock ordering for any two inodes and use it in lock_two_directories(). CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> --- fs/inode.c | 42 ++++++++++++++++++++++++++++++++++++++++++ fs/internal.h | 2 ++ fs/namei.c | 4 ++-- 3 files changed, 46 insertions(+), 2 deletions(-)