Message ID | 20221111220614.991928-5-stephen.s.brennan@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fsnotify: fix softlockups iterating over d_subdirs | expand |
On Sat, Nov 12, 2022 at 12:06 AM Stephen Brennan <stephen.s.brennan@oracle.com> wrote: > > With very large d_subdirs lists, iteration can take a long time. Since > iteration needs to hold alias->d_lock, this can trigger soft lockups. > It would be best to make this iteration sleepable. We can drop > alias->d_lock and sleep, by taking a reference to the current child. > However, we need to be careful, since it's possible for the child's > list pointers to be modified once we drop alias->d_lock. The following > are the only cases where the list pointers are modified: > > 1. dentry_unlist() in fs/dcache.c > > This is a helper of dentry_kill(). This function is quite careful to > check the reference count of the dentry once it has taken the > requisite locks, and bail out if a new reference was taken. So we > can be safe that, assuming we found the dentry and took a reference > before dropping alias->d_lock, any concurrent dentry_kill() should > bail out and leave our list pointers untouched. > > 2. __d_move() in fs/dcache.c > > If the child was moved to a new parent, then we can detect this by > testing d_parent and retrying the iteration. > > 3. Initialization code in d_alloc() family > > We are safe from this code, since we cannot encounter a dentry until > it has been initialized. > > 4. Cursor code in fs/libfs.c for dcache_readdir() > > Dentries with DCACHE_DENTRY_CURSOR should be skipped before sleeping, > since we could awaken to find they have skipped over a portion of the > child list. > > Given these considerations, we can carefully write a loop that iterates > over d_subdirs and is capable of going to sleep periodically. > > Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> > --- > > Notes: > v4: > - I've updated this patch so it should be safe even without the > inode locked, by handling cursors and d_move() races. > - Moved comments to lower indentation > - I didn't keep Amir's R-b since this was a fair bit of change. > v3: > - removed if statements around dput() > v2: > - added a check for child->d_parent != alias and retry logic > > fs/notify/fsnotify.c | 46 ++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 42 insertions(+), 4 deletions(-) > > diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c > index 409d479cbbc6..0ba61211456c 100644 > --- a/fs/notify/fsnotify.c > +++ b/fs/notify/fsnotify.c > @@ -105,7 +105,7 @@ void fsnotify_sb_delete(struct super_block *sb) > */ > void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched) > { > - struct dentry *alias, *child; > + struct dentry *alias, *child, *last_ref = NULL; > > if (!S_ISDIR(inode->i_mode)) > return; > @@ -116,12 +116,49 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched) > return; > > /* > - * run all of the children of the original inode and fix their > - * d_flags to indicate parental interest (their parent is the > - * original inode) > + * These lists can get very long, so we may need to sleep during > + * iteration. Normally this would be impossible without a cursor, > + * but since we have the inode locked exclusive, we're guaranteed Not exactly true for v4 patchset order, but I do prefer that we make it true. > + * that the directory won't be modified, so whichever dentry we > + * pick to sleep on won't get moved. So, start a manual iteration > + * over d_subdirs which will allow us to sleep. > */ > spin_lock(&alias->d_lock); > +retry: Better if we can avoid this retry by inode lock. Note that it is enough to take inode_lock_shared() to protect this code. It means that tasks doing remove+add parent watch may iterate d_subdirs in parallel, but maybe that's even better then having them iterate d_subdirs sequentially? > list_for_each_entry(child, &alias->d_subdirs, d_child) { > + /* > + * We need to hold a reference while we sleep. But we cannot > + * sleep holding a reference to a cursor, or we risk skipping > + * through the list. > + * > + * When we wake, dput() could free the dentry, invalidating the > + * list pointers. We can't look at the list pointers until we > + * re-lock the parent, and we can't dput() once we have the > + * parent locked. So the solution is to hold onto our reference > + * and free it the *next* time we drop alias->d_lock: either at > + * the end of the function, or at the time of the next sleep. > + */ My usual nit picking: you could concat this explanation to the comment outside the loop. It won't make it any less clear, maybe even more clear in the wider context of how the children are safely iterated. Thanks, Amir.
Hi, Stephen Brennan, we made report upon v1 patch https://lore.kernel.org/all/202210271500.731e3808-yujie.liu@intel.com/ now we noticed similar issue still on v4. FYI. Greeting, FYI, we noticed WARNING:possible_recursive_locking_detected due to commit (built with gcc-11): commit: 5482581c7c96a8fe60bb29c181e86cdc9889b068 ("[PATCH v4 4/5] fsnotify: allow sleepable child flag update") url: https://github.com/intel-lab-lkp/linux/commits/Stephen-Brennan/fsnotify-clear-PARENT_WATCHED-flags-lazily/20221112-070656 base: https://git.kernel.org/cgit/linux/kernel/git/jack/linux-fs.git fsnotify patch subject: [PATCH v4 4/5] fsnotify: allow sleepable child flag update in testcase: boot on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag | Reported-by: kernel test robot <oliver.sang@intel.com> | Link: https://lore.kernel.org/oe-lkp/202211151309.1fdf8fd0-oliver.sang@intel.com [ 16.419535][ T369] WARNING: possible recursive locking detected [ 16.420080][ T369] 6.0.0-rc4-00068-g5482581c7c96 #1 Not tainted [ 16.420595][ T369] -------------------------------------------- [ 16.421105][ T369] dnsmasq/369 is trying to acquire lock: [ 16.421578][ T369] bfaa3858 (&dentry->d_lock){+.+.}-{2:2}, at: lockref_get (??:?) [ 16.422251][ T369] [ 16.422251][ T369] but task is already holding lock: [ 16.422868][ T369] bf804d60 (&dentry->d_lock){+.+.}-{2:2}, at: fsnotify_update_children_dentry_flags (??:?) [ 16.423733][ T369] [ 16.423733][ T369] other info that might help us debug this: [ 16.424438][ T369] Possible unsafe locking scenario: [ 16.424438][ T369] [ 16.425082][ T369] CPU0 [ 16.425376][ T369] ---- [ 16.425660][ T369] lock(&dentry->d_lock); [ 16.426052][ T369] lock(&dentry->d_lock); [ 16.426439][ T369] [ 16.426439][ T369] *** DEADLOCK *** [ 16.426439][ T369] [ 16.427109][ T369] May be due to missing lock nesting notation [ 16.427109][ T369] [ 16.427808][ T369] 2 locks held by dnsmasq/369: [ 16.428219][ T369] #0: 409474b0 (&group->mark_mutex){+.+.}-{3:3}, at: inotify_update_watch (inotify_user.c:?) [ 16.429047][ T369] #1: bf804d60 (&dentry->d_lock){+.+.}-{2:2}, at: fsnotify_update_children_dentry_flags (??:?) [ 16.430016][ T369] [ 16.430016][ T369] stack backtrace: [ 16.430538][ T369] CPU: 1 PID: 369 Comm: dnsmasq Not tainted 6.0.0-rc4-00068-g5482581c7c96 #1 [ 16.431276][ T369] Call Trace: [ 16.431560][ T369] ? dump_stack_lvl (??:?) [ 16.431990][ T369] ? dump_stack (??:?) [ 16.432347][ T369] ? validate_chain.cold (lockdep.c:?) [ 16.432791][ T369] ? __lock_acquire (lockdep.c:?) [ 16.433233][ T369] ? lock_acquire (??:?) [ 16.433651][ T369] ? lockref_get (??:?) [ 16.434034][ T369] ? __lock_release (lockdep.c:?) [ 16.434454][ T369] ? fsnotify_update_children_dentry_flags (??:?) [ 16.435069][ T369] ? _raw_spin_lock (??:?) [ 16.435465][ T369] ? lockref_get (??:?) [ 16.435832][ T369] ? lockref_get (??:?) [ 16.436190][ T369] ? fsnotify_update_children_dentry_flags (??:?) [ 16.436753][ T369] ? fsnotify_recalc_mask (??:?) [ 16.437229][ T369] ? fsnotify_add_mark_locked (??:?) [ 16.437716][ T369] ? inotify_new_watch (inotify_user.c:?) [ 16.438166][ T369] ? inotify_update_watch (inotify_user.c:?) [ 16.438627][ T369] ? __ia32_sys_inotify_add_watch (??:?) [ 16.439183][ T369] ? do_int80_syscall_32 (??:?) [ 16.439640][ T369] ? entry_INT80_32 (entry_32.o:?) LKP: ttyS0: 641: Kernel tests: Boot OK! LKP: ttyS0: 641: HOSTNAME vm-snb, MAC 52:54:00:12:34:56, kernel 6.0.0-rc4-00068-g5482581c7c96 1 LKP: ttyS0: 641: /lkp/lkp/src/bin/run-lkp /lkp/jobs/scheduled/vm-meta-42/boot-1-openwrt-i386-generic-20190428.cgz-5482581c7c96a8fe60bb29c181e86cdc9889b068-20221115-34610-11px8cv-40.yaml [ 18.633802][ T655] mkdir: can't create directory '/sys/kernel/debug': Operation not permitted [ 18.633802][ T655] mount: mounting debug on /sys/kernel/debug failed: No such file or directory [ 21.139501][ T856] process 856 (wait) attempted a POSIX timer syscall while CONFIG_POSIX_TIMERS is not set [ 32.651080][ T957] rsync (957) used greatest stack depth: 6208 bytes left LKP: ttyS0: 641: LKP: rebooting forcely [ 42.464678][ T641] sysrq: Emergency Sync [ 42.465117][ T641] sysrq: Resetting Kboot worker: lkp-worker53 Elapsed time: 60 kvm=( qemu-system-x86_64 -enable-kvm -cpu SandyBridge -kernel $kernel -initrd initrd-vm-meta-42.cgz -m 16384 -smp 2 -device e1000,netdev=net0 -netdev user,id=net0,hostfwd=tcp::32032-:22 -boot order=nc -no-reboot -device i6300esb -watchdog-action debug -rtc base=localtime -serial stdio -display none -monitor null ) append=( ip=::::vm-meta-42::dhcp root=/dev/ram0 RESULT_ROOT=/result/boot/1/vm-snb/openwrt-i386-generic-20190428.cgz/i386-randconfig-a011-20210930/gcc-11/5482581c7c96a8fe60bb29c181e86cdc9889b068/73 BOOT_IMAGE=/pkg/linux/i386-randconfig-a011-20210930/gcc-11/5482581c7c96a8fe60bb29c181e86cdc9889b068/vmlinuz-6.0.0-rc4-00068-g5482581c7c96 branch=linux-review/Stephen-Brennan/fsnotify-clear-PARENT_WATCHED-flags-lazily/20221112-070656 job=/job-script user=lkp ARCH=i386 kconfig=i386-randconfig-a011-20210930 commit=5482581c7c96a8fe60bb29c181e86cdc9889b068 vmalloc=256M initramfs_async=0 page_owner=on initcall_debug max_uptime=600 result_service=tmpfs selinux=0 debug apic=debug To reproduce: # build kernel cd linux cp config-6.0.0-rc4-00068-g5482581c7c96 .config make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install cd <mod-install-dir> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email # if come across any failure that blocks the test, # please remove ~/.lkp and /lkp dir to run from a clean state.
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index 409d479cbbc6..0ba61211456c 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -105,7 +105,7 @@ void fsnotify_sb_delete(struct super_block *sb) */ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched) { - struct dentry *alias, *child; + struct dentry *alias, *child, *last_ref = NULL; if (!S_ISDIR(inode->i_mode)) return; @@ -116,12 +116,49 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched) return; /* - * run all of the children of the original inode and fix their - * d_flags to indicate parental interest (their parent is the - * original inode) + * These lists can get very long, so we may need to sleep during + * iteration. Normally this would be impossible without a cursor, + * but since we have the inode locked exclusive, we're guaranteed + * that the directory won't be modified, so whichever dentry we + * pick to sleep on won't get moved. So, start a manual iteration + * over d_subdirs which will allow us to sleep. */ spin_lock(&alias->d_lock); +retry: list_for_each_entry(child, &alias->d_subdirs, d_child) { + /* + * We need to hold a reference while we sleep. But we cannot + * sleep holding a reference to a cursor, or we risk skipping + * through the list. + * + * When we wake, dput() could free the dentry, invalidating the + * list pointers. We can't look at the list pointers until we + * re-lock the parent, and we can't dput() once we have the + * parent locked. So the solution is to hold onto our reference + * and free it the *next* time we drop alias->d_lock: either at + * the end of the function, or at the time of the next sleep. + */ + if (child->d_flags & DCACHE_DENTRY_CURSOR) + continue; + if (need_resched()) { + dget(child); + spin_unlock(&alias->d_lock); + dput(last_ref); + last_ref = child; + cond_resched(); + spin_lock(&alias->d_lock); + /* Check for races with __d_move() */ + if (child->d_parent != alias) + goto retry; + } + + /* + * This check is after the cond_resched() for a reason: it is + * safe to sleep holding a negative dentry reference. If the + * directory contained many negative dentries, and we skipped + * them checking need_resched(), we may never end up sleeping + * where necessary, and could trigger a soft lockup. + */ if (!child->d_inode) continue; @@ -133,6 +170,7 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched) spin_unlock(&child->d_lock); } spin_unlock(&alias->d_lock); + dput(last_ref); dput(alias); }
With very large d_subdirs lists, iteration can take a long time. Since iteration needs to hold alias->d_lock, this can trigger soft lockups. It would be best to make this iteration sleepable. We can drop alias->d_lock and sleep, by taking a reference to the current child. However, we need to be careful, since it's possible for the child's list pointers to be modified once we drop alias->d_lock. The following are the only cases where the list pointers are modified: 1. dentry_unlist() in fs/dcache.c This is a helper of dentry_kill(). This function is quite careful to check the reference count of the dentry once it has taken the requisite locks, and bail out if a new reference was taken. So we can be safe that, assuming we found the dentry and took a reference before dropping alias->d_lock, any concurrent dentry_kill() should bail out and leave our list pointers untouched. 2. __d_move() in fs/dcache.c If the child was moved to a new parent, then we can detect this by testing d_parent and retrying the iteration. 3. Initialization code in d_alloc() family We are safe from this code, since we cannot encounter a dentry until it has been initialized. 4. Cursor code in fs/libfs.c for dcache_readdir() Dentries with DCACHE_DENTRY_CURSOR should be skipped before sleeping, since we could awaken to find they have skipped over a portion of the child list. Given these considerations, we can carefully write a loop that iterates over d_subdirs and is capable of going to sleep periodically. Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> --- Notes: v4: - I've updated this patch so it should be safe even without the inode locked, by handling cursors and d_move() races. - Moved comments to lower indentation - I didn't keep Amir's R-b since this was a fair bit of change. v3: - removed if statements around dput() v2: - added a check for child->d_parent != alias and retry logic fs/notify/fsnotify.c | 46 ++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 42 insertions(+), 4 deletions(-)